David Kaplan - Bayesian Statistics For The Social Sciences-2024

Bayesian Statistics for the Social Sciences
Methodology in the Social Sciences

David A. Kenny, Founding Editor
Todd D. Little, Series Editor
www.guilford.com/MSS
This series provides applied researchers and students with analysis and research design
books that emphasize the use of methods to answer research questions. Rather than
emphasizing statistical theory, each volume in the series illustrates when a technique should
(and should not) be used and how the output from available software programs should (and
should not) be interpreted. Common pitfalls as well as areas of further development are
clearly articulated.
RECENT VOLUMES
THEORY CONSTRUCTION AND MODEL-BUILDING SKILLS:

A PRACTICAL GUIDE FOR SOCIAL SCIENTISTS, SECOND EDITION
James Jaccard and Jacob Jacoby
LONGITUDINAL STRUCTURAL EQUATION MODELING WITH Mplus:

A LATENT STATE−TRAIT PERSPECTIVE
Christian Geiser
COMPOSITE-BASED STRUCTURAL EQUATION MODELING:

ANALYZING LATENT AND EMERGENT VARIABLES
Jörg Henseler
BAYESIAN STRUCTURAL EQUATION MODELING

Sarah Depaoli
INTRODUCTION TO MEDIATION, MODERATION, AND CONDITIONAL

PROCESS ANALYSIS: A REGRESSION-BASED APPROACH, THIRD EDITION
Andrew F. Hayes
THE THEORY AND PRACTICE OF ITEM RESPONSE THEORY, SECOND EDITION

R. J. de Ayala
APPLIED MISSING DATA ANALYSIS, SECOND EDITION

Craig K. Enders
PRINCIPLES AND PRACTICE OF STRUCTURAL EQUATION MODELING,

FIFTH EDITION
Rex B. Kline
MACHINE LEARNING FOR SOCIAL AND BEHAVIORAL RESEARCH

Ross Jacobucci, Kevin J. Grimm, and Zhiyong Zhang
BAYESIAN STATISTICS FOR THE SOCIAL SCIENCES, SECOND EDITION

David Kaplan
LONGITUDINAL STRUCTURAL EQUATION MODELING, SECOND EDITION

Todd D. Little
Bayesian Statistics
for the Social Sciences
SECOND EDITION
David Kaplan
Series Editor’s Note by Todd D. Little
THE GUILFORD PRESS

New York London
Copyright © 2024 The Guilford Press
A Division of Guilford Publications, Inc.
370 Seventh Avenue, Suite 1200, New York, NY 10001
www.guilford.com
All rights reserved
No part of this book may be reproduced, translated, stored in a retrieval

system, or transmitted, in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without written
permission from the publisher.
Printed in the United States of America
This book is printed on acid-free paper.
Last digit is print number: 9 8 7 6 5 4 3 2 1
Library of Congress Cataloging-in-Publication Data

Names: Kaplan, David, author.
Title: Bayesian statistics for the social sciences / David Kaplan.
Description: Second edition. | New York : The Guilford Press, [2024] |
Series: Methodology in the social sciences | Includes bibliographical
references and index.
Identifiers: LCCN 2023025118 | ISBN 9781462553549 (cloth)
Subjects: LCSH: Social sciences—Statistical methods. | Bayesian
statistical decision theory. | BISAC: SOCIAL SCIENCE / Statistics |
BUSINESS & ECONOMICS / Statistics
Classification: LCC HA29 .K344 2014 | DDC 519.5/42—dc23
LC record available at https://lccn.loc.gov/2023025118
Guilford Press is a registered trademark of Guilford Publications, Inc.
Series Editor’s Note
Anytime I see one of my series authors put together a second edition it’s like
falling in love again because I know two things: It’s a labor of love for the
author and you become even more enriched than the first time around. David
Kaplan’s second edition is simply lovely. As I said in the first edition, Kaplan is
in a very elite class of scholar. He is a methodological innovator who is guid-
ing and changing the way that researchers conduct their research and ana-
lyze their data. He is also a distinguished educational researcher whose work
shapes educational policy and practice. I see David Kaplan’s book as a reflec-
tion of his sophistication as both a researcher and statistician; it shows depth
of understanding that even dedicated quantitative specialists may not have
and, in my view, it will have an enduring impact on research practice. Kaplan’s
research profile and research skills are renowned internationally and his repu-
tation is globally recognized. His profile as a prominent power player in the
field brings instant credibility. As a result, when Kaplan says Bayesian is the
way to go, researchers listen. As with the first edition, his book brings his voice
to you in an engaging and highly serviceable manner.
Why is the Bayesian approach to statistics seeing a resurgence across the
social and behavioral sciences (it’s an approach that has been around for some
time)? One reason for the delay in adopting Bayes is technological. Bayesian
estimation can be computer intensive and, until about a score of years ago, the
computational demands limited the widespread application. Another reason
is that the social and behavioral sciences needed an accessible translation of
Bayes for these fields so that we could understand not only the benefits of
Bayes but also how to apply a Bayesian approach. Kaplan is clear and practical
in his presentation and shares with us his experiences and helpful/pragmatic
recommendations. I think a Bayesian perspective will continue to see wide-
spread usage now that David has updated and expanded upon this indispens-
able resource. In many ways, the zeitgeist for Bayes is still favorable given that
researchers are asking and attempting to answer more complex questions. This
second edition provides researchers with the means to address well the intri-
cate nuances of applying a Bayesian perspective to test intertwined theory-
driven hypotheses.
v
vi Series Editor’s Note
This second edition brings a wealth of new material that builds nicely on
what was already a thorough and formidable foundation. Kaplan uses the R
interface to Stan that provides a fast and stable software environment, which is
great news because the inadequacies of other software environments were an
impediment to adopting a Bayesian approach. Each of the prior chapters has
been expanded with new material. For example, Chapter 1 adds a discussion
of coherence, Dutch book bets, and the calibration of probability assessments.
In Chapter 2, there is an extended discussion of prior distributions, which is at
the heart of Bayesian estimation. Chapter 3 continues with coverage of Jeffreys’
prior and the LKJ prior for correlation matrices. In Chapter 4, the different
algorithms utilized in the Stan software platform are explained, including the
Metropolis-Hastings algorithm, the Hamiltonian Monte Carlo algorithm, and
the No-U-Turn sampler, as well as an updated discussion of convergence diag-
nostics.
Other chapters have new material such as new missing data material on
the problem of model uncertainty in multiple imputation, expanded coverage
of continuous and categorical latent variables, factor analysis, and latent class
analysis, as well as coverage of multinomial, Poisson, and negative binomial
regression. In addition, the important topics of model evaluation and model
comparison are given their own chapter in the second edition. New chapters
on other critical topics have been added—including variable selection and
sparsity, the Bayesian decision theory framework to explain model averag-
ing, and the method of Bayesian stacking as a means of combining predictive
distributions—and a remarkably insightful chapter on Bayesian workflow for
statistical modeling in the social sciences. All lovely additions to the second
edition, which, as in the first edition, was already a treasure trove of all things
Bayesian. As always, enjoy!
Todd D. Little
At My Wit’s End in Montana (the name of my home)
Preface to the Second Edition
Since the publication of the first edition of Bayesian Statistics for the Social
Sciences in 2014, Bayesian statistics is, arguably, still not the norm in the
formal quantitative methods training of social scientists. Typically, the only
introduction that a student might have to Bayesian ideas is a brief overview
of Bayes’ theorem while studying probability in an introductory statistics
class. This is not surprising. First, until relatively recently, it was not feasible
to conduct statistical modeling from a Bayesian perspective owing to its
complexity and lack of available software. Second, Bayesian statistics rep-
resents a powerful alternative to frequentist (conventional) statistics, and,
therefore, can be controversial, especially in the context of null hypothesis
significance testing.1 However, over the last 20 years or so, considerable
progress has been made in the development and application of complex
Bayesian statistical methods, due mostly to developments and availability
of proprietary and open-source statistical software tools. And, although
Bayesian statistics is not quite yet an integral part of the quantitative train-
ing of social scientists, there has been increasing interest in the application
of Bayesian methods, and it is not unreasonable to say that in terms of
theoretical developments and substantive applications, Bayesian statistics
has arrived.
Because of extensive developments in Bayesian theory and computa-
tion since the publication of the first edition of this book, I felt there was a
pressing need for a thorough update of the material to reflect new devel-
opments in Bayesian methodology and software. The basic foundations
of Bayesian statistics remain more or less the same, but this second edi-
tion encompasses many new extensions and so the order of the chapters
has changed in some instances, with some chapters heavily revised, some
chapters updated, and some chapters containing all new material.
1
We will use the term frequentist to describe the paradigm of statistics commonly used today, and
which represents the counterpart to the Bayesian paradigm of statistics. Historically, however,
Bayesian statistics predates frequentist statistics by about 150 years.
vii
viii Preface to the Second Edition
• Chapter 1 now contains new material on coherence, Dutch book bets,

and the calibration of probability assessments.
• Chapter 2 now includes an extended discussion of prior distribu-

tions, including weakly informative priors and reference priors. As
an aside, a discussion of Cromwell’s rule has also been added.
• Chapter 3 now includes a description of Jeffreys’ prior associated

with each relevant probability distribution. I also add a description
of the LKJ prior for correlation matrices (Lewandowski, Kurowicka,
& Joe, 2009).
• Chapter 4 adds new material on the Metropolis-Hastings algorithm.

New material is also added on Hamiltonian Monte Carlo and the No-
U-Turn sampler underlying the Stan software program which will be
used throughout this book. An updated discussion of convergence
diagnostics is presented. Chapter 4 also includes material on summa-
rizing the posterior distribution and provides a fully worked-through
example introducing the Stan software package. I also provide a brief
discussion of variational Bayes—an alternative algorithm that does not
require Markov chain Monte Carlo sampling. An example of varia-
tional Bayes is reserved for Chapter 11.
• Chapter 5 (Chapter 6 in the first edition) retains material on Bayes-

ian linear and generalized linear models, but expands the material
from the first edition to now cover multinomial, Poisson, and nega-
tive binomial regression.
• Chapter 6 (Chapter 5 in the first edition) covers model evaluation

and model comparison as a separate chapter and now adds a critique
of the Bayesian information criterion. This chapter also includes in-
depth presentations of the deviance information criterion, the widely
applicable information criterion, and the leave-one-out information
criterion. This chapter serves as a bridge to describe more advanced
modeling methodologies in Chapters 7 and 8.
• Chapter 7 (Chapter 8 in the first edition) retains the material on

Bayesian multilevel modeling from the first edition with the excep-
tion that all examples now use Stan.
• Chapter 8 (Chapter 9 in the first edition) has been revamped in light

of the publication of Sarah Depaoli’s (2021) excellent text on Bayes-
ian structural equation modeling. For this chapter, I have decided to
focus attention on two general latent variable models and refer the
Preface to the Second Edition ix
reader to Depaoli’s text for a more detailed treatment of Bayesian SEM

which focuses primarily on estimating a variety of latent variable
models using Mplus (Muthén & Muthén, 1998–2017), BUGS (Lunn,
Spiegelhalter, Thomas, & Best, 2009), and blavaan (Merkle, Fitzsim-
mons, Uanhoro, & Goodrich, 2021). Note that blavaan is a flexible
interface to Stan for estimating Bayesian structural equation models.
For this edition, I will focus on models for continuous and categorical
latent variables, specifically factor analysis, and latent class analysis,
respectively. In both cases, I use Stan and in the latent class example,
I will discuss the so-called label-switching problem and a possible solu-
tion using variational Bayes.
• Chapter 9 (Chapter 7 in the first edition) addresses missing data with

a focus on Bayesian perspectives. This chapter remains mostly the
same as the first edition and serves simply as a general overview
of missing data with a focus on Bayesian methods. The reader is
referred to Enders’ and Little and Rubin’s seminal texts on missing
data for more detail (Little & Rubin, 2020; Enders, 2022). This chapter
does include new material on the problem of model uncertainty in
multiple imputation.
• Chapter 10 provides new material on Bayesian variable selection and

sparsity, focusing on the ridge prior, the lasso prior, the horseshoe
prior, and the regularized horseshoe prior, all of which induce spar-
sity in the model. A worked-through example using Stan is provided.
• Chapter 11 is now dedicated to model averaging reflecting both his-

torical and recent work in this area. I frame this discussion in the
context of Bayesian decision theory. I provide a discussion of model
averaging to include the general topic of – frameworks and now
discuss the method of Bayesian stacking as a means of combining pre-
dictive distributions. This chapter is primarily drawn from Kaplan
(2021).
• Chapter 12 closes the book by providing a Bayesian workflow for

statistical modeling in the social sciences and summarizes the Bayes-
ian advantage over conventional frequentist practice of statistics in
the social sciences.
Data Sources
As in the first edition, the examples provided will primarily utilize large-
scale assessment data, and in particular data from the OECD Program for
x Preface to the Second Edition
International Student Assessment (PISA; OECD, 2019), the Early Childhood

Longitudinal Study (ECLS; NCES, 2001), and the Progress in International
Reading Literacy Study (PIRLS; Mullis & Martin, 2015). The datasets used
for the examples are readily available on the Guilford website associated
with the book (www.guilford.com/kaplan-materials).
Program for International Student Assessment (PISA)

Launched in 2000 by the Organization for Economic Cooperation and
Development (OECD), PISA is a triennial international survey that aims to
evaluate education systems worldwide by testing the skills and knowledge
of 15-year-old students. In 2018, 600,000 students, statistically representa-
tive of 32 million 15-year-old students in 79 countries and economies, took
an internationally agreed-upon 2-hour test.2 Students were assessed in sci-
ence, mathematics, reading, collaborative problem solving, and financial
literacy. PISA is arguably the most important policy-relevant international
survey that is currently operating (OECD, 2002).
Following the overview in Kaplan and Kuger (2016), the sampling
framework for PISA follows a two-stage stratified sample design. Each
country/economy provides a list of all “PISA-eligible” schools, and this list
constitutes the sampling frame. Schools are then sampled from this frame
with sampling probabilities that are proportional to the size of the school,
with the size being a function of the estimated number of PISA-eligible
students in the school. The second stage of the design requires sampling
students within the sampled schools. A target cluster size of 35 students
within schools was desired, though for some countries, this target cluster
size was negotiable. The method of assessment for PISA follows closely the
spiraling design and plausible value methodologies originally developed
for NAEP (see, e.g., OECD, 2017).
In addition to these so-called cognitive outcomes, policymakers and
researchers alike have begun to focus increasing attention on the non-
academic contextual aspects of schooling. Context questionnaires pro-
vide important variables for models predicting cognitive outcomes and
these variables have become important outcomes in their own right,
and are often referred to as non-cognitive outcomes (see, e.g., Heckman &
Kautz, 2012). PISA also assesses these non-cognitive outcomes via a one-
half-hour internationally agreed-upon context questionnaire (see Kuger,
Klieme, Jude, & Kaplan, 2016). Data from PISA are freely available at
https://www.oecd.org/pisa/ and can be accessed using their PISA
Data Explorer.
2
PISA 2022 will have 85 participating countries, but the data were not released as of the writing
of this book.
Preface to the Second Edition xi
Early Childhood Longitudinal Study:

Kindergarten Class of 1998-99
In addition to PISA, this book uses data from the Early Childhood Lon-
gitudinal Study: Kindergarten Class of 1998-99 (NCES, 2001). The ECLSK:
1998-99 implemented a multistage probability sample design to obtain a
nationally representative sample of children attending kindergarten in
1998–99. The primary sampling units at the base year of data collection
(Fall Kindergarten) were geographic areas consisting of counties or groups
of counties. The second-stage units were schools within sampled PSUs. The
third- and final-stage units were children within schools.
For ECLS-K:1998-99, detailed information about children’s kindergar-
ten experiences, as well as transition into the formal schooling from Grades
1 through 8, was collected in the fall and the spring of kindergarten (1998-
99), the fall and spring of 1st grade (1999-2000), the spring of 3rd grade
(2002), the spring of 5th grade (2004), and the spring of 8th grade (2007),
seven time points in total. For more detail regarding the sampling design
for ECLS-K:1998-1999, please see Tourangeau, Nord, Lê, Sorongon, and
Najarian (2009).
Children, their families, teachers, schools, and care providers pro-
vided information on children’s cognitive, social, emotional, and physical
development. Information was also collected on children’s home environ-
ment, home educational activities, school environment, classroom envi-
ronment, classroom curriculum, teacher qualifications, and before- and
after-school care. Data from ECLS-K:1998-1999 are freely available at
https://nces.ed.gov/ecls/.
Progress in International Reading Literacy Study (PIRLS)

The Progress in International Reading Literacy Study (PIRLS) is sponsored
by the International Association for the Evaluation of Educational Achieve-
ment (IEA) and is an international assessment and research project designed
to measure reading achievement at the fourth-grade level, as well as school
and teacher practices related to instruction. The international target popu-
lation for the PIRLS assessment is defined as the grade representing 4 years
of formal schooling, counting from the first year of primary or elemen-
tary schooling. PIRLS employs a two-stage random sample design, with a
sample of schools drawn as a first stage and one or more intact classes of
students selected from each of the sampled schools as a second stage. Intact
classes of students are sampled rather than individual students from across
the grade level or of a certain age because PIRLS pays particular attention
to students’ curricular and instructional experiences, and these typically
are organized on a classroom basis. Sampling intact classes also has the
xii Preface to the Second Edition
operational advantage of less disruption to the school’s day-to-day business

than individual student sampling.
Fourth-grade students complete a reading assessment and question-
naire that addresses students’ attitudes toward reading and their read-
ing habits. In addition, questionnaires are given to students’ teachers and
school principals to gather information about students’ school experiences
in developing reading literacy. Since 2001, PIRLS has been administered
every 5 years, with the United States participating in all past assessments.
For the example used in this book, I will draw data from PIRLS 2016. The
PIRLS databases are freely available at https://timssandpirls.bc.edu/
pirls2016/index.html.
Software
For this edition, I will demonstrate Bayesian concepts and provide applica-
tions using primarily the Stan (Stan Development Team, 2021a) software
program and its R interface RStan (Stan Development Team, 2020). Stan is
a high-level probabilistic programming language written in C++. Stan is
named after Stanislaw Ulam, one of the major developers of Monte Carlo
methods. With Stan, the user can specify log density functions, and, of
relevance to this book obtain fully Bayesian inference through Hamilto-
nian Monte Carlo and the No-U-Turn algorithm (discussed in Chapter 4).
In some cases, other interfaces for Stan, such as rstanarm (Goodrich, Gabry,
Ali, & Brilleman, 2022) and brms (Bürkner, 2021), will be used. These pro-
grams also call in other programs, and for cross-validation, we will be
using the loo program (Vehtari, Gabry, Yao, & Gelman, 2019). For peda-
gogical purposes, I have written the code to be as explicit as possible. The
Stan programming language is quite flexible, and many different ways
of writing the same code are possible. However, it should be emphasized
that this book is not a manual on Bayesian inference with Stan. For more
information on the intricacies of the Stan programming language, see Stan
Development Team (2021a).
Finally, all code will be integrated into the text and fully annotated. In
addition, all software code in the form of R files and data can be found on
the Guilford companion website. Note that due to the probabilistic nature
of Bayesian statistical computing, a reanalysis of the examples may not
yield precisely the same numerical results as found in the book.
Preface to the Second Edition xiii
Philosophical Stance
In the previous edition of this book, I wrote a full chapter on various
philosophical views underlying Bayesian statistics, including subjective
versus objective Bayesian inference, as well as a position I took arguing
for an evidence-based view of subjective Bayesian statistics. As a good
Bayesian, I have updated my views since that time, and in the interest of
space, I would rather add more practical material and concentrate less on
philosophical matters. However, whether one likes it or not, the applica-
tion of statistical methods, Bayesian or otherwise, betrays a philosophical
stance, and it may be useful to know the philosophical stance that encom-
passes this second edition. In particular, my position regarding an evidence-
based view of Bayesian modeling remains more or less unchanged, but my
updated view is also consistent with that of Gelman and Shalizi (2013) and
summarized in Haig (2018), namely, a neo-Popperian view that Bayesian
statistical inference is fundamentally (or should be) deductive in nature
and that the ”usual story” of Bayesian inference characterized by updating
knowledge inductively from priors to posteriors is probably a fiction, at
least with respect to typical statistical practice (Gelman & Shalizi, 2013, p.
8).
To be clear, the philosophical position that I take in this book can be
summarized by five general points. First, statistical modeling takes place
in a state of pervasive uncertainty. This uncertainty is inherently epistemic
in that it is our knowledge about parameters and models that is imperfect.
Attempting to address this uncertainty is valuable insofar as it impacts
one’s findings whether it is directly addressed or not. Second, parameters
and models, by their very definition, are unknown quantities and the only
language we have for expressing our uncertainty about parameters and
models is probability. Third, prior distributions encode our uncertainty by
quantifying our current knowledge and assumptions about the parameters
and models of interest through the use of continuous or categorical prob-
ability distributions. Fourth, our current knowledge and assumptions are
propagated via Bayes’ theorem to the posterior distribution which provides
a rich way to describe results and to test models for violations through the
use of posterior predictive checking via the posterior predictive distribu-
tion. Fifth, posterior predictive checking provides a way to probe deficien-
cies in a model, both globally and locally, and while I may hold a more
sanguine view of model averaging than Gelman and Shalizi (2013), I con-
cur that posterior predictive checking is an essential part of the Bayesian
workflow (Gelman et al., 2020) for both explanatory and predictive uses of
models.
xiv Preface to the Second Edition
Target Audience
Positioning a book for a particular audience is always a tricky process. The
goal is to first decide on the type of reader one hopes to attract, and then to
continuously keep that reader in mind when writing the book. For this edi-
tion, the readers I have in mind are advanced graduate students or research-
ers in the social sciences (e.g., education, psychology, and sociology) who
are either focusing on the development of quantitative methods in those
areas or who are interested in using quantitative methods to advance sub-
stantive theory in those areas. Such individuals would be expected to have
good foundational knowledge of the theory and application of regression
analysis in the social sciences and have had some exposure to mathematical
statistics and calculus. It would also be expected that such readers would
have had some exposure to methodologies that are now widely applied to
social science data, in particular multilevel models and latent variable mod-
els. Familiarity with R would also be expected, but it is not assumed that
the reader would have knowledge of Stan. It is not expected that readers
would have been exposed to Bayesian statistics, but at the same time, this
is not an introductory book. Nevertheless, given the presumed background
knowledge, the fundamental principles of Bayesian statistical theory and
practice are self-contained in this book.
Acknowledgments
I would like to thank the many individuals in the Stan community
(https:discourse.mc-Stan.org) who patiently and kindly answered
many questions that I had regarding the implementation of Stan.
I would also like to thank the reviewers who were initially anonymous:
Rens van de Schoot, Department of Methodology and Statistics, Utrecht
University; Irini Moustaki, Department of Statistics, London School of Eco-
nomics and Political Science; Insu Paek, Senior Scientist, Human Resources
Research Organization; and David Rindskopf, Departments of Educational
Psychology and Psychology, The Graduate Center, The City University of
New York. All of these scholars’ comments have greatly improved the qual-
ity and accessibility of the book. Of course, any errors of commission or
omission are strictly my responsibility.
I am indebted to my editor C. Deborah Laughton. I say, “my editor”
because C. Deborah not only edited this edition and the previous edi-
tion, but also the first edition of my book on structural equation modeling
(Kaplan, 2000) and my handbook on quantitative methods in the social sci-
ences (Kaplan, 2004) when she was editor at another publishing house. My
loyalty to C. Deborah stems from my first-hand knowledge of her extraor-
Preface to the Second Edition xv
dinary professionalism, but also from a deep-seated affection for her as my

close friend.
Since the last edition of the book, there have been a few new additions
to my family, including my daughter Hannah’s husband, Robert, and their
son, Ari, and daughter, June. My first grandchild, Sam, to whom the first
edition was dedicated, now has a brother, Jonah. I dedicate this book to my
entire family, Allison, Rebekah, Hannah, Joshua, Robert, Sam, Jonah, Ari,
and June, but especially to my wife, Allison, and my daughters, Rebekah
and Hannah, who have been the inspiration for everything I do and every-
thing I had ever hoped to become.
Contents
PART I. FOUNDATIONS
1 • PROBABILITY CONCEPTS AND BAYES’ THEOREM 3
1.1 Relevant Probability Axioms / 3
1.1.1 The Kolmogorov Axioms of Probability / 3
1.1.2 The Reńyi Axioms of Probability / 4
1.2 Frequentist Probability / 5
1.3 Epistemic Probability / 6
1.3.1 Coherence and the Dutch Book / 6
1.3.2 Calibrating Epistemic Probability Assessments / 7
1.4 Bayes’ Theorem / 9
1.4.1 The Monty Hall Problem / 10
1.5 Summary / 11
2 • STATISTICAL ELEMENTS OF BAYES’ THEOREM 13

2.1 Bayes’ Theorem Revisited / 13
2.2 Hierarchical Models and Pooling / 15
2.3 The Assumption of Exchangeability / 16
2.4 The Prior Distribution / 18
2.4.1 Non-Informative Priors / 18
2.4.2 Jeffreys’ Prior / 19
2.4.3 Weakly Informative Priors / 20
2.4.4 Informative Priors / 21
2.4.5 An Aside: Cromwell’s Rule / 22
2.5 Likelihood / 23
2.5.1 The Law of Likelihood / 23
2.6 The Posterior Distribution / 25
xvii
xviii Contents
2.7 The Bayesian Central Limit Theorem

and Bayesian Shrinkage / 27
2.8 Summary / 29
3 • COMMON PROBABILITY DISTRIBUTIONS AND THEIR PRIORS 31

3.1 The Gaussian Distribution / 32
3.1.1 Mean Unknown, Variance Known: The Gaussian Prior / 32
3.1.2 The Uniform Distribution as a Non-Informative Prior / 33
3.1.3 Mean Known, Variance Unknown:
The Inverse-Gamma Prior / 35
3.1.4 Mean Known, Variance Unknown: The Half-Cauchy Prior / 36
3.1.5 Jeffreys’ Prior for the Gaussian Distribution / 37
3.2 The Poisson Distribution / 38
3.2.1 The Gamma Prior / 38
3.2.2 Jeffreys’ Prior for the Poisson Distribution / 39
3.3 The Binomial Distribution / 40
3.3.1 The Beta Prior / 40
3.3.2 Jeffreys’ Prior for the Binomial Distribution / 41
3.4 The Multinomial Distribution / 42
3.4.1 The Dirichlet Prior / 43
3.4.2 Jeffreys’ Prior for the Multinomial Distribution / 43
3.5 The Inverse-Wishart Distribution / 44
3.6 The LKJ Prior for Correlation Matrices / 45
3.7 Summary / 46
4 • OBTAINING AND SUMMARIZING 47

THE POSTERIOR DISTRIBUTION
4.1 Basic Ideas of Markov Chain Monte Carlo Sampling / 47
4.2 The Random Walk Metropolis-Hastings Algorithm / 49
4.3 The Gibbs Sampler / 50
4.4 Hamiltonian Monte Carlo / 51
4.4.1 No-U-Turn Sampler (NUTS) / 52
4.5 Convergence Diagnostics / 53
4.5.1 Trace Plots / 53
4.5.2 Posterior Density Plots / 53
4.5.3 Autocorrelation Plots / 54
4.5.4 Effective Sample Size / 54
4.5.5 Potential Scale Reduction Factor / 55
4.5.6 Possible Error Messages When Using HMC/NUTS / 55
4.6 Summarizing the Posterior Distribution / 56
4.6.1 Point Estimates of the Posterior Distribution / 56
4.6.2 Interval Summaries of the Posterior Distribution / 57
4.7 Introduction to Stan and Example / 60
4.8 An Alternative Algorithm: Variational Bayes / 66
4.8.1 Evidence Lower Bound (ELBO) / 67
4.8.2 Variational Bayes Diagnostics / 68
4.9 Summary / 70
Contents xix
PART II. BAYESIAN MODEL BUILDING

5 • BAYESIAN LINEAR AND GENERALIZED MODELS 73
5.1 The Bayesian Linear Regression Model / 73
5.1.1 Non-Informative Priors in the Linear Regression Model / 74
5.2 Bayesian Generalized Linear Models / 85
5.2.1 The Link Function / 86
5.3 Bayesian Logistic Regression / 87
5.4 Bayesian Multinomial Regression / 91
5.5 Bayesian Poisson Regression / 94
5.6 Bayesian Negative Binomial Regression / 98
5.7 Summary / 99
6 • MODEL EVALUATION AND COMPARISON 101

6.1 The Classical Approach to Hypothesis Testing and Its
Limitations / 101
6.2 Model Assessment / 103
6.2.1 Prior Predictive Checking / 104
6.2.2 Posterior Predictive Checking / 107
6.3 Model Comparison / 112
6.3.1 Bayes Factors / 112
6.3.2 Criticisms of Bayes Factors and the BIC / 116
6.3.3 The Deviance Information Criterion (DIC) / 117
6.3.4 Widely Applicable Information Criterion (WAIC) / 118
6.3.5 Leave-One-Out Cross-Validation / 119
6.3.6 A Comparison of the WAIC and LOO / 121
6.4 Summary / 123
7 • BAYESIAN MULTILEVEL MODELING 125

7.1 Revisiting Exchangeability / 126
7.2 Bayesian Random Effects Analysis of Variance / 127
7.3 Bayesian Intercepts as Outcomes Model / 135
7.4 Bayesian Intercepts and Slopes as Outcomes Model / 137
7.5 Summary / 141
8 • BAYESIAN LATENT VARIABLE MODELING 143

8.1 Bayesian Estimation for the CFA / 143
8.1.1 Priors for CFA Model Parameters / 144
8.2 Bayesian Latent Class Analysis / 150
8.2.1 The Problem of Label-Switching and a Possible Solution / 154
8.2.2 Comparison of VB to the EM Algorithm / 158
8.3 Summary / 160
xx Contents
PART III. ADVANCED TOPICS AND METHODS

9 • MISSING DATA FROM A BAYESIAN PERSPECTIVE 165
9.1 A Nomenclature for Missing Data / 165
9.2 Ad Hoc Deletion Methods for Handling Missing Data / 166
9.2.1 Listwise Deletion / 167
9.2.2 Pairwise Deletion / 167
9.3 Single Imputation Methods / 167
9.3.1 Mean Imputation / 168
9.3.2 Regression Imputation / 168
9.3.3 Stochastic Regression Imputation / 169
9.3.4 Hot Deck Imputation / 169
9.3.5 Predictive Mean Matching / 170
9.4 Bayesian Methods of Multiple Imputation / 170
9.4.1 Data Augmentation / 171
9.4.2 Chained Equations / 172
9.4.3 EM Bootstrap: A Hybrid Bayesian/Frequentist Method / 173
9.4.4 Bayesian Bootstrap Predictive Mean Matching / 175
9.4.5 Accounting for Imputation Model Uncertainty / 176
9.5 Summary / 177
10 • BAYESIAN VARIABLE SELECTION AND SPARSITY 179

10.1 Introduction / 179
10.2 The Ridge Prior / 181
10.3 The Lasso Prior / 183
10.4 The Horseshoe Prior / 185
10.5 Regularized Horseshoe Prior / 187
10.6 Comparison of Regularization Methods / 189
10.6.1 An Aside: The Spike-and-Slab Prior / 191
10.7 Summary / 191
11 • MODEL UNCERTAINTY 193

11.1 Introduction / 193
11.2 Elements of Predictive Modeling / 194
11.2.1 Fixing Notation and Concepts / 195
11.2.2 Utility Functions for Evaluating Predictions / 195
11.3 Bayesian Model Averaging / 196
11.3.1 Statistical Specification of BMA / 197
11.3.2 Computational Considerations / 197
11.3.3 Markov Chain Monte Carlo Model Composition / 199
11.3.4 Parameter and Model Priors / 200
11.3.5 Evaluating BMA Results: Revisiting Scoring Rules / 201
11.4 True Models, Belief Models, and M-Frameworks / 210
11.4.1 Model Averaging in the M-Closed Framework / 210
11.4.2 Model Averaging in the M-Complete Framework / 211
11.4.3 Model Averaging in the M-Open Framework / 211
Contents xxi
11.5 Bayesian Stacking / 212

11.5.1 Choice of Stacking Weights / 212
11.6 Summary / 216
12 • CLOSING THOUGHTS 217

12.1 A Bayesian Workflow for the Social Sciences / 217
12.2 Summarizing the Bayesian Advantage / 220
12.2.1 Coherence / 220
12.2.2 Conditioning on Observed Data / 220
12.2.3 Quantifying Evidence / 221
12.2.4 Validity / 221
12.2.5 Flexibility in Handling Complex Data Structures / 222
12.2.6 Formally Quantifying Uncertainty / 222
LIST OF ABBREVIATIONS AND ACRONYMS 223
REFERENCES 225
AUTHOR INDEX 237
SUBJECT INDEX 241
ABOUT THE AUTHOR 249
The companion website

(www.guilford.com/kaplan-materials) provides the data
and R software code files for the book’s examples.
Part I
FOUNDATIONS
1
Probability Concepts and Bayes’
Theorem
In this chapter, we consider fundamental issues in probability that underlie

both frequentist and Bayesian statistical inference. We first discuss the axioms
of probability that underlie both frequentist and Bayesian concepts of probabil-
ity. Next, we discuss the frequentist notion of probability as long-run frequency.
We then show that long-run frequency is not the only way to conceive of prob-
ability, and that probability can be considered as epistemic belief. A key concept
derived from epistemic probability and one that requires adherence to the axioms
of probability is coherence, and we describe this concept in terms of betting systems,
particularly, the so-called Dutch book. The concepts of epistemic probability and
coherence lead to our discussion of Bayes’ theorem, and we then show how these
concepts relate by working through the famous Monty Hall problem.
1.1 Relevant Probability Axioms

Most students in the social sciences were introduced to the axioms of probability
by studying the properties of the coin toss or the dice roll. These studies address
questions such as (1) What is the probability that the flip of a fair coin will return
heads? and (2) What is the probability that the roll of two fair die will return a
value of 7? To answer these questions requires enumerating the possible outcomes
and then counting the number of times the event could occur. The probabilities
of interest are obtained by dividing the number of times the event occurred by
the number of possible outcomes, that is, the relative frequency of events. Before
introducing Bayes’ theorem, it is useful to review the axioms of probability that
have formed the basis of frequentist statistics. These axioms of can be attributed
primarily to the work of Kolmogorov (1956).
1.1.1 The Kolmogorov Axioms of Probability

Consider two events denoted as A and B. To keep the things simple, consider these
both to be the flip of a fair coin. Then, using the standard notation for the union ∪
and intersection ∩ of sets, the following are the axioms of probability, namely
3
4 Bayesian Statistics for the Social Sciences
1. p(A) ≥ 0.
2. The probability of the sample space is 1.0.
3. Countable additivity: If A and B are mutually exclusive, then p(A or B) ≡
p(A ∪ B) = p(A) + p(B). Or, more generally,
 

[ ∞ 
 X∞
=
 
p A j p(A j ) (1.1)
 


 j=1   j=1
A number of other axioms of probability can be derived from these three basic
axioms. Nevertheless, these three axioms can be used to deal with the relatively
easy case of the coin flipping example mentioned above. For example, if we toss a
fair coin an infinite number of times, we expect it to land heads 50% of the time.1
This probability, and others like it, satisfy the first axiom that probabilities must be
greater than or equal to zero. The second axiom states that over an infinite number
of coin flips, the sum of all possible outcomes (in this case, heads and tails) is equal
to one. Indeed, the number of possible outcomes represents the sample space, and
the sum of probabilities over the sample space is one. Finally, with regard to
the third axiom, assuming that one outcome precludes the occurrence of another
outcome (e.g., the coin landing heads precludes the occurrence of the coin landing
tails), then the probability of the joint event p(A ∪ B) is the sum of the separate
probabilities, that is, p(A ∪ B) = p(A) + p(B).
We may wish to add to these three axioms a fourth axiom that deals with the
notion of independent events. If two events are independent, then the occurrence of
one event does not influence the probability of another event. For example, with
two coins A and B, the probability of A resulting in “heads,” does not influence the
result of a flip of B. Formally, we define independence as p(A and B) ≡ p(A ∩ B) =
p(A)p(B). The notion that independent events allow the individual probabilities to
be simply their product plays a critical role in the derivation Bayes’ theorem.
1.1.2 The Rényi Axioms of Probability

Note that the Kolmogorov axioms do not take into account how probabilities
might be affected by conditioning on the dependency of events. An extension of
the Kolmogorov system that accounts for conditioning was put forth by Alfred
Rényi. As an example, consider the case of observing the presence or absence of
lung cancer (C) and the behavior of smoking or not smoking (S). We may argue on
the basis of prior experience and medical research that C is not independent of S,
that is, the joint probability p(C ∩ S) , p(C)p(S). To handle this problem, we define
the conditional probability of C “given” S (i.e., p(C | S)) as
p(C ∩ S)
p(C | S) = (1.2)
p(S)
1
Interestingly, this expectation is not based on having actually tossed the coin an infinite
number times. Rather, this expectation is a prior belief, and arguably, this is one example
of how Bayesian thinking is automatically embedded in frequentist logic.
Probability Concepts and Bayes’ Theorem 5
The denominator on the right-hand side of Equation (1.2) shows that the sample
space associated with p(C ∩ S) is reduced by knowing S. Notice that if C and S
were independent, then
p(C ∩ S)
p(C | S) =
p(S)
p(C)p(S)
=
p(S)
= p(C) (1.3)
which states that knowing S tells us nothing about C.

Following Press (2003), Rényi’s axioms can be defined with respect to our lung
cancer example. Let G stand for a genetic history of cancer. Then,
1. For any events, C, S, we have P(C | S) ≥ 0 and p(C | C) = 1.

2. For disjoint events Cj and some event S
 

 [∞ 
 ∞
 X
=
 
p C | S p(C j | S)

 j 


 j=1 
 j=1
where j indexes the collection of disjoint events.

3. For every collection of events (C, S, G), with S a subset of G (i.e., S ⊆ G), and
0 < p(S | G), we have
p(C ∩ S | G)
p(C | S) =
p(S | G)
Rényi’s third axiom allows one to obtain the conditional probability of C given
S, while conditioning on yet a third variable G, for obtaining the conditional
probability of lung cancer given observed smoking, while in turn conditioning on
whether there is a genetic history of cancer.
1.2 Frequentist Probability

Underlying frequentist statistics is the idea of the long-run frequency. An example
of probability as long-run frequency is the dice roll. In this case, the number of
possible outcomes of one roll of a fair die is 6. If we wish to calculate the probability
of rolling a 2, then we simply obtain the ratio of the number of favorable outcomes
(here there is only 1 favorable outcome), to the total possible number of outcomes
(here 6). Thus, the frequentist probability is 1/6 = 0.17. However, the frequentist
probability of rolling a 2 is purely theoretical insofar as the die might not be truly
fair or the conditions of the toss might vary from trial to trial. Thus, the frequentist
probability of 0.17 relates to the relative frequency of rolling a 2 in a very large
(indeed infinite) and perfectly replicable number of dice rolls.
This purely theoretical nature of long-run frequency nevertheless plays a cru-
cial role in frequentist statistical practice. Indeed, the entire structure of Neyman –
Pearson hypothesis (Neyman & Pearson, 1928) testing and Fisherian statistics (e.g.,
Fisher, 1941/1925) is based on the conception of probability as long-run frequency.
Our conclusions regarding null and alternative hypotheses presuppose the idea
that we could conduct the same study an infinite number of times under perfectly
reproducible conditions. Moreover, our interpretation of confidence intervals also
assumes a fixed parameter with the confidence intervals varying over an infinitely
large number of identical studies.
1.3 Epistemic Probability

But there is another view of probability, where it is considered as subjective belief.
Specifically, a modification of the Kolmogorov axioms was advanced by de Finetti
(1974) who suggested replacing the (infinite) countable additivity axiom with finite
additivity and also suggested treating p(·) as a subjective probability.2
It may be of historical interest to note that de Finetti was a radical subjectivist.
Indeed, the opening line of his two-volume treatise on probability states that
’‘Probability does not exist,” by which he meant that probability does not have
an objective status, but rather represents the quantification of our experience of
uncertainty. Furthermore, the notion of probability as something external to the
individual, possessing an objective status out there is superstition - no different
from postulating the existence of “... Cosmic Ether, Absolute Space and Time, ... ,
or Fairies and Witches...”(p. x). Thus, for de Finetti (1974)
... only subjective probabilities exist – i.e. the degree of belief in the
occurrence of an event attributed by a given person at a given instant
with a given set of information. (pp. 4–5, italics de Finetti’s)
As pointed out by Press (2003), the first mention of probability as a degree of
subjective belief was made by Ramsey (1926), and it is this notion of probability as
subjective belief that led to considerable resistance to Bayesian ideas. A detailed
treatment of the axioms of subjective probability can be found in Fishburn (1986).
The use of the term subjective is perhaps unfortunate, insofar as it promotes
the idea of fuzzy, unscientific, reasoning. Lindley (2007) relates the same concern
and prefers the term personal probability to subjective probability. Howson and
Urbach (2006) adopt the less controversial term epistemic probability to reflect an
individual’s greater or lesser degree of uncertainty about the problem at hand. Put
another way, epistemic probability concerns our uncertainty about unknowns. This
is the term that will be used throughout the book.
1.3.1 Coherence and the Dutch Book

How might we operationalize the notion of epistemic probability? The canonical
example used to illustrate epistemic probability is betting behavior. If an
individual enters into a bet that does not satisfy the axioms of probability, then
they are not being coherent. Incoherent beliefs can lead a bettor to enter into a
2
A much more detailed set of axioms for subjective probability was advanced by Savage
(1954).
sequence of bets in which they are guaranteed to lose regardless of the outcome.
In other words, the epistemic beliefs of the bettor do not cohere with the axioms
of probability. This type of bet is referred as a Dutch book or lock.
Example 1.1: An Example of a Dutch Book
Table 1.1 below shows a sequence of bets that one of the following teams goes
to the World Series.
TABLE 1.1. Sequence of bets leading to a Dutch book
Team Odds Implied probability Bet price Payout

Chicago Cubs Even 1
1+1
= .50 $100 $100 + $100
Boston Red Sox 3 to 1 against 1
3+1
= .25 $50 $50 + $150
Los Angeles Dodgers 4 to 1 against 1
4+1
= .20 $40 $40 + $160
New York Yankees 9 to 1 against 1
9+1
= .10 $20 $20 + $180
Totals 1.05 $210 $200
Consider the first bet in the sequence, namely that the odds of the Chicago Cubs
going to the World series is even. This is the same as saying that the probability
implied by the odds is 0.50. The bookie sets the bet price at $100. If the Cubs
do go to the World Series, then the bettor gets back the $100 plus the bet price.
However, this is a sequence of bets that also includes the Red Sox, Dodgers, and
Yankees. Taken together, we see that the implied probabilities sum to greater than
1.0, which is a clear violation of Kolmogorov’s Axiom 2. As a result, the bookie
will pay out $200 regardless of who goes to the World Series while the bettor has
paid $210 for the bet, a net loss of $10.
1.3.2 Calibrating Epistemic Probability Assessments

It is perhaps not debatable that if an individual is making a series of epistemic
probability assessments, then it is rational for that individual to utilize feedback
to improve their judgment. Such feedback is referred to as calibration. As a simple
example, discussed in Press (2003), consider the problem of weather forecasting.
Each and every day the forecaster (probability assessor) states, let’s say, the prob-
ability of precipitation. The forecast could be the result of an ensemble of weather
forecasting models, but in any event, a forecast is stated. This forecast is then
compared to the actual event for that day, and over the long run, this forecast
should improve. But how do we quantify the difference between the forecast and
the actual event in a fashion that provides useful feedback to the forecaster? To
provide useful feedback, we use scoring rules.
Scoring rules provide a measure of the accuracy of epistemic probabilistic
assessments, and a probability assessment can be said to be well-calibrated if the
assigned probability of the outcome matches the actual proportion of times that
the outcome occurred (Dawid, 1982).
A scoring rule is a utility function (Gneiting & Raftery, 2007), and the goal
of the assessor is to be honest and provide a forecast that will maximize his/her
utility. The idea of scoring rules is quite general, but one can consider scoring rules
from a subjectivist Bayesian perspective. Here, Winkler (1996) quotes de Finetti
(1962, p. 359)
The scoring rule is constructed according to the basic idea that the
resulting device should oblige each participant to express his true
feelings, because any departure from his own personal probability
results in a diminution of his own average score as he sees it.
Because scoring rules only require the stated probabilities and realized out-
comes, they can be developed for ex-post or ex-ante probability evaluations. Ex-post
probability assessments utilize the existing historical probability assessments to
gauge accuracy whereas ex-ante probability assessments are true forecasts into the
future before the realization of the outcome. However, as suggested by Winkler
(1996), the ex-ante perspective of probability evaluation should lead us to consider
strictly proper scoring rules because these rules are maximized if and only if the
assessor is honest in reporting their scores.
Following the discussion and notation given in Winkler (1996, see also; Jose,
Nau, & Winkler, 2008), let p ≡ (p1 , . . . pn ) represent the assessor’s epistemic prob-
ability distribution of the outcomes of interest, let r ≡ (r1 . . . rn ) represent the
assessor’s reported epistemic probability of the outcomes of interest, and let ei
represent the probability distribution that assigns a probability of 1 if the event i
occurs and a probability of 0 for all other events. Then, a scoring rule, denoted as
S(r, p), provides a score S(r, ei ) if the event occurs. The expected score obtained
when the assessor reports r when their true distribution is p is
X
S(r, p) = pi S(r, ei ). (1.4)
i
The scoring rule is strictly proper if S(p, p) ≥ S(r, p) for every r and p with equality
when r = p (Jose et al., 2008, p. 1147). We will discuss scoring rules in more detail
in Chapter 11 when we consider model averaging. Suffice it to say here that there
are three popular types of scoring rules.
1. Quadratic scoring rule (Brier score)
n
X
Sk (r) = 2rk − r2i (1.5)
i=1
2. Spherical scoring rule

v
t n
X
Sk (r) = rk r2i (1.6)
i=1
3. Logarithmic scoring rule

Sk (r) = log rk (1.7)
Example 1.2: An Example of Scoring Rules
As a simple example, consider a hypothetical forecast of the amount of snowfall

in Madison, Wisconsin, on February 2, 2023. The forecaster is required to provide
forecasts for four possible intervals of snowfall in inches: [0, 3], (1, 4], [2, 5], [5, ∞).
Now suppose the forecaster provides the following forecasts r = 0.6, 0.2, 0.1, 0.1).
On February 2, 2023 we observe that actual snowfall is in the interval [0, 3]. Then,
the forecaster’s quadratic (Brier) scores would be (0.78, −0.02, −0.22, −0.22. The
forecaster’s spherical scores would be (0.39, 0.13, 0.06, 0.06). Finally, the fore-
caster’s logarithmic scores would be (−0.51, −1.69, −2.30, −2.30). We see that in
each case, the forecaster’s score is maximized when the reported forecast aligns
with the observed outcome. We will take up the idea of scoring rules in Chapter
11 when we discuss Bayesian model averaging.
1.4 Bayes’ Theorem

Having set the stage with our discussion of epistemic probability, we are now ready
to introduce Bayes’ theorem. To begin, an interesting feature of Equation (1.2) is
that joint probabilities are symmetric, namely, p(C ∩ S) = p(S ∩ C). Therefore, we
can also express the conditional probability of smoking, S, given the observation
of lung cancer, C, as
p(S ∩ C)
p(S | C) = (1.8)
p(C)
Because of the symmetry of the joint probabilities, we obtain
p(C | S)p(S) = p(S | C)p(C) (1.9)

Therefore,
p(S | C)p(C)
p(C | S) = (1.10)
p(S)
Equation (1.10) is Bayes’ theorem, also referred to as the inverse probability theorem
(Bayes, 1763).3 In words, Bayes’ theorem states that the conditional probability
of an individual having lung cancer given that the individual smokes is equal to
the probability that the individual smokes given that he/she has lung cancer times
the probability of having lung cancer. The denominator of Equation (1.10), p(S),
is the marginal probability of smoking. This can be considered the probability
of smoking across individuals with and without lung cancer, which we write as
p(S) = p(S | C) + p(S | ¬C).4 Because this marginal probability is obtained over all
possible outcomes of observing lung cancer it does not carry information relevant
to the conditional probability. In fact, p(S) can be considered a normalizing factor,
ensuring that the probability does not exceed 1, as required by Kolmogorov’s
3
What we now refer to as Bayesian probability should perhaps be referred to as Laplacian
probability.
4
The symbol ¬ denotes “not.”
second axiom described earlier. Thus, it is not uncommon to see Bayes’ theorem
written as
p(C | S) ∝ p(S | C)p(C) (1.11)
Equation (1.11) states that the probability of observing lung cancer given smoking
is proportional to the probability of smoking given observing lung cancer times
the marginal probability of observing lung cancer.
It is interesting to note that Bayesian reasoning resolves the so-called base-rate
fallacy, that is, the tendency to equate p(C | S) with P(S | C). Specifically, without
knowledge of the base-rate p(C) (the prior probability) and the total amount of
evidence in the observation P(S), it is a fallacy to believe that p(C | S) = p(S | C).
1.4.1 The Monty Hall Problem

As a means of understanding connections among the concepts of conditional
probability, coherence, and Bayes’ theorem, let’s consider the famous “Monty
Hall” problem. In this problem, named after the host of a popular old television
game show, a contestant is shown three doors, one of which has a desirable
prize, while the other two have undesirable prizes. The contestant picks a door,
but before Monty opens the door, he shows the contestant another door with
an undesirable prize and asks the contestant whether they wish to stay with the
chosen door or switch.
At the start of the game, it is assumed that there is one desirable prize and that
the probability that the desirable prize is behind any of the three doors is 1/3. Once
a door is picked, Monty shows the contestant a door with an undesirable prize and
asks if they would like to switch from the door she originally chose. It is important
to note that Monty will not show the contestant the door with the desirable prize.
Also, we assume that because the remaining doors have undesirable prizes, which
door Monty opens is basically random. Given that there are two doors remaining
in this three-door problem, the probability is 1/2. Thus, Monty’s knowledge
of where the prize is located plays a crucial conditioning role in this problem.
Moreover, at no point in the game are the axioms of probability violated. With
this information in hand, we can obtain the necessary probabilities with which to
apply Bayes’ theorem. Assume the contestant picks door A. Then, the necessary
conditional probabilities are
1. p(Monty opens door B|prize is behind A) = 12 .

2. p(Monty opens door B|prize is behind B) = 0.
3. p(Monty opens door B|prize is behind C) = 1.
The final probability is due to the fact that there is only one door for Monty to
choose given that the contestant chose door A and the prize is behind door B.
Let M represent Monty opening door B. Then, the joint probabilities can be
obtained follows:
1 1 1
p(M ∩ A) = p(M | A)p(A) = × =
2 3 6
1
p(M ∩ B) = p(M | B)p(B) = 0 × = 0
3
1 1
p(M ∩ C) = p(M | C)p(C) = 1 × =
3 3
Before applying Bayes’ theorem, note that we have to obtain the marginal proba-
bility of Monty opening door B. This is
p(M) = p(M ∩ A) + p(M ∩ B) + p(M ∩ C)

1 1 1
= +0+ =
6 3 2
Finally, we can now apply Bayes’ theorem to obtain the probabilities of the prize
lying behind door A or door C.
1 1
p(M | A)p(A) 2 × 3 1
p(A | M) = = 1
=
p(M) 2
3
1
p(M | C)p(C) 3 2
p(C | M) = =1× 1
=
p(M) 2
3
Thus, from Bayes’ theorem, the best strategy on the part of the contestant is to
switch doors. Crucially, this winning strategy is conceived of in terms of long-run
frequency. That is, if the game were played an infinite number of times, then
switching doors would lead to the prize approximately 66% of the time. This is an
example of where long-run frequency can serve to calibrate Bayesian probability
assessments (Dawid, 1982), as we discussed in Section 1.3.2.
1.5 Summary
This chapter provided a brief introduction to probabilistic concepts relevant to
Bayesian statistical inference. Although the notion of epistemic probability pre-
dates the frequentist conception of probability, it had not significantly impacted the
practice of applied statistics until computational developments brought Bayesian
inference back into the limelight. This chapter also highlighted the conceptual
differences between the frequentist and epistemic notions of probability. The
importance of understanding the differences between these two conceptions of
probability is more than just a philosophical exercise. Rather, their differences
are manifest in the elements of the statistical machinery needed for advancing a
Bayesian perspective for research in the social sciences. We discuss the statistical
elements of Bayes’ theorem in the following chapter.
2
Statistical Elements of Bayes’ Theorem
The material presented thus far concerned frequentist and epistemic conceptions
of probability, leading to Bayes’ theorem. The goal of this chapter is to present
the role of Bayes’ theorem as it pertains specifically to statistical inference. Set-
ting the foundations of Bayesian statistical inference provides the framework for
applications to a variety of substantive problems in the social sciences.
The first part of this chapter introduces Bayes’ theorem using the notation
of random variables and parameters. This is followed by a discussion of the
assumption of exchangeability. Following that, we extend Bayes’ theorem to more
general hierarchical models. Next are three sections that break down the elements
of Bayes’ theorem with discussions of the prior distribution, the likelihood, and
the posterior distribution. The final section introduces the Bayesian central limit
theorem and Bayesian shrinkage.
2.1 Bayes’ Theorem Revisited

To begin, denote by Y a random variable that takes on a realized value y. For
example, a student’s score on a mathematics achievement test could be considered
a random variable Y taking on a large set of possible values. Once the student
receives a score on the mathematics test, the random variable Y is now realized as
y. Because Y is an unobserved random variable, we need to specify a probability
model to explain how we obtained the actual data values y. We refer to this model
as the data generating process or DGP.
Next, denote by θ a parameter that we believe characterizes the probability
model of interest. The parameter θ can be a scalar, such as the mean or the variance
of a distribution, or it can be vector-valued, such as a set of regression coefficients
in regression analysis. To avoid too much notational clutter, for now we will use
θ to represent either scaler or vector valued parameters where the difference will
be revealed by the context.
In statistical inference, the goal is to obtain estimates of the unknown parame-
ters given the data. The key difference between Bayesian statistical inference and
frequentist statistical inference involves the nature of the unknown parameters θ.
In the frequentist tradition, the assumption is that θ is unknown, but has a fixed
value that we wish to estimate. In Bayesian statistical inference, θ is also consid-
13
ered unknown, but instead of being fixed, it is assumed, like Y, to be a random

variable possessing a prior probability distribution which reflects our uncertainty
about the true value of θ before having seen the data. Because both the observed
data y and the parameters θ are considered random variables, the probability cal-
culus allows us to model the joint probability of the parameters and the data as a
function of the conditional distribution of the data given the parameters, and the
prior distribution of the parameters. More formally,
p(θ, y) = p(y | θ)p(θ) (2.1)

where p(θ, y) is the joint distribution of the parameters and the data. Using Bayes’
theorem from Equation (1.10), we obtain the following:
p(θ, y) p(y | θ)p(θ)

p(θ | y) = = (2.2)
p(y) p(y)
where p(θ | y) is referred to as the posterior distribution of the parameters θ given
the observed data y. Thus, from Equation (2.2) the posterior distribution of θ
given y is equal to the data distribution p(y | θ) times the prior distribution of
the parameters p(θ) normalized by p(y) so that the posterior distribution sums (or
integrates) to 1. For discrete variables,
X
p(y) = p(y | θ)p(θ) (2.3)
θ
and for continuous variables,

Z
p(y) = p(y | θ)p(θ)dθ (2.4)
θ
As an aside, for complex models with many parameters, Equation (2.4) will be
very hard to evaluate, and it is for this reason we need the computational methods
that will be discussed in Chapter 4.
In line with Equation (1.11), the denominator of Equation (2.2) does not involve
model parameters, so we can omit the term and obtain the unnormalized posterior
distribution:
p(θ | y) ∝ p(y | θ)p(θ) (2.5)

Consider the data density p(y | θ) on the right-hand side of Equation (2.5).
When expressed in terms of the unknown parameters θ for fixed values of y, this
term is the likelihood L(θ | y), which we will discuss in more detail in Section 2.5.
Thus, Equation (2.5) can be rewritten as
p(θ | y) ∝ L(θ | y)p(θ) (2.6)

Equations (2.5) or (2.6) represent the core of Bayesian statistical inference and it
is what separates Bayesian statistics from frequentist statistics. Specifically, Equa-
tion (2.6) states that our uncertainty regarding the true values of the parameters of
our model, as expressed by the prior distribution p(θ), is weighted by the actual data
Statistical Elements of Bayes’ Theorem 15
p(y | θ) (or equivalently, L(θ | y)), yielding (up to a constant of proportionality)

an updated estimate of our uncertainty, as expressed in the posterior distribution
p(θ | y). Following a brief digression to discuss the assumption of exchangeability,
we will take up each of the elements of Equation (2.2) – the prior distribution, the
likelihood, and the posterior distribution.
2.2 Hierarchical Models and Pooling

At this point, the careful reader might have noticed that because θ is a random
variable described by a probability distribution, then perhaps θ can be modeled by
its own set of parameters. Indeed this is the case, and such parameters are referred
to as hyperparameters. Hyperparameters control the degree of informativeness in
prior distributions. For example, if we use a normal prior for the mean of some
outcome of interest, then a more accurate expression for this could be written as
p(θ | ϕ) ∼ N(0, 1) (2.7)
where ϕ contains the hyperparameters µ = 0, σ = 1. Naturally, the hierarchy can

continue, and we can specify a hyperprior distribution for the hyperparameters —
that is,
p(ϕ | δ) ∼ N(· | ·) (2.8)
and so on. Although this feels like “turtles all the way down,” for practical
purposes, the hierarchy needs to stop, at which point the parameters at the end
of the hierarchy are assumed to be fixed and known. Nevertheless, Bayesian
methods do allow one to examine the sensitivity of the results to changes in the
values of those final parameters.
Our discussion about hyperparameters leads to the important concept of pool-
ing in Bayesian analysis. Specifically, we can distinguish between three situations
related to Bayesian models: (a) no pooling, (b) complete pooling, and (c) partial pool-
ing. To give an example, consider the situation of modeling mastery of some
mathematics content for a sample of n students. In the no pooling case, this would
indicate that each student has his/her own chance of mastery of the content. From a
population perspective, this would imply an infinitely large variance. In the com-
plete pooling case, it is assumed that each student has exactly the same chance of
mastering the content, and again from the population perspective, this would im-
ply zero variance. A balance of these extremes is achieved through partial pooling
wherein each student is assumed to have their own chance at mastering the con-
tent, but data from other students help inform each student’s chance of mastery.
Partial pooling is achieved through specifying Bayesian hierarchical models such
as Equation (2.8).
Note that our discussion also extends to groupings of observations, such as
students nested in schools. Here in the no pooling case, each of g schools could
have its own its own separate parameters, say, θ g (g = 1, 2, . . . , G), or there could
be one set of parameters that describes the characteristics of all schools as in the
complete pooling case. Finally, each school could have its own set of parameters,
in turn modeled by a set of hyperparameters as in the partial pooling case. This
leads to a very natural way to conceive of multilevel models which we will take
up in Chapter 7.
2.3 The Assumption of Exchangeability

In most discussions of statistical modeling, it is common to assume that the data
y1 , y2 , . . . yn are independently and identically distributed — often referred to as
the iid assumption. As Lancaster (2004, p. 27) has pointed out, the iid assump-
tion suggests that probability has an “objective” existence; that there exists a
random number generator that produces a sequence of independent random vari-
ables. However, such an objectivist view of probability does not accord with the
Bayesian notion that probabilities are epistemic. As such, Bayesians invoke the
deeper notion of exchangeability to produce likelihoods and address the issue of
independence.
Exchangeability arises from de Finetti’s representation theorem (de Finetti,
1974, see also; Bernardo, 1996) and implies a judgment of the “symmetry” or
“similarity” of the information provided by the observations. This judgment is
expressed as
p(yi , . . . yn ) = p(yπ[i] , . . . , yπ[n] ) (2.9)
for all subscript permutations π. This sequence of random variables is said to be
(finite) exchangeable if the sequence in Equation (2.9) holds for a finite sequence of
the random variables (Bernardo, 1996, p. 2). In other words, the joint distribution
of the data, p(y1 , y2 , . . . yn ) is invariant to permutations of the subscripts. The idea
of infinite exchangeability requires that the infinite sequence of random variables
yπ[i] , . . . is finite exchangeable for every subsequence, with finite exchangeability
defined in Equation (2.9) (Bernardo & Smith, 2000, p. 171)
Example 2.1: An Example of Finite Exchangeability
Consider the response that student i (i = 1, 2, . . . , 10) makes to a question on an

educational questionnaire assessing attitudes toward their teacher, such as “My
teacher is supportive of me,” where

1, if student i agrees

yi = 

(2.10)
0, if student i disagrees

Next, consider three patterns of responses by 10 randomly selected students:
p(1, 0, 1, 1, 0, 1, 0, 1, 0, 0) (2.11a)
p(1, 1, 0, 0, 1, 1, 1, 0, 0, 0) (2.11b)
p(1, 0, 0, 0, 0, 0, 1, 1, 1, 1) (2.11c)
We have just presented three possible patterns, but notice that there are 210 =
1, 024 possible patterns of agreement and disagreement among the 10 students. If
our task were to assign probabilities to all possible outcomes, this could become
prohibitively difficult. However, suppose we now assume that student responses

are independent of one another, which might be reasonable if each student is
privately asked to rate the teacher on supportiveness. Then, exchangeability
implies that only the proportion of agreements matter, not the location of those
agreements in the vector. In other words, given that the sequences are the same
length n, we can exchange the response of student i for student j without changing
our belief about the probability model that generated that sequence.
Exchangeability is a subtle assumption insofar as it means that we believe that
there is a parameter θ that generates the observed data via a stochastic model
and that we can describe that parameter without reference to the particular data
at hand. As Jackman (2009) points out, the fact that we can describe θ without
reference to a particular set of data is, in fact, what is implied by the idea of a prior
distribution. Indeed, as Jackman notes, “the existence of a prior distribution over
a parameter is a result of de Finetti’s Representation Theorem (de Finetti, 1974),
rather than an assumption” (p. 40, italics Jackman’s).
It is important to note that iid random variables implies exchangeability. Specif-
ically, iid implies that
iid
p(yi , . . . yn ) = p(y1 ) × p(y2 ) × . . . × p(yn ) (2.12)
Because the right-hand side can be multiplied in any order, it follows that the left-
hand side is symmetric, and hence exchangeable. However, exchangeability does
not necessarily imply iid. A simple example of this idea is the case of drawing balls
from an urn without replacement (Suppes, 1986, p. 348). Specifically, suppose we
have an urn containing one red ball and two white balls and we are told to draw
one ball out without replacement. Then,

1, if the ith ball is red

yi = 

(2.13)
0, otherwise

To see that yi is exchangeable, note that

1 1
P(y1 = 1, y2 = 0, y3 = 0) = ×1×1= (2.14a)
3 3
2 1 1
P(y1 = 0, y2 = 1, y3 = 0) = × × 1 = (2.14b)
3 2 3
2 1 1
P(y1 = 0, y2 = 0, y3 = 1) = × × 1 = (2.14c)
3 2 3
We see that the permutations of (1,0,0) are exchangeable, however they are not iid,
because given that the first draw from the urn is red, we definitely will not get a
red ball on the second draw, but if we don’t obtain a red ball on the first draw, we
will have a higher probability of getting a red ball on the second draw, and thus
there is a dependency in the sequence of draws.
Our discussion of exchangeability has so far rested on the assumption of
independent responses. In the social sciences it is well understood that this is a
very heroic assumption. Perhaps the best example of a violation of this assumption
concerns the problem of modeling data from clustered samples, such as assessing
the responses of students nested in classrooms. To address this issue, we will

need to consider the problem of conditional exchangeability, which we will discuss
in Chapter 7 where we will deal with the topic Bayesian multilevel modeling.
2.4 The Prior Distribution

Returning to Equation (2.2), we begin with a discussion of the prior distribution,
which is the key component that defines Bayesian statistics. It is useful to remind
ourselves that no study is conducted in the complete absence of knowledge derived
from previous research. Whether we are designing randomized experiments or
sketching out path diagrams, the information gleaned from previous research is
almost always incorporated into our choice of designs, the variables we choose to
include in our models, or conceptual diagrams that we draw. Indeed, researchers
who postulate a directional hypothesis for an effect are almost certainly drawing
on prior information about the direction that an estimate must reasonably take.
Thus, when a researcher encounters a result that does not seem to align with
previous established findings, then all other things being equal, the researcher’s
surprise is related to having held a belief about what is reasonable, usually based on
prior research and/or elicited from expert judgment. Bayesian statistical inference
simply requires that this prior knowledge be made explicit, but then confronts the
prior knowledge by the actual data in hand. Moderation of prior knowledge by
the data in hand is the key meaning behind Equation (2.2).
How then might we defend the use of prior distributions when conducting
Bayesian analyses? A straightforward argument is that prior distributions directly
encode our assumptions regarding reasonable values of model parameters. These
assumptions reflect existing knowledge (or lack thereof) about the parameters
of interest, and, consistent with our philosophical stance discussed in the Pref-
ace, these assumptions are testable against other assumptions, and therefore falls
squarely into a neo-Popperian framework (see Gelman & Shalizi, 2013). As we
will see, these assumptions lie at the level of the model parameters as well as the
models themselves.
2.4.1 Non-Informative Priors

In some cases, we may not be in possession of enough prior information to aid in
drawing posterior inferences. Or, from a policy or clinical perspective, it may be
necessary to refrain from providing substantive information regarding effects of
interest and, instead, let the data speak. Regardless, from a Bayesian perspective,
this lack of information is still important to consider and incorporate into our
statistical analyses. In other words, it is equally important to quantify a lack of
information as it is to quantify the cumulative understanding of a problem at hand.
The standard approach to quantifying a lack of information is to incorpo-
rate non-informative prior probability distributions into our model specifications.
Non-informative priors are also referred to as vague or diffuse priors. In the situa-
tion in which there is no prior knowledge to draw from, perhaps the most obvious
non-informative prior distribution to use is the uniform probability distribution
U(α, β) over some sensible range of values from α to β. Application of the uni-
form distribution is based on the Principle of Insufficient Reason, first articulated
by Laplace (1774/1951), which states that in the absence of any relevant (prior)
evidence, one should assign their degrees-of-belief equally among all the possible
outcomes. In this case, the uniform distribution essentially indicates that our as-
sumption regarding the value of a parameter of interest is that it lies in the range
β − α and that all possible values have equal probability. Care must be taken
in the choice of the range of values over the uniform distribution. For example,
U[−∞, ∞] is an improper prior distribution insofar as it does not integrate to 1.0 as
required of probability distributions. We will discuss the uniform distribution in
more detail in Chapter 3.
2.4.2 Jeffreys’ Prior

A problem with the uniform prior distribution is that it is not invariant to simple
transformations. In fact, a transformation of a uniform prior can result in a prior
that is not uniform and will end up favoring some values more than others.
Moreover, the expected values of the distributions are not the same. As a very
simple example, suppose we assign a U(0, 1) prior distribution to a parameter θ
and let θ′ = θ2 , then the expected values of these two distributions would be
Z 1 Z 1
1
E[p(θ)] = θp(θ)dθ = θdθ = (2.15)
0 0 2
and Z 1 Z 1
1 ′1/2 1
E[p(θ )] =
′
θ p(θ )dθ =
′ ′ ′
θ = (2.16)
0 0 2 3
which shows the lack of invariance to a simple transformation of the parameters.
In addressing the invariance problem associated with the uniform distribution,
Jeffreys (1961) proposed a general approach that yields a prior that is invariant
under transformations. The central idea is that knowledge and information con-
tained in the prior distribution of a parameter θ should not be lost when there
is a one-to-one transformation from θ to another parameter, say ϕ, denoted as
ϕ = h(θ). More specifically, using transformation-of-variables calculus, the prior
distribution p(ϕ) will be equivalent to p(θ) when obtained as
dθ
p(ϕ) = p(θ) (2.17)
dϕ
On the basis of the relationship in Equation (2.17), Jeffreys (1961) developed a non-
informative prior distribution that is invariant under transformations, written as
p(θ) ∝ [I(θ)]1/2 (2.18)

where I(θ) is the Fisher information matrix for θ. The derivation of this result is as
follows. Write the Fisher information matrix for a parameter θ as
∂2 log p(y | θ)
!
I(θ) = −E y|θ (2.19)
∂θ2
Next, we write the Fisher information matrix for ϕ as

∂2 log p(y | ϕ)
!
I(ϕ) = −E y|ϕ (2.20)
∂ϕ2
From the change-of-variables expression in Equation (2.17), we can rewrite Equa-
tion (2.20) as
2
 ∂ (log p(y | θ))
 2
dθ 
I(ϕ) = −E y|θ  × (2.21)
∂θ2 dϕ 

2
dθ
= I(θ)
dϕ
Therefore,
dθ
I(ϕ)1/2 = I(θ)1/2 (2.22)
dϕ
from which we obtain the relationship to Equation (2.18).
It should be noted that Jeffreys’ prior is part of a class of so-called reference
priors that are designed to place the choice of priors on objective grounds based
on a set of agreed upon rules (Bernardo, 1979; Kass & Wasserman, 1996; Berger,
2006). In other words, reference priors are an algorithmic approach to choosing a
prior, and ideally, should be invariant to transformations of the parameter space.
Jeffreys’ prior is one such reference prior, and is considered the most popular. We
will provide the Jeffreys’ priors for common probability distributions used in the
social and behavioral sciences in Chapter 3.
2.4.3 Weakly Informative Priors

Situated between non-informative and informative priors is the class of weakly
informative priors. Weakly informative priors are probability distributions that
provide one with a method for incorporating less information than one actually
has in a particular situation. Specifying weakly informative priors can be useful
for many reasons. First, it is doubtful that one has absolutely no prior information
on which to base assumptions about parameters and for which a non-informative
prior would be appropriate. Rather, it is likely that one can consider a range of
parameter values that are reasonable. As an example, we know that the PISA 2018
reading assessment was internationally scaled to have a normal distribution with
a mean of 500 and standard deviation of 100, but of course the means and standard
deviations of the reading scores will be different for individual countries. Thus,
if we were to model the reading scores for the United States, we would not want
to place a non-informative prior distribution on the mean, such as the (improper)
U(−∞, +∞) or even a vague Gaussian prior such as N(0, 100). Rather, a weakly
informative prior distribution that contains a range of reasonable values might be
N(500, 10) to reflect reasonable assumptions about the mean but also recognizing
uncertainties in a given analysis.
Second, weakly informative priors are useful in stabilizing the estimates of a
model. As we will see, Bayesian inference can be computationally demanding,
particularly for hierarchical models, and so although one may have information
about, say, higher level variance terms, such terms may not be substantively
important, and/or they may be difficult to estimate. Therefore, providing weakly
informative prior information may help stabilize the analysis without impacting
inferences.
Finally, as discussed in Gelman, Carlin, et al. (2014), weakly informative priors
can be useful in theory testing where it may appear unfair to specify strong priors
in the direction of one’s theory. Rather, specifying weakly informative priors in
the opposite direction of a theory would then require the theory to pass a higher
standard of evidence.
In suggesting an approach to constructing weakly informative priors, Gelman,
Carlin, et al. (2014) consider two procedures: (1) Start with non-informative priors
and then shift to trying to place reasonable bounds on the parameters according to
the substantive situation. (2) Start with highly informative priors and then shift to
trying to elicit a more honest assessment of uncertainty around those values. From
the standpoint of specifying weakly informative priors, the first approach seems
the most sensible. The second approach appears more useful when engaging in
sensitivity analyses — a topic we will take up later.
2.4.4 Informative Priors

In the previous section, we considered the situation in which there may not be
much prior information that can be brought to bear on a problem. In that situation
we would rely on non-informative or weakly informative priors. Alternatively,
it may be the case that previous research, expert opinion (O’Hagan et al., 2006;
Hanea, Nane, Bedford, & French, 2021), or both, can be brought to bear on a prob-
lem and be systematically incorporated into prior distributions. Such priors are
referred to as informative – though one could argue that applying non-informative
priors in the case where one has little to no past information to rely on is, itself,
quite informative.
We will focus on one type of informative prior based on the notion of conju-
gacy. A conjugate prior distribution is one that, when combined with the likelihood
function, yields a posterior that is in the same distributional family as the prior
distribution. Conjugate distributions are convenient because if a prior is not con-
jugate, the resulting posterior distribution may have a form that is not analytically
simple to solve. Although the existence of numerical simulation methods for
Bayesian inference, such as Markov Chain Monte Carlo (MCMC) sampling (dis-
cussed in Chapter 4), renders the importance of conjugate distributions less of
a problem. Chapter 3 will outline conjugate priors for probability distributions
commonly encountered in the social sciences; and throughout this book, conjugate
priors will be used in examples when studying informative priors.
To motivate the use of informative priors, consider the problem in education
research of academic achievement and its relationship to class size. In this case, we
have a considerable amount of prior information based on previous studies regard-
ing the increase in academic achievement when reducing class size (Whitehurst
& Chingos, 2011). It may be that previous investigations used different tests of
academic achievement but when examined together, it has been found that reduc-
ing class size to approximately 15 children per classroom results in one-quarter of

a standard deviation increase (say about 8 points) in academic achievement. In
addition to a prior estimate of the average achievement gain due to reduction in
class size, we may also wish to quantify our uncertainty about the exact value of θ
by specifying a probability distribution around the prior estimate of the average.
Perhaps a sensible prior distribution would be a Gaussian distribution centered at
θ = 8. However, let us imagine that previous research has shown that achievement
gains due to class size reduction are almost never less than 5 points and almost
never more than 14 points (almost a full standard deviation). Taking this range of
uncertainty into account, we would propose as our initial assumption about the
parameter of interest a prior distribution on θ that is N(8, 1).
The careful reader might have wondered if setting hyperparameters to these
fixed values violates the essence of Bayesian philosophy that all parameters are
considered as unknown random variables requiring probability distributions to
encode uncertainty. To address that concern, note first that the Bayesian approach
does permit one to treat hyperparameters as fixed, but then they are presumably
known. Of course, as in the above class size example, this represents an hypothesis
of sorts and can be compared formally to other fixed values of hyperparameters
that have resulted from a different assumptions. In contrast, the frequentist
approach treats all parameters as unknown and fixed. Second, as we discussed
in Section 2.2, it is not necessary to set hyperparameters to known and fixed
quantities. Rather, in a fully hierarchical Bayesian model, it is possible to specify
hyperprior distributions on the hyperparameters. In the class size example, a
hierarchical model would leave the hyperparameters as unknown but modeled
in terms of a hyperprior distribution reflecting the variation in the class size
parameters in terms of, say, variation in class size policies across the United States.
Regardless, differences of opinion on the values specified for hyperparameters
can be directly compared via Bayesian model testing which we will discuss in
Chapter 6.
2.4.5 An Aside: Cromwell’s Rule

An important feature of specifying prior probabilities is that even though prob-
abilities range from zero to one, prior probabilities have to lie between zero and
one. That is, a prior probability cannot be either exactly zero or exactly one. This
feature of prior probability is referred to as Cromwell’s rule
Cromwell’s rule was coined by Dennis V. Lindley (2007), and the reference is
to Oliver Cromwell, who wrote to the General Assembly of the Church of Scotland
on 3 August 1650, including a phrase that has become well known and frequently
quoted:
I beseech you, in the bowels of Christ, think it possible that you may
be mistaken.
That is, there must be some allowance for the possibility, however slim, that you
are mistaken. If we look closely at Bayes theorem, we can see why. Recall that
Bayes theorem is written as
p(y | θ)p(θ)
p(θ | y) = (2.23)
p(y)
and so, if you state your prior probability of an outcome to be exactly zero, then
p(y | θ) ∗ 0
p(θ | y) = =0 (2.24)
p(y)
and thus no amount of evidence to the contrary would change your mind.
What if you state your prior probability to be exactly 1? In this case recall that
the denominator p(y) is a marginal distribution across all possible values of θ. So,
if p(θ) = 1, the denominator p(y) collapses to only your hypothesis p(y | θ) and
therefore
p(y | θ)
p(θ | y) = =1 (2.25)
p(y | θ)
Again, no amount of evidence to the contrary would change your mind; the prob-
ability of your hypothesis is 1.0 and you’re sticking to your hypothesis, no matter
what. Cromwell’s rule states that unless the statements are deductions of logic,
then one should leave some doubt (however small) in probability assessments.
2.5 Likelihood
Whereas the prior distribution encodes our accumulated knowledge/assumptions
of the parameters of interest, this prior information must, of course, be moderated
by the data in hand before yielding the posterior distribution — the source of our
current inferences. In Equation (2.5) we noted that the probability distribution of
the data given the model parameters, p(y | θ) could be written equivalently as the
L(θ | y), the likelihood of the parameters given the data.
The concept of likelihood is extremely important for both the frequentist and
Bayesian schools of statistics. Excellent discussions of likelihood can be found in
Edwards (1992) and Royall (1997). In this section, we briefly review the law of
likelihood and then present simple expressions of the likelihood for the binomial
probability and normal sampling models.
2.5.1 The Law of Likelihood

The likelihood can be defined as proportional to the probability of the data y given
the parameter(s) θ. That is
L(θ | y) ∝ p(y | θ), (2.26)
where the constant of proportionality does not depend on θ. As a stand-alone
quantity, the likelihood is of little value. Rather, what matters is the ratio of the
likelihoods under different hypotheses regarding θ, and this leads to the law of
likelihood. We define the law of likelihood as follows (see also Royall, 1997):
Definition 2.5.1. If hypothesis θ1 implies that Y takes on the value y with prob-
ability p(y | θ1 ) while hypothesis θ2 implies that Y takes on the value y with
probability p(y | θ2 ), then the law of likelihood states that the realization Y = y is
evidence in support of θ1 over θ2 if and only if L(θ1 | y) > L(θ2 | y). The likelihood
ratio, L(θ1 | y)/L(θ2 | y), measures the strength of that evidence.
Notice that the law of likelihood implies that only the information in the data as
summarized by the likelihood serve as evidence in corroboration (or refutation)
of a hypothesis. This latter idea is referred to as the likelihood principle. Notice
also that frequentist notions of repeated sampling do not enter into the law of
likelihood or the likelihood principle. The issue of conditioning on data that was
not observed will be revisited in Chapter 6 when we take up the problem of null
hypothesis significance testing.
Example 2.2: The Binomial Probability Model
First, consider the number of correct answers on a test of length n. Each item
on the test represents a Bernoulli trial, with outcomes 0 = wrong and 1 = right.
The natural probability model for data arising from n Bernoulli sequences is the
binomial sampling model. Under the assumption of exchangeability – meaning
the indexes 1 ... n provide no relevant information – we can summarize the total
number of successes by y. Letting θ be the proportion of correct responses in the
population, the binomial sampling model can be written as
!
n y
p(y | θ) = Bin(y | n, θ) = θ (1 − θ)(n−y)
y
= L(θ | n, y) (2.27)
where ny is read as “n choose y” and refers to the number of successes y in a

sequence of n “right/wrong” Bernoulli trials that can be obtained from an n item

test. The symbol Bin is shorthand for the binomial density function.
Example 2.3: The Normal Sampling Model
Next consider the likelihood function for the parameters of the simple normal
distribution which we write as
(y − µ)2
!
1
p(y | µ, σ ) = √
2
exp − (2.28)
2πσ 2σ2
where µ is the population mean and σ2 is the population variance. Under the
assumption of independent observations, we can write Equation (2.28) as
n
Y
p(y1 , y2 , . . . , yn | µ, σ2 ) = p(yi | µ, σ2 )
i
(y − µ)2 
 P 
 i i
!n/2 
1
= √

exp − 2

2πσ2  2σ 

= L(θ | y) (2.29)
where θ = (µ, σ). Thus, under the assumption of independence, the likelihood
of model the parameters given the data is simply the product of the individual
probabilities of the data given the parameters.
2.6 The Posterior Distribution

Continuing with our breakdown of Equation (2.2), notice that the posterior
distribution of the model parameters is obtained by multiplying the probability
distribution of the data given the parameters by the prior distribution of the
parameters. As we will see, the posterior distribution can be quite complex with
no simple closed form solution, and modern Bayesian statistics typically uses
computational methods such as MCMC for drawing inferences about model
parameters from the posterior distributions. We will discuss MCMC methods
in Chapter 4. Here we provide two small examples of posterior distributions
utilizing conjugate prior distributions. In Chapter 3, we will consider the prior
and posterior for other important distributions commonly encountered in the
social sciences.
Example 2.4: The Binomial Distribution with Beta Prior
Consider again the binomial distribution used to estimate probabilities for suc-
cesses and failures, such as obtained from responses to multiple choice questions
scored right/wrong. As an example of a conjugate prior, consider estimating the
number of correct responses y on a test of length n. Let θ be the proportion of cor-
rect responses. We first assume that the responses are independent of one another.
The binomial sampling model was given in Equation (2.27) and reproduced here:
!
n y
p(y | θ) = Bin(y | n, θ) = θ (1 − θ)(n−y) . (2.30)
y
One choice of a conjugate prior distribution for θ is the beta(a,b) distribution. The
beta distribution is a continuous distribution appropriate for variables that range
from 0 to 1. The terms a and b are the shape and scale parameters of the beta
distribution, respectively. The shape parameter, as the term implies, affects the
shape of the distribution. The scale parameter affects spread of the distribution,
in the sense of shrinking or stretching the distribution. For this example, a and b
will serve as hyperparameters because the beta distribution is being used as a prior
distribution for the binomial distribution. The form of the beta(a,b) distribution is
Γ(a + b) a−1
p(θ; a, b) = θ (1 − θ)b−1 (2.31)
Γ(a)Γ(b)
where Γ is the gamma(a, b) function. Multiplying Equation (2.30) and Equation
(2.31) and ignoring terms that don’t involve model parameters, we obtain the
posterior distribution by
Γ(n + a + b)
p(θ | y) = θ y+a−1 (1 − θ)n−y+b−1 (2.32)
Γ(y + a)Γ(n − y + b)
which itself is a beta distribution with parameters a′ = a + y and b′ = b + n − y.

Thus, the beta prior for the binomial sampling model is a conjugate prior that
yields a posterior distribution that is also in the family of the beta distribution.
Example 2.5: The Gaussian Distribution with Gaussian Prior: σ2 Known
This next example explores the Gaussian prior distribution for the Gaussian
sampling model in which the variance σ2 is assumed to be known. Thus, the
problem is one of estimating the mean µ. Let y denote a data vector of size n.
We assume that y follows a Gaussian distribution shown in Equation (2.28) and
reproduced here:
(y − µ)2
!
1
p(y | µ, σ2 ) = √ exp − (2.33)
2πσ 2σ2
Consider that our prior distribution on the mean is Gaussian with mean and vari-
ance hyperparameters, κ and τ2 , respectively which for this example are assumed
known. The prior distribution can be written as
(µ − κ)2
!
1
p(µ | κ, τ ) = √
2
exp − (2.34)
2πτ2 2τ2
After some algebra, the posterior distribution can be obtained as

 κ n ȳ
 τ2 + σ2

τ σ
2 2 
p(µ | y) ∼ N  1 , (2.35)

n σ2 + nτ2 
+

τ2 σ2
Thus, the posterior distribution of µ is Gaussian with mean:

κ n ȳ
τ2
+ σ2
µ̂ = (2.36)
1
τ2
+ n
σ2
and variance
τ2 σ2
σˆ2µ = (2.37)
σ2 + nτ2
We see from Equations (2.35), (2.36), and (2.37) that the Gaussian prior is conjugate
for the likelihood, yielding a Gaussian posterior.
2.7 The Bayesian Central Limit Theorem and

Bayesian Shrinkage
When examining Equation (2.36) carefully, some interesting features emerge. For
example, notice that as the sample size approaches infinity, there is no information
in the prior distribution that is relevant to estimating the mean and variance of
the posterior distribution. To see this, first we compute the asymptotic posterior
mean by letting n go to infinity:
κ n ȳ
τ2
+ σ2
lim µ̂ = lim
n→∞ n→∞ 1 + n
τ2 σ2
κσ2
2 + ȳ
= lim nτ = ȳ (2.38)
n→∞ σ + 1
2
nτ2
Thus as the sample size increases to infinity, the expected a posteriori estimate µ̂
converges to the maximum likelihood estimate ȳ
In terms of the variance, first let 1/τ2 and n/σ2 to refer to the prior precision
and data precision, respectively. The role of these two measures of precision can
be seen by once again examining the variance term for the normal distribution in
Equation (2.37). Specifically, letting n approach infinity, we obtain
1
lim σˆ2µ = lim
n→∞ n→∞ 1
τ2
+ n
σ2
σ2 σ2
= lim = (2.39)
n→∞ σ2 +n n
τ2
which we recognize as the maximum likelihood estimator of the variance of the

mean, the square root of which yields the standard error of the mean. A similar
result emerges if we consider the case where we have very little information
regarding the prior precision. That is, choosing a very large value for τ2 gives the
same result.
The fact that as n increases to infinity leads to the same result as we would
obtain from the frequentist perspective has been referred to as the Bayesian central
limit theorem. This is important from a philosophical perspective insofar as it
suggests that Bayesian and frequentist results will agree in the limit.
The posterior distribution in Equation (2.35) reveals another interesting feature
regarding the relationship between the maximum likelihood estimator of the mean,
ȳ, and the prior mean, κ. Specifically, the posterior mean, µ, can be seen as a
compromise between the prior mean, κ, and the observed data mean, ȳ. To see
this clearly, notice that we can rewrite Equation (2.35) as
σ2 nτ2
µ̂ = κ + ȳ (2.40)
σ2 + nτ2 σ2 + nτ2
Thus, the posterior mean is a weighted combination of the prior mean and
observed data mean. These weights are bounded by 0 and 1 and together
are referred to as the shrinkage factor. The shrinkage factor represents the
proportional distance that the posterior mean has shrunk back to the prior
mean, κ, and away from the maximum likelihood estimator, ȳ. Notice that
if the sample size is large, the weight associated with κ will approach zero
and the weight associated with ȳ will approach one. Thus µ̂ will approach ȳ.
Similarly, if the data variance, σ2 , is very large relative to the prior variance,
τ2 , this suggests little precision in the data relative to the prior and therefore
the posterior mean will approach the prior mean, κ. Conversely, if the prior
variance is very large relative to the data variance, this suggests greater precision
in the data compared to the prior and therefore the posterior mean will approach ȳ.
Example 2.6: The Gaussian Distribution with Gaussian Prior: σ2 Unknown
Perhaps a more realistic situation that arises in practice is when the mean
and variance of the Gaussian distribution are unknown. In this case, we need to
specify a full probability model for both the mean µ and variance σ2 . If we assume
that µ and σ2 are independent of one another, then we can factor the joint prior
distribution of µ and σ2 as
p(µ, σ2 ) = p(µ)p(σ2 ) (2.41)
We now need to specify the prior distribution of σ2 . There are two approaches that
we can take to specify the prior for σ2 . First, we can specify a uniform prior on
µ and log(σ2 ), because when converting the uniform prior on log(σ2 ) to a density
for σ2 , we obtain p(σ2 ) = 1/σ2 .1 With uniform priors on both µ and σ2 , the joint
prior distribution p(µ, σ2 ) ∝ 1/σ2 . However, the problem with this first approach is
that the uniform prior over the real line is an improper prior. Therefore, a second
approach would be to provide proper informative priors, but with a choice of
hyperparameters such that the resulting priors are quite diffused. First, again we
assume as before that y ∼ N(µ, σ2 ) and that µ ∼ N(κ, τ2 ). As will be discussed in the
next chapter, the variance parameter, σ2 , follows an inverse-gamma distribution
with shape and scale parameters, a and b, respectively. Succinctly, σ2 ∼ IG(a, b)
and the probability density function for σ2 can be written as
2
p(σ2 | a, b) ∝ (σ2 )−(a+1) e−b/(σ )
(2.42)
Even though Equation (2.42) is a proper distribution for σ2 , we can see that as
a and b approach 0, the proper prior approaches the non-informative prior 1/σ2 .
Thus, very small values of a and b can suffice to provide a prior on σ2 to be used
to estimate the joint posterior distribution of µ and σ2 .
The final step in this example is to obtain the joint posterior distribution for µ
and σ2 . Assuming that the joint prior distribution is 1/σ2 , then the joint posterior
distribution can be written as
n
(yi − µ)2
( )
1 Y 1
p(µ, σ | y) ∝ 2
2
√ exp − (2.43)
σ i=1 2πσ2 2σ2
1
Following Lynch (2007), this result is obtained via a change of variable calculus. Specifi-
cally, let k = log(σ2 ), and p(k) ∝ constant. Change of variable calculus involves the Jacobian
J = dk/dσ2 = 1/σ2 . Therefore, p(σ2 ) ∝ constant × J.
Notice, however, that Equation (2.43) involves two parameters µ and σ2 . The solu-
tion to this problem is discussed in Lynch (2007). First, the posterior distribution
of µ obtained from Equation (2.43) can be written as
nµ2 − 2n ȳµ
( )
p(µ | y, σ ) ∝ exp −
2
(2.44)
2σ2
Solving for µ involves dividing by n and completing the square yielding
(µ − ȳ)2
( )
p(µ | y, σ ) ∝ exp −
2
(2.45)
2σ2 /n
There are several ways to determine the posterior distribution of σ2 . Perhaps

the simplest is to recognize that the joint posterior distribution for µ and σ2 can be
factored into their respective conditional distributions:
p(µ, σ2 | y) = p(µ | σ2 , y)p(σ2 | y) (2.46)
The first term on the right-hand side of Equation (2.46) was solved above assuming
σ2 is known. The second term on the right hand side of Equation (2.46) is the
marginal posterior distribution of σ2 . An exact expression for p(σ2 | y) can be
obtained by integrating the joint distribution over µ – that is,
Z
p(σ2 | y) = p(µ, σ2 | y)dµ (2.47)
Although this discussion shows that analytic expressions are possible for the
solution of this simple case, in practice the advent of MCMC algorithms render
the solution to the joint posterior distribution of model parameters quite straight-
forward.
2.8 Summary
With reference to any parameter of interest, be it a mean, variance, regression
coefficient, or a factor loading, Bayes’ theorem is composed of three parts: (1) the
prior distribution representing our cumulative knowledge about the parameter
of interest; (2) the likelihood representing the data in hand; and (3) the posterior
distribution, representing our updated knowledge based on the moderation of the
prior distribution by the likelihood. By carefully decomposing Bayes’ theorem into
its constituent parts, we also can see its relationship to frequentist statistics, partic-
ularly through the Bayesian central limit theorem and the notion of shrinkage. In
the next chapter we focus on the relationship between the likelihood and the prior.
Specifically, we examine a variety of common data distributions used in social and
behavioral science research and define their conjugate prior distributions.
3
Common Probability Distributions and
Their Priors
In Chapter 2, we introduced the core concepts of Bayesian statistical inference.

Recall that the difference between Bayesians and frequentists is that Bayesians
hold that all unknowns, including parameters, are to be considered as random
variables that can be described by probability distributions. In contrast, frequen-
tists hold that parameters are also unknown, but that they are fixed quantities.
The key idea of Bayesian statistics is that the probability distribution of the data is
multiplied by the prior distribution of the parameters of the data distribution in
order to form a posterior distribution of all unknown quantities. Deciding on the
correct probability model for the data and the prior probability distribution for the
parameters is at the heart of Bayesian statistical modeling. Because the focus of
Bayesian statistical inference is the probability distribution, it is essential that we
explore the major distributions that are used in the social sciences and examine
associated prior probability distributions on the parameters.
The organization of this chapter is as follows. We will first explore the follow-
ing data distributions:
1. The Gaussian distribution
• Mean unknown/variance known
• Mean known/variance unknown
2. The Poisson distribution
3. The binomial distribution
4. The multinomial distribution
5. The inverse-Wishart distribution
6. The LKJ distribution for correlation matrices
For each of these data distributions, we will discuss how they are commonly
used in the social sciences and then describe their shape under different sets
of parameter values. We will then describe how each of these distributions is
typically incorporated in Bayesian statistical modeling by describing the conjugate
prior distributions typically associated with the respective data distributions. We
31
will also describe the posterior distribution that is derived from applying Bayes’
theorem to each of these distributions. Finally, we provide Jeffreys’ prior for
each of the univariate distributions. We will not devote space to describing more
technical details of these distributions, such as the moment generating functions or
characteristic functions. A succinct summary of the technical details of these and
many more distributions can be found in Evans, Hastings, and Peacock (2000).1
3.1 The Gaussian Distribution

The Gaussian (normal) distribution figures prominently in the field of statistics,
owing perhaps mostly to its role in the central limit theorem. Assumptions of
Gaussian distributed data abound in the social and behavioral sciences literature,
and is a particularly useful distribution in the context of measuring human traits.
The Gaussian distribution is typically assumed for test score data such as the
various literacy outcomes measured in large-scale assessments.
We write the Gaussian distribution as
y ∼ N(µ, σ2 ) (3.1)
where
E[y] = µ (3.2)
V[y] = σ 2
(3.3)
where E[·] and V[·] are the expectation and variance operators, respectively. The
probability density function of the Gaussian distribution was given in Equation
(2.28) and reproduced here:
(y − µ)2
!
1
p(y | µ, σ ) = √
2
exp − (3.4)
2πσ 2σ2
In what follows, we consider conjugate priors for two cases of the Gaussian
distribution. The first case is where the mean of the distribution is unknown
but the variance is known and the second case is where the mean is known but
variance is unknown. The Gaussian distribution is often used as a conjugate
prior for parameters that are assumed to be Gaussian in the population, such as
regression coefficients.
3.1.1 Mean Unknown, Variance Known: The Gaussian Prior

In the case where the mean is unknown but the variance is known, the prior
distribution of the mean is Gaussian, as we know from standard statistical theory.
Thus, we can write the prior distribution for the mean as
1
To create the plots, artificial data were generated using a variety of R functions avail-
able in scatterplot3d (Ligges & Mächler, 2003), pscl (Jackman, 2012), MCMCpack (Martin,
Quinn, & Park, 2011), and MVA (Everitt & Hothorn, 2012). The R code is available on the
accompanying website.
Common Probability Distributions and Their Priors 33
⎛ ⎞
1 ⎜⎜⎜ 1 2⎟
⎟
p(μ | μ0 , σ20 ) ∝ exp ⎜⎝− 2 (μ − μ0 ) ⎟⎟⎠ , (3.5)
σ0 2σ0
where μ0 and σ20 are hyperparameters
Example 3.1: The Gaussian Prior
Figure 3.1 below illustrates the Gaussian distribution with unknown prior
mean and known variance under varying conjugate priors. For each plot, the dark
dashed line is the Gaussian likelihood which remains the same for each plot. The
light dotted line is the Gaussian prior distribution which becomes increasingly
diffused. The gray line is the resulting posterior distribution.
−4 −2 −4 −2
−4 −2 −4 −2
FIGURE 3.1. Gaussian distribution, mean unknown/variance known with varying conju-
gate priors. Note how the posterior distribution begins to align with the distribution of the
data as the prior becomes increasingly flat.
These cases makes quite clear the relationship between the prior distribution
and the posterior distribution. Specifically, the smaller the variance on the prior
distribution of the mean (upper left figure), the closer the posterior matches the
prior distribution. However, in the case of a very flat prior distribution (bottom
right figure), the more the posterior distribution matches the data distribution.
3.1.2 The Uniform Distribution as a Non-Informative Prior

The uniform distribution is a common choice for a non-informative prior, and in
particular with applications to the Gaussian data distribution. Historically, Pierre-
Simon Laplace (1774/1951) articulated the idea of assigning uniform probabilities

in cases where prior information was lacking and referred to this as the principle of
insufficient reason. Thus, if one lacks any prior information favoring one parameter
value over another, then the uniform distribution is a reasonable choice as a prior.
For a continuous random variable Y, we write the uniform distribution as
Y ∼ U(α, β) (3.6)
where α and β, are the lower and upper limits of the uniform distribution, re-
spectively. The uniform distribution has α = 0 ≤ x ≤ β = 1. Under the uniform
distribution,
α+β
E[y] = (3.7)
2
(β − α)2
V[y] = (3.8)
12
Generally speaking, it is useful to incorporate the uniform distribution as a
non-informative prior for a distribution that has bounded support, such as (−1, 1).
As an example of the use of the uniform distribution distribution as a prior,
consider its role in forming the posterior distribution for a Gaussian likelihood.
Again, this would be the case where a researcher lacks prior information regarding
the distribution of the parameter of interest.
Example 3.2: Uniform Prior with Different Bounds
Figure 3.2 below shows the influence of different bounds on the uniform prior
results in different posterior distributions.
−4 −2 −4 −2
−4 −2 −4 −2
FIGURE 3.2. Influence of varying uniform priors on Gaussian distribution.
We see from Figure 3.2 that the effect of the uniform prior on the posterior distri-
bution is dependent on the bounds of the prior. For a prior with relatively narrow
bounds (upper left of figure), this is akin to having a fair amount of information,
and therefore, the prior and posterior roughly match up. However, as in the case
of Figure 3.1, if the uniform prior has very wide bounds indicating virtually no
prior information (lower right figure), the posterior distribution will match the
data distribution.
3.1.3 Mean Known, Variance Unknown: The Inverse-Gamma

Prior
When the mean of the Gaussian distribution is known but the variance is unknown,
the goal is to determine the prior distribution on the variance parameter. The
inverse-gamma (IG) distribution is often, but not exclusively, used as the conjugate
prior for the variance parameter denoted as
σ2 ∼ IG(a, b) (3.9)
where α (> 0) is the shape parameter and β (> 0) is the scale parameter. The
probability density function for the inverse-gamma distribution is written as
p(σ2 ) = (σ2 )(a+1) e−b/σ

2
(3.10)
The expected value and variance of the inverse-gamma distribution are
b
E(σ2 ) = , for a > 0 (3.11)
a−1
and
b2
V(σ2 ) = , for a > 2 (3.12)
(a − 1)2 (a − 2)
respectively.
Example 3.3: The Inverse-Gamma Prior
Figure 3.3 below shows the posterior distribution of the variance for different
inverse-gamma priors that differ only with respect to their shape.
FIGURE 3.3. Inverse-gamma prior for variance of Gaussian distribution. Note that the
“peakedness” of the posterior distribution of the variance is dependent on the shape of the
inverse-gamma prior.
3.1.4 Mean Known, Variance Unknown: The Half-Cauchy Prior

Another distribution that has been advocated as a prior for unknown variances
of the Gaussian distribution, and in particular variance terms for Bayesian
hierarchical models, is the half-Cauchy prior, denoted throughout this book as
C+ . (Gelman, 2006). The C+ distribution is parameterized by a location x0 and
scale β. As it is a C+ distribution that we use for variance terms, the location is set
to 0 and large scale values confer a greater degree of non-informativeness.
Example 3.4: The C+ Prior
Figure 3.4 below shows the C+ distribution for various values of the scale
parameter δ.
FIGURE 3.4. C+ distribution for the variance of the Gaussian distribution.
We see, as expected, that as δ →

− ∞, then p(σ2 ) →
− U(0,∞). We will be employing
+
the C prior for variance parameters throughout many of the examples in this
book.
3.1.5 Jeffreys’ Prior for the Gaussian Distribution

Consider the case of the Gaussian distribution with unknown mean and unknown
variance and treated as independent. The information matrix for the Gaussian
distribution with unknown mean μ and unknown variance σ2 is defined as
1
0
I(μ, σ ) =
2 σ2
(3.13)
0 2σ1 4
Then, following our discussion in Section 2.4.2, Jeffreys’ prior is the square root of
the determinant of the information matrix - viz.,
1/2
σ12 0 1 1
0 1 = ∝ 3 (3.14)
2σ 4 2σ 6 σ
Often we see Jeffreys’ prior for this case written as 1/σ2 . This stems from the
transformation found in Equation (2.18). Namely, the prior 1/σ3 based on p(μ, σ2 )
is the same as having the prior 1/σ2 on p(µ, σ). To see this, note from Equation
(2.18) that
p(µ, σ) = p(µ, σ2 )det(J) (3.15)

1 1
∝ 3 2σ ∝ 2 (3.16)
σ σ
where !
1 0
J= (3.17)
0 2σ
is a Jacobian transformation matrix.
3.2 The Poisson Distribution

Often in the social sciences, the outcome of interest might be the count of the
number of events that occur within a specified period of time. For example, one
may be interested in the number of colds a child catches within a year – modeled
as a function of access to good nutrition or health care. In another instance, interest
might be in the number of new teachers who drop out of the teaching profession
after 5 years, modeled as a function of teacher training and demographic features
of the school. In cases of this sort, a random variable k, representing the number of
occurrences of the event is assumed to follow a Poisson distribution with parameter
θ representing the rate of occurrence of the outcome per unit time. The probability
density function for the Poisson distribution is written as
e−θ θk
p(k | θ) = , k = 0, 1, 2, . . . , θ>0 (3.18)
k!
3.2.1 The Gamma Prior

The conjugate prior for the parameter θ of the Poisson distribution is the gamma
density with scale parameter α and shape parameter β. In this context, the gamma
density is written as
G(θ) = θa−1 e−bθ (3.19)
The posterior density is formed by multiplying equations (3.18) and (3.19), yielding
p(θ | k, a, b) ∝ θk+a−1 e−(b+1) θ (3.20)
Example 3.5: Poisson Distribution with Varying Gamma Density Priors
Figure 3.5 below shows the posterior distribution under the Poisson likelihood
with varying gamma-density priors.
FIGURE 3.5. Poisson distribution with varying gamma-density priors.
Here again we see the manner in which the data distribution moderates the influ-
ence on the prior distribution to obtain a posterior distribution that balances the
data in hand with the prior information we can bring regarding the parameters
of interest. The upper left of Figure 3.5 shows this perhaps most clearly with the
posterior distribution balanced between the prior distribution and the data distri-
bution. And again, in the case of a relatively non-informative Gamma distribution,
the posterior distribution matches up to the likelihood (lower right of Figure 3.5).
3.2.2 Jeffreys’ Prior for the Poisson Distribution

To obtain Jeffreys’ prior for the Poisson rate parameter note that
∂2 k
log p(k | θ) = − 2 (3.21)
∂θ2 θ
Thus the information matrix can be written as

I(θ) = −E log p(k | θ) (3.22)
θ
= (3.23)
θ2
1
= (3.24)
θ
Jeffreys’ prior is then

p j (θ) = θ−1/2 (3.25)
Note that Jeffreys’ prior for the Poisson case is improper in that it cannot be
integrated over [0, ∞). The solution to this problem is to note that the Jeffreys’
prior can be approximated by a G(.5,0). This in turn is not a defined distribution,
thus we can use a G(.5,.00001) to approximate the Jeffreys’ prior for this case.
3.3 The Binomial Distribution

We encountered the binomial distribution in Chapter 2. To reiterate, the binomial
distribution is used to estimate probabilities for successes and failures, where any
given event follows a Bernoulli distribution, such as right/wrong responses to a
test, or agree/disagree responses to a survey item. The probability density function
for the binomial distribution can be written as
!
n y
p(y | θ) = θ (1 − θ)n−y (3.26)
y
where here θ ∈ [0, 1] is the success proportion, y ∈ {1, 2, . . . , n} is the number of
successes, and n is the number of trials. Furthermore, E(y) = nθ and V(y) =
nθ(1 − θ).
3.3.1 The Beta Prior

Perhaps one of the most important distributions in statistics and one that is com-
monly encountered as a prior distribution in Bayesian statistics is the beta distribu-
tion, denoted as B. The probability density function of the beta distribution with
respect to the success proportion parameter θ can be written as
G(α + β) α−1
B(θ | α, β) = θ (1 − θ)β−1 (3.27)
G(α)Γ(β)
where α (> 0) and β (> 0) are shape and scale parameters and G is the gamma
distribution, discussed in the previous section. The mean and variance of the beta
distribution can be written as
α
E(θ | α, β) = (3.28)
α+β
αβ
V(θ | α, β) = (3.29)
(α + β) (α + β + 1)
2
Note also that the U(0, 1) distribution is equivalent to the B(1,1) distribution.
The beta distribution is typically used as the prior distribution when data are
assumed to be generated from the binomial distribution, such as the example in
Section 2.4, because the binomial parameter θ is continuous and ranges between
zero and one.
Example 3.6: The Binomial Likelihood with Varying Beta Priors
Figure 3.6 below shows the posterior distribution under the binomial likeli-
hood with varying beta priors.
FIGURE 3.6. Binomial distribution with varying beta priors. Note that the figure on the
lower right is Jeffreys’ prior for the binomial distribution.
We see that the role of the beta prior on the posterior distribution is quite similar
to the role of the Gaussian prior on the posterior distribution in Figure 3.1. Note
that the B(1, 1) prior distribution in the lower left-hand corner of Figure 3.6 is
equivalent to the U(0, 1) distribution. The lower right-hand display is Jeffreys’
prior, which is discussed next.
3.3.2 Jeffreys’ Prior for the Binomial Distribution

Recall that the uniform distribution is not invariant to transformations, and that
invariance can be achieved by deriving Jeffreys’ prior for the binomial likelihood.
From Equation (3.26) we have
log p(y | θ) ∝ y logθ + (n − y) log(1 − θ) (3.30)
Taking the partial derivative twice of Equation (3.30) we obtain
∂2 log p(y | θ) y n−y

=− 2 − (3.31)
∂θ 2 θ 1 − θ2
The Fisher information matrix is defined as

∂2 log p(y | θ)
!
I(θ) = −E (3.32a)
∂θ2
nθ n − nθ
= 2 + (3.32b)
θ (1 − θ)2
n
= , (3.32c)
θ(1 − θ)
∝ θ−1 (1 − θ)−1 (3.32d)
Jeffreys’ prior is then the square root of Equation (3.32d). That is,
p j = θ−1/2 (1 − θ)−1/2 (3.33)
which is equivalent to the Beta(1/2, 1/2) distribution.

The lower right-hand corner of Figure 3.6 shows Jeffreys’ prior for the binomial
distribution and its impact on the posterior distribution. The unusual shape of
Jeffreys’ prior has a very interesting interpretation. Note that, in general, the data
have the least impact on the posterior distribution when the true value of θ is
0.5 and greater impact on the posterior when θ is closer to 0 or 1. Jeffreys’ prior
compensates for this fact by placing more mass on the extremes of the distribution,
thus balancing out impact of the data on the posterior distribution.
3.4 The Multinomial Distribution

Another very important distribution used in the social sciences is the multino-
mial distribution. As an example, the multinomial distribution can be used to
characterize responses to items on questionnaires where there are more than two
alternatives. The multinomial distribution can also be used in models for categor-
ical latent variables, such as latent class analysis (Clogg, 1995), where individuals
are assigned to one and only one of, say, C latent classes. An example of the
multinomial distribution for latent classes will be discussed in Chapter 8 where
we examine the classification of reading skills in young children. A special case of
the multinomial distribution is the binomial distribution discussed earlier.
The probability density function for the multinomial distribution is written as
p(X1 = x1 , . . . , XC = xc ) (3.34)
 PC
x xC
 x1 !···xC ! π11 · · · πC , when c=1 xc = n
 n!
=

0,
 otherwise
where n is the sample size and π1 , . . . , πC are parameters representing category

proportions. The mean and variance of the multinomial distribution are written
as
E(xc ) = nπc (3.35)

V(xc ) = nπc (1 − πc ) (3.36)
and the covariances among any two categories c and d can be written as
cov(xc , xd ) = −nπc πd (3.37)
3.4.1 The Dirichlet Prior

The conjugate prior distribution for the parameters of the multinomial distribution,
πc , follows a Dirichlet distribution. The Dirichlet distribution is the multivariate
generalization of the beta distribution. The probability density function can be
written as
C
1 Y ac −1
f (π1 , . . . , πC−1 ; a1 , . . . , aC ) = πc (3.38)
B(a)
c=1
where QC
c=1 Γ(ac )
B(a) = PC (3.39)
Γ( c=1 ac )
is the multinomial beta function expressed in terms of a gamma function.
3.4.2 Jeffreys’ Prior for the Multinomial Distribution

As the multinomial distribution is an extension of the binomial distribution,
we find that Jeffreys’ prior for the multinomial distribution is Dirichlet with
αc = 1/2, (c = 1, . . . , C).
Example 3.7: Multinomial Likelihood with Varying Precision on the Dirichlet Prior
Figure 3.7 below shows the the multinomial likelihood and posterior distribu-
tions with varying degrees of precision on the Dirichlet prior.
FIGURE 3.7. Multinomial distribution with varying Dirichlet priors.
As in the other cases, we find that a highly informative Dirichlet prior (top row)
yields a posterior distribution that is relatively precise with a shape that is similar
to that of the prior. For a relatively diffused Dirichlet prior (bottom row), the
posterior more closely resembles the likelihood.
3.5 The Inverse-Wishart Distribution

Many applied problems in the social sciences focus on multivariate outcomes,
and a large array of multivariate statistical models are available to address these
problems. For this book we will focus on factor analysis, which can be considered
a multivariate model for covariance matrices and which will be demonstrated in
Chapter 8.
Denote by y a P-dimensional vector of observed responses for n individuals.
For example, these outcomes could be answers to items asking n students to
answer a P item survey (p = 1, . . . , P) on their school environment For simplicity,
assume that y is generated from a multivariate Gaussian distribution with a P-
dimensional mean vector μ and a P × P covariance matrix Σ. The conjugate prior
for Σ is the inverse-Wishart distribution (IW) denoted as
Σ ∼ IW(Ψ, ν) (3.40)
where Ψ is a scale matrix that is used to situate the IW distribution in the parameter
space and ν > P − 1 are the degrees-of-freedom (df) used to control the degree of
uncertainty about Ψ. The expected value and variance of the IW distribution are
Ψ
E= (3.41)
ν−P−1
and
2ψ2PP
V= (3.42)
(ν − P − 1)2 (ν − P − 3)
From Equation (3.41) and Equation (3.42) we see that the informativeness of
the IW prior is dependent on the scale matrix Ψ and the degrees-of-freedom ν -
that is, the IW prior becomes more informative as either the elements in Ψ become
smaller or ν becomes larger. Finding the balance between these elements is tricky,
and so common practice is to set Ψ = I and vary the values of ν.
3.6 The LKJ Prior for Correlation Matrices

Finally, in many cases interest centers on inferences for models involving corre-
lation matrices. An example would be the estimation of correlations of random
effects in multilevel models. Indeed, factor analysis can also be considered a
multivariate model for correlation matrices. Because correlations are model pa-
rameters, prior distributions are required. One approach could be to use the IW
distribution for the covariance matrix and then standardize within the algorithm
to provide correlations. However, practical experience has shown that the IW
distribution is often difficult to work with, particularly in the context of modern
Bayesian computation tools such as Stan, which will be discussed in Chapter 4.
Addressing this problem requires recognizing that the covariance matrix Σ can be
decomposed into a so-called quadratic form as
Σ = σRσ (3.43)
where σ is a diagonal matrix of standard deviations. Based on work concerning

the generation of random correlation matrices, Lewandowski et al. (2009)
developed the so-called LKJ distribution that can easily be applied to correlation
matrices matrices. A parameter η controls the degree of informativeness across
the correlations. Specifically, η = 1 specifies a uniform distribution over valid
correlations, that is, the correlations are all equally likely. Values of η greater than
1 place less prior mass on extreme correlations.
Example 3.8: The LKJ Prior
Figure 3.8 below shows the probability density plots for the LKJ distribution
with different values of η. Notice that higher values of η place less mass on extreme
correlations.
FIGURE 3.8. LKJ prior distribution with different degrees of informativeness.
3.7 Summary
This chapter presented the most common distributions encountered in the social
sciences along with their conjugate priors and associated Jeffreys’ priors. We
also discussed the LKJ prior for correlation matrices which is quite useful when
specifying inverse-Wishart priors for covariance matrices. The manner in which
the prior and the data distributions balance each other to result in the posterior
distribution is the key point of this chapter. When priors are very precise, the
posterior distribution will have a shape closer to that of the prior. When the prior
distribution is non-informative, the posterior distribution will adopt the shape
of the data distribution. This finding can be deduced from an inspection of the
shrinkage factor given in Equation (2.40). In the next chapter we focus our attention
on the computational machinery for summarizing the posterior distribution.
4
Obtaining and Summarizing the
Posterior Distribution
As stated in the Introduction, the key reason for the increased popularity of
Bayesian methods in the social sciences has been the (re)discovery of numerical
algorithms for estimating posterior distributions of the model parameters given
the data. Prior to these developments, it was virtually impossible to derive sum-
mary measures of the posterior distribution, particularly for complex models with
many parameters. The numerical algorithms that we will describe in this chapter
involve Monte Carlo integration using Markov chains – also referred to as Markov
chain Monte Carlo (MCMC) sampling. These algorithms have a rather long his-
tory, arising out of statistical physics and image analysis (Geman & Geman, 1984;
Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953). For a nice introduction
to the history of MCMC, see Robert and Casella (2011).
For the purposes of this chapter, we will consider three of the most common
algorithms that are available in both open source as well as commercially available
software – the random walk Metropolis-Hastings algorithm, the Gibbs sampler, and
Hamiltonian Monte Carlo. First, however, we will introduce some of the general
features of MCMC. Then, we will turn to a discussion of the individual algorithms,
and finally we will discuss the criteria used to evaluate the quality of an MCMC
sample. A full example using the Hamiltonian Monte Carlo algorithm will be
provided, which also introduces the Stan software package.
4.1 Basic Ideas of Markov Chain Monte Carlo

Sampling
Within the frequentist school of statistics, a number of popular estimation ap-
proaches are available to obtain point estimates and standard errors of model
parameters. Perhaps the most common approaches to parameter estimation are
ordinary least squares and maximum likelihood. The focus of frequentist parameter
estimation is the derivation of point estimates of model parameters that have de-
sirable asymptotic properties such as unbiasedness and efficiency (see, e.g., Silvey,
1975).
47
.
In contrast to maximum likelihood estimation and other estimation methods

within the frequentist paradigm, Bayesian inference focuses on calculating expec-
tations of the posterior distribution, such as the mean and standard deviation. For
very simple problems, this can be handled analytically. However for complex,
high-dimensional problems involving multiple integrals, the task of analytically
obtaining expectations can be virtually impossible. So, rather than attempting
to analytically solve these high-dimensional problems, we can instead use well-
established computational methods to draw samples from a target distribution of
interest (in our case, the posterior distribution) and summarize the distribution
formed by those samples. This is referred to as Monte Carlo integration.
Monte Carlo integration is based on first drawing S samples of the parameters
of interest {θs , s = 1, . . . , S} from the posterior distribution p(θ | y) and approximat-
ing the expectation by
S
1X
E[p(θ | y)] ≈ p(θs | y) (4.1)
S
t=1
Assuming the samples are independent of one another, the law of large numbers
ensures that the approximation in Equation (4.1) will be increasingly accurate as
S increases. Indeed, under independent samples, this process describes ordinary
Monte Carlo sampling. However, an important feature of Monte Carlo integration,
and of particular relevance to Bayesian inference, is that the samples do not have
to be drawn independently. All that is required is that the sequence θs , (s =
1, . . . , S) yields samples that have explored the support of the distribution (Gilks,
Richardson, & Spiegelhalter, 1996a).1
One approach to sampling throughout the support of a distribution while also
relaxing the assumption of independent sampling is through the use of a Markov
chain. Formally, a Markov chain is a sequence of dependent random variables {θs }
θ0 , θ1 , . . . , θs , . . . (4.2)
such that the conditional probability of θs given all of the past variables depends
only on θs−1 – that is, only on the immediate past variable. This conditional
probability for the continuous case is referred to as the transition kernel of the
Markov chain. For discrete random variables this is referred to as the transition
matrix.
The Markov chain has a number of very important properties, not the least
of which is that over a long sequence, the chain will forget its initial state θ0 and
converge to its stationary distribution p(θ | y), which does not depend either on
the number of samples S or on the initial state θ0 . The number of iterations prior
to the stability of the distribution is referred to as the warmup samples. Letting m
1
The support of a distribution is the smallest closed interval (or set in the multivariate case)
where the elements of the interval/set are members of the distribution. Outside the support
of the distribution, the probability of the element is zero. Technically, MCMC algorithms
explore the typical set of a probability distribution. The concept of typical set will be taken
up when discussing Hamiltonian Monte Carlo.
Obtaining and Summarizing the Posterior Distribution 49
represent the initial number of warmup samples, we can obtain an ergodic average
of the posterior distribution p(θ | y) as
S
1 X
p(θ | y) = p(θs | y) (4.3)
S−m
s=m+1
The idea of conducting Monte Carlo sampling through the construction of Markov
chains defines MCMC. The question that we need to address is how to construct
the Markov Chain, that is, how to move from one parameter value to the next.
Three popular algorithms have been developed for this purpose and we take them
up next.
4.2 The Random Walk Metropolis-Hastings

Algorithm
One of the earliest, yet still common, methods for constructing a Markov chain
is referred to as the random walk Metropolis-Hastings algorithm or M-H algorithm
for short (Metropolis et al., 1953). The basic steps of the M-H algorithm are as
follows. First, a starting value θ0 is obtained to begin forming a sequence of
draws θ0 , . . . θs−1 . The next element in the chain, θs , is obtained by first obtaining a
proposal value θ∗ from a so-called proposal distribution (also referred to as a jumping
distribution, which we will denote as q(θ∗ | θs−1 ).2 This proposal distribution
could be, for example, a standard Gaussian distribution with mean zero and some
variance. Second, the algorithm accepts the proposal value with an acceptance
probability
p(θ∗ )q(θs−1 | θ∗ )
( )
p(θ | θ ) = min 1,
s−1 ∗
(4.4)
p(θs−1 )q(θ∗ | θs−1
where the ratio inside the brackets of Equation (4.4) is referred to as the Metropolis-
Hastings ratio. Equation (4.4) can be simplified by noting that the M-H algorithm
uses symmetric proposal distributions implying that q(θ∗ | θs−1 ) = q(θs−1 | θ∗ ) and
therefore the ratio of these distributions are 1.0. Thus, Equation (4.4) can be written
as
p(θ∗ )
( )
p(θ | θ ) = min 1,
∗ s−1
(4.5)
p(θs−1 )
Notice that the numerator of M-H ratio in Equation (4.5) is the probability of the
proposal value and the denominator is the probability of the current value. In
essence, Equation (4.5) states that if the odds ratio p(θ∗ )/p(θs−1 ) > 1.0, then the
probability of acceptance of the proposal value is 1.0 – that is the algorithm will
accept the proposal value with certainty. However, if the odds ratio is less than
1.0, then the algorithm can either move to the next value or stay at the current
value. To decide this, the third step of the algorithm draws a random value from
a U(0, 1) distribution. If the sample values is between 0 and p(θ∗ | θs−1 ), then
2
For ease of notation, we are suppressing conditioning on the data y.
the algorithm moves to the next state, i.e., θs = θ∗ . Otherwise, if it is rejected,

then the algorithm does not move, and θs = θs−1 . A remarkable feature of the
M-H algorithm is that regardless of the proposal distribution, the stationary
distribution of the algorithm will be p(·), which for our purposes is the posterior
distribution. The technical details of this fact can be found in Gilks et al. (1996a).
4.3 The Gibbs Sampler

Another popular algorithm for MCMC is the Gibbs sampler. Consider that the
goal is to obtain the joint posterior distribution of two model parameters – say, θ1
and θ2 , given some data y, written as p(θ1 , θ2 | y). These two model parameters
can be, for example, two regression coefficients from a multiple regression model.
Dropping the conditioning on y for notational simplicity, what is required is to
sample from p(θ1 | θ2 ) and p(θ2 | θ1 ). In the first step, an arbitrary value for θ2
is chosen, say θ02 . We next obtain a sample from p(θ1 | θ02 ). Denote this value as
θ11 . With this new value, we then obtain a sample θ12 from p(θ2 | θ11 ). The Gibbs
algorithm continues to draw samples using previously obtained values until two
long chains of values for both θ1 and θ2 are formed. After discarding the warmup
samples, the remaining samples are then considered to be drawn from the marginal
distributions p(θ1 ) and p(θ2 ).
The formal algorithm can be outlined as follows. For notational clarity, let θ be
a P-dimensional vector of model parameters with elements θ = {θ1 , . . . , θP }. Note
that information regarding θ is contained in the prior distribution p(θ). The Gibbs
sampler begins with an initial set of starting values for the parameters, denoted as
θ0 = (θ01 , . . . , θ0P ). Given this starting point, the Gibbs sampler generates θs from
θs−1 as follows:
1. sample θs1 ∼ p(θ1 | θs−1

2 , θ3 , . . . , θP )
s−1 s−1
2. sample θs2 ∼ p(θ2 | θs1 , θs−1

3 , . . . , θP )
s−1
..
.
P. sample θsP ∼ p(θp | θs1 , θs2 , . . . , θsP−1 )
So, for example, in Step 1, a value for the first parameter θ1 at iteration s = 1 is
drawn from the conditional distribution of θ1 given other parameters with start
values at iteration 0 and the data y. At Step 2, the algorithm draws a value for
the second parameter θ2 at iteration s = 1 from the conditional distribution of θ2
given the value of θ1 drawn in Step 1, the remaining parameters with start values
at iteration zero, and the data. This process continues until ultimately a sequence
of dependent vectors are formed:
θ1 = {θ11 , . . . θ1P }
θ2 = {θ21 , . . . θ2P }
..
.
θS = {θS1 , . . . θSP }
This sequence exhibits the so-called Markov property insofar as θs is condition-

ally independent of {θ01 , . . . θs−2
P
} given θs−1 . Under some general conditions, the
sampling distribution resulting from this sequence will converge to the target
distribution as s → ∞.
4.4 Hamiltonian Monte Carlo

The development of the Metropolis-Hastings and Gibbs sampling algorithms and
their implementation in Bayesian software programs have made it possible to bring
Bayesian statistics into mainstream practice. However, these two algorithms suffer
from a severe practical limitation — namely, as the number of parameters increases,
the number of directions that the algorithm can search increases exponentially
while the M-H acceptance probability in Equation (4.5) decreases. Thus, these
two algorithms can take an unacceptably long time to converge to the posterior
distribution, resulting in a highly inefficient use of computer resources (Hoffman
& Gelman, 2014). An approach for addressing this problem has emerged from the
development of Hamiltonian Monte Carlo (HMC). The mathematics behind HMC
arises from the field of Hamiltonian dynamics which was designed to address
problems in quantum chromodynamics in the context of the orbital dynamics of
fundamental particles. Hamiltonian Monte Carlo underlies the Stan programming
environment, which we will be using for our examples throughout the book. In
what follows, we draw on excellent intuitive introductions to HMC by Betancourt
(2018b, 2019).
The problem associated with the inefficient use of computer resources when
implementing M-H or Gibbs algorithms is a result of the geometry of probability
distributions when the number of parameters increases. In particular, although
the density of a distribution is largest in the neighborhood near the mode, the
volume of that neighborhood decreases and thus has an inconsequential impact
on the calculation of expectations. At the same time, as the number of parameters
increases, the region far away from the mode has greater volume but much smaller
density and thus also contributes negligibly to the calculation of expectations. The
neighborhood between these extremes is called the typical set, which is a subspace
of the support of the distribution. This “Goldilocks zone” represents a region
where the volume and density are just right, and where the mass is sufficient to
produce reasonable expectations. Again, outside of the typical set, the contribution
to the calculation of expectations is inconsequential and thus a waste of computing
resources (Betancourt, 2018b)
The difficulty with the M-H and Gibbs algorithms is that although they will
eventually explore the typical set of a distribution, it might be so slow that com-
puter resources will be expended. This problem is due to the random walk nature
of these algorithms. For example, in the ideal situation for a small number of pa-
rameters, the proposal distribution of M-H algorithm (usually a Gaussian proposal
distribution) will be biased toward the tails of the distribution where the volume
is high while the algorithm will reject proposal values if the density is small. This
then pushes the M-H algorithm toward to typical set as desired. However, as the
number of parameters increase, the volume outside the typical set will dominate
the volume inside the typical set and thus the Markov chain will mostly end up
outside the typical set yielding proposals with low probabilities and hence more
rejections by the algorithm. This results in the Markov chain getting stuck outside
the typical set and thus moving very slowly, as is often observed when employing
M-H in practice. The same problem just described holds for the Gibbs sampler as
well.
The solution to the problem of the Markov chain getting stuck outside the
typical set is to come up with an approach that is capable of making large jumps
across regions of the typical set, such that the typical set is fully explored with-
out the algorithm jumping outside the typical set. This is the goal of HMC.
Specifically, HMC exploits the geometry of the typical set and constructs transi-
tions that “...glide across the typical set towards new, unexplored neighborhoods”
(Betancourt, 2018b, p. 18). To accomplish this controlled sojourn across the typical
set, HMC exploits the correspondence between probabilistic systems and physical
systems. As discussed in Betancourt (2018b), the physical analogy is one of placing
a satellite in a stable orbit around Earth. A balance must be struck between the
momentum of the satellite and the gravity of Earth. Too much momentum and
the satellite will fly off into space. Too little, and the satellite will crash into Earth.
Thus, the key to gliding across the typical set is to carefully choose an auxiliary
momentum parameter to the probabilistic system. This momentum parameter is
essentially a first-order gradient calculated from the log-posterior distribution.
4.4.1 No-U-Turn Sampler (NUTS)

Hamiltonian Monte Carlo yields a much more efficient exploration of the posterior
distribution compared to random-walk M-H and Gibbs. However, HMC requires
user-specified parameters that can still result in a degree of computational inef-
ficiency. These parameters are referred to as the step size ϵ and the number of
so-called leapfrog steps L. If ϵ is too large, then the acceptance rates will be too
low. On the other hand, if ϵ is too small, then computation time is being wasted
because the algorithm is taking unnecessarily small steps. With regard to the
leapfrog steps, if L is too small, then the draws will be too close to each other,
resulting in random walk behavior and slow mixing of the chains. If L is too large,
then computational resources will be wasted because the algorithm will loop back
and repeat its steps (Hoffman & Gelman, 2014). Although ϵ can be adjusted “on
the fly” through the use of adaptive MCMC, deciding on the appropriate value of
L is more difficult, and a poor choice of either parameter can lead to serious com-
putational inefficiency. To solve these problems, the No-U-Turn Sampler (NUTS)
algorithm was developed by Hoffman and Gelman (2014), which is designed to

mimic the dynamics of HMC, while not requiring the user to specify ϵ or L. The
NUTS algorithm is implemented in Stan (Stan Development Team, 2021a).
4.5 Convergence Diagnostics

Given the computational intensity of MCMC, it is absolutely essential for Bayesian
inference that the convergence of the MCMC algorithm be assessed. The impor-
tance of assessing convergence stems from the very nature of MCMC in that it is
designed to converge in distribution rather than to a point estimate. Because there
is not a single adequate assessment of convergence, it is important to inspect a
variety of diagnostics that examine varying aspects of convergence. We will con-
sider five methods of assessing convergence that are used across M-H, Gibbs, and
HMC, and then mention a few others that are specific to HMC. These diagnostics
will be presented for many of the examples used throughout this book.
4.5.1 Trace Plots

Perhaps the most common diagnostic for assessing MCMC convergence is to
examine the so-called trace or history plots produced by plotting the posterior
parameter estimate obtained for each iteration of the chain. Typically, a parameter
will appear to converge if the sample estimates form a tight horizontal band across
the history of the iterations forming the chain. However, using this method as an
assessment for convergence is rather crude since merely viewing a tight trace
plot does not indicate that convergence was actually obtained. As a result, this
method is more likely to be an indicator of non-convergence (Mengersen, Robery,
& Guihenneuc-Jouyax, 1999). For example, if two chains for the same parameter
are sampled from different areas of the target distribution and the estimates over
the history of the chain stay separated, that would be evidence of non-convergence.
Likewise, if a plot shows substantial fluctuation or jumps in the chain, it is likely
that the chain linked to that parameter has not reached convergence.
4.5.2 Posterior Density Plots

Another useful tool for diagnosing any issues with the convergence of the Markov
chain is the posterior density plot. This is the plot of the draws for each parameter
in the model and for each chain. This plot is very important to inspect insofar
as the summary statistics for the parameters of the model are calculated off of
these posterior draws. If we consider, for example, a regression coefficient with
a Gaussian prior, then we would expect the posterior density of that plot to be
normally distributed. Any strong deviations from normality, and in particular
any serious bi-modality in the plot, would suggest issues with convergence of
that parameter, which could possibly be resolved through more iterations, better
choice of priors, or both.
4.5.3 Autocorrelation Plots

In addition to the trace plots, it is important to examine the speed in which the
draws from the posterior distribution achieve independence. As noted earlier,
draws from the posterior distribution using a Markov chain are not, in the be-
ginning, independent of one another. However, we expect that the chain will
eventually “forget” its initial state and converge to a set of independent and sta-
tionary draws from the posterior distribution. One way to determine how quickly
the chain has forgotten its initial state is by inspection of the autocorrelation func-
tion (ACF) plot, defined as follows. Let θs (s = 1, . . . , S) be the sth draw of a
stationary Markov chain. Then the lag-l autocorrelation can be written as
ρl = cor(θs , θs+l ) (4.6)
In general, we expect that the lag-1 autocorrelation will be close to 1.0. However,
we also expect that the components of the Markov chain will become independent
as l increases. Thus, we prefer that the autocorrelation decrease quickly over the
number of iterations. If this is not the case, it is evidence that the chain is “stuck”
and thus not providing a full exploration over the support of the target distribution.
In general, positive autocorrelation will be observed, but in some cases negative
autocorrelation is possible, indicating fast convergence of the estimated value to
the equilibrium value.
4.5.4 Effective Sample Size

Related to the autocorrelation diagnostic is the effective sample size denoted as n eff
in the Stan output, which is an estimate of the number of independent draws
from the posterior distribution. In other words, it is the number of independent
samples with the same estimation power as the T autocorrelated samples. Staying
consistent with Stan notation, the n eff is calculated as
S
n eff = P∞ (4.7)
1+2 s=1 ρs
where S is the total number of samples. Because the samples from the posterior
distribution are not independent, we expect from Equation (4.7) that the n eff will
be smaller than the total number of draws. If the ratio of the effective sample size
to the total number of draws is close to 1.0, this is evidence that the algorithm
has achieved mostly independent draws. Much lower values could be a cause for
concern as it signals that the draws are not independent, but it is important to note
that this ratio is highly dependent on the choice of MCMC algorithm, number of
warmup iterations, and number of post-warmup iterations.
One approach to addressing the problem of autocorrelation and the associated
problem of lower effective sample sizes involves the use of thinning. Suppose
we request that the algorithm take 3,000 draws from the posterior distribution.
This would be the same as having the algorithm take 30,000 draws but to thin the
sample by keeping only every 10th draw. Notice that while this is simply a way
to reduce memory burden, the advantage is that the autocorrelation is typically
reduced, which also results in a higher effective sample size.
4.5.5 Potential Scale Reduction Factor

When implementing an MCMC algorithm, one of the most important diagnostics
is the potential scale reduction factor (see, e.g., Gelman & Rubin, 1992a; Gelman,
1996; Gelman & Rubin, 1992b), often denoted as rhat or R̂. This diagnostic is based
on analysis of variance and is intended to assess convergence among several
parallel chains with varying starting values. Specifically, Gelman and Rubin
(1992a) proposed a method where an overestimate and an underestimate of the
variance of the target distribution is formed. The overestimate of the variance
of the target distribution is measured by the between-chain variance and the
underestimate is measured by the within-chain variance (Gelman, 1996). The idea
is that if the ratio of these two sources of variance is equal to 1, then this is evidence
that the chains have converged. If the R̂ > 1.01, this may be a cause for concern.
Brooks and Gelman (1998) added an adjustment for sampling variability in the
variance estimates and also proposed a multivariate extension of the potential
scale reduction factor which does not include the sampling variability correction.
The R̂ diagnostic is calculated for all chains over all iterations. A problem
with R̂ originally noted by Gelman, Carlin, et al. (2014) and further discussed in
Vehtari, Gelman, Simpson, Carpenter, and Bürkner (2021) is that it sometimes does
not detect non-stationarity, in the sense of the average or variability in the chains
changing over the iteration history. A relatively new version of the potential scale
reduction factor is available in Stan. This version is referred to as the Split R̂, and
is designed to address the problem that the conventional R̂ cannot reliably detect
non-stationarity. The Split R̂ which quantifies the variation of a set of Markov
chains initialized from locations points in parameter space. This is accomplished
by splitting the chain in two and then calculating the Split R̂ on twice as many
chains. So, if one is using four chains with 5,000 iterations per chain, the Split R̂ is
based on eight chains with 2,500 iterations per chain.
4.5.6 Possible Error Messages When Using HMC/NUTS

Because we will be relying primarily on the Stan software program for our exam-
ples, it may be useful to list some of the error messages that one may encounter
when using the HMC/NUTS algorithm.
Divergent Transitions
Much of what has been described so far represents the ideal case of the typical
set being a nice smooth surface that HMC can easily explore. However, in more
complex Bayesian models, particularly Bayesian hierarchical models often applied
in the social and behavioral sciences, the surface of the typical set is not always
smooth. In particular, there are can be regions of the typical set that have very high
curvature. Algorithms such as Metropolis-Hastings might jump over that region,
but the problem is that there is information in that region, and if it is ignored, then
the resulting parameter estimates may be biased. However, to compensate for not
exactly exploring this region, MCMC algorithms will instead get very close to the
boundary of this region and hover there for a long time. This can be seen through
a careful inspection of trace plots where instead of a nice horizontal band, one
sees a sudden jump in the plot followed by a long sequence of iterations at that
jump point. After a while, the algorithm will jump back, and, in fact, over correct.
In principle, if the algorithm were allowed to run forever, these discontinuities
would cancel each other out, and the algorithm would converge to the posterior
distribution under the central limit theorem. However, in finite time, the resulting
estimates will likely be biased. Excellent graphical descriptions of this issue can
be found in Betancourt (2018b)
The difficulty with the problem just describe is that M-H and Gibbs algorithms
do not provide feedback as to the conditions under which they got stuck in the
region of high curvature. With HMC as implemented in Stan, if the algorithm
diverges sharply from the trajectory through the typical set, it will throw an error
message that some number of transitions diverged. In Stan the error might read
1: There were 15 divergent transitions after warmup.
Regardless of the number of divergent transitions, this warning needs to be taken

seriously as it suggests that this region was not explored by the algorithm at all.
Generally, the problem can be diagnosed by checking potential problems with
the data, on whether the model parameters are identified (Betancourt, 2018c),
by checking on the appropriateness of the priors, and by checking and possibly
adjusting the step size. For more information, see Chapter 15 in the Stan Reference
Manual (Stan Development Team, 2021b)
4.6 Summarizing the Posterior Distribution

Once a satisfactory convergence to the posterior distribution is obtained, the next
step is to calculate point estimates and obtain relevant intervals. The expressions
for point estimates and intervals of the posterior distribution come from expres-
sions of conditional distributions generally.
4.6.1 Point Estimates of the Posterior Distribution

Specifically, for the continuous case, the mean of the posterior distribution of θ
given the data y is referred to as the expected a posteriori or EAP estimate and can
be written as
Z+∞
E(θ | y) = θp(θ | y)dθ (4.8)
−∞
Similarly, the variance of posterior distribution of θ given y can be obtained as

V(θ | y) = E[(θ − E[(θ | y])2 | y)

Z+∞
= (θ − E[θ | y])2 p(θ | y)dθ
−∞
Z+∞
= (θ2 − 2θE[θ | y]) + E[θ | y]2 )p(θ | y)dθ
−∞
= E[θ2 | y] − E[θ | y]2 (4.9)
The mean and variance provide two simple summary values of the posterior
distribution. Another common summary measure would be the mode of the pos-
terior distribution – referred to as the maximum a posteriori (MAP) estimate. The
MAP begins with the idea of maximum likelihood estimation. Maximum likeli-
hood estimation obtains the value of θ, say θ̂ML which maximizes the likelihood
function L(θ | y), written succinctly as
θ̂ML = argmax L(θ | y) (4.10)

θ
where argmax stands for the value of the argument for which the function attains
its maximum. In Bayesian inference, however, we treat θ as random and specify
a prior distribution on θ to reflect our uncertainty about θ. By adding the prior
distribution to the problem, we obtain
θ̂MAP = argmax L(θ | y)p(θ) (4.11)

θ
Recalling that p(θ | y) = L(θ | y)p(θ) is the posterior distribution, we see that
Equation (4.11) provides the maximum value of the posterior density of θ given
y, corresponding to the mode of the posterior density.
4.6.2 Interval Summaries of the Posterior Distribution

Along with point summary measures and posterior probabilities, it is usually de-
sirable to provide interval summaries of the posterior distribution. There are two
general approaches to obtaining interval summaries of the posterior distribution.
The first is the so-called posterior probability interval, also referred to as the credible
interval), and the second is the highest posterior density (HPD).
Posterior Probability Intervals

One important consequence of viewing parameters probabilistically concerns the
interpretation of confidence intervals. Recall that the frequentist confidence interval
requires that we imagine a fixed parameter, say the population mean µ. Then, we
imagine an infinite number of repeated samples from the population characterized
by µ. For any given sample, we can obtain the sample mean x̄ and then form a
100(1 − α)% confidence interval. The correct frequentist interpretation is that

100(1−α)% of the confidence intervals formed this way capture the true parameter
µ under the null hypothesis. Notice that from this perspective, the probability that
the parameter is in the interval is either 0 or 1.
In contrast, the Bayesian framework assumes that a parameter has a probabil-
ity distribution. Sampling from the posterior distribution of the model parameters,
we can obtain its quantiles. From the quantiles, we can directly obtain the prob-
ability that a parameter lies within a particular interval. So here, a 95% posterior
probability interval would mean that the probability that the parameter lies in the
interval is 0.95. Notice that this is entirely different from the frequentist interpre-
tation, and arguably aligns with common sense.3
In formal terms, a 100(1 − α)% posterior probability interval for a particular
subset of the parameter space Θ is defined as
Z
1−α= p(θ)dθ (4.12)
C
where C is some critical value.

Symmetric intervals are, of course, not the only interval summaries that can
be obtained from the posterior distribution. Indeed, a major benefit of Bayesian
inference is that any interval of substantive importance can be obtained directly
from the posterior distribution through simple functions available in R. The flexi-
bility available in being able to summarize any aspect of the posterior distribution
admits a much greater degree of nuance in the kinds of research questions one
may ask. We demonstrate how to obtain other posterior intervals of interest in the
example below.
Highest Posterior Density Intervals

The simplicity of the posterior probability interval notwithstanding, it is not the
only way to provide an interval estimate of a parameter. Following arguments set
down by G. Box and Tiao (1973), when considering the posterior distribution of
a parameter θ, there is a substantial part of the region of that distribution where
the density is quite small. It may be reasonable, therefore, to construct an interval
in which every point inside the interval has a higher probability than any point
outside the interval. Such a construction is the highest probability density or HPD
interval. More formally,
Definition 4.6.1. Let p(θ | y) be the posterior probability density function. A

region R of the parameter space θ is called the highest probability density region
of the interval 1 − α if
1. p(θ ∈ R | y) = 1 − α
2. For θ1 ∈ R and θ2 < R, p(θ1 | y) ≥ p(θ2 | y)
3
Interestingly, the Bayesian interpretation is often the one incorrectly ascribed to the fre-
quentist interpretation of the confidence interval.
In words, the first part says that given the data y, the probability that θ is in a
particular region is equal to 1−α, where α is determined ahead of time. The second
part says that for two different values of θ, denoted as θ1 and θ2 , if θ1 is in the
region defined by 1 − α, but θ2 is not, then θ1 has a higher probability than θ2 given
the data. Note that for unimodal and symmetric distributions, such as the uniform
distribution or the Gaussian distribution, the HPD is formed by choosing tails of
equal density. The advantage of the HPD arises when densities are not symmetric
and/or are multi-modal. A multi-modal distribution, for example, could arise as a
consequence of a mixture of two distributions. Following G. Box and Tiao (1973), if
p(θ | y) is not uniform over every region in θ, then the HPD region 1 − α is unique.
Also, if p(θ1 | y) = p(θ2 | y), then these points are included (or excluded) by a 1 − α
HPD region. The opposite is true as well, namely, if p(θ1 | y) , p(θ2 | y) then a
1−α HPD region includes one point but not the other (G. Box & Tiao, 1973, pg 123).
Figure 4.1 below shows the HPDs for a symmetric distribution centered at
zero on the left, and an asymmetric distribution on the right. We see that for the
FIGURE 4.1. Highest posterior density plot for symmetric and nonsymmetric distributions.
symmetric distribution, the 95% HPD aligns with the 95% confidence interval as
well as the posterior probability interval, as expected. Perhaps more importantly,
we see the role of the HPD in the case of the asymmetric distribution on the right.
Such distributions could arise from the mixture of two Gaussian distributions.
Here, the value of the posterior probability interval would be misleading. The
HPD, by contrast, indicates that, due to the asymmetric nature of this particular
distribution, there is very little difference in the probability that the parameter of
interest lies within the 95% or 99% intervals of the highest posterior density.
4.7 Introduction to Stan and Example

Throughout this book we will make use of the Stan probabilistic programming
language (Stan Development Team, 2021a). Stan was named after Stanislaw
Ulam, a Polish-born American mathematician who, along with Nicholas Metropo-
lis, were the early developers of Monte Carlo methods while working on the
atomic bomb in Los Alamos, New Mexico. Stan is an open-source, freedom-
respecting software program and is financially supported by the NUMFOCUS
project (https://numfocus.org/)
It is not the purpose of this book to provide a detailed tutorial on the Stan soft-
ware program. Instead, we will introduce elements of Stan by way of examples.
The Stan User’s Guide can be found at https://mc-stan.org/docs/229/stan-users-
guide/index.html, and a very friendly and helpful forum for questions related to
Stan and its related programs can be found at https://discourse.mc-stan.org/.
Example 4.1: Distribution of Reading Literacy
To introduce some of the basic elements of Stan and its R interface RStan
we begin by estimating the distribution of reading literacy from the United States
sample of the Program for International Student Assessment (PISA) (OECD, 2019).
For this example, we examine the distribution of the first plausible value of the
reading for the U.S. sample. For detailed discussions of the design of the PISA
assessment, see Kaplan and Kuger (2016) and von Davier (2013).
In the following block of code we load RStan, bayesplot, read in the PISA
data, and do some subsetting to extract the first plausible value of the reading
competency assessment.4 We also create an object called data.list that allows us to
rename variables and provide information to be used later. Note that we refer to
the reading score as readscore in the code.
library(rstan)
library(bayesplot)
PISA18Data <-read.csv(file.choose(),header=TRUE)
PISA18.read<- subset(PISA18Data, select=c(PV1READ))
data.list <- with(PISA18.read, list(readscore=PV1READ,
n = nrow(PISA18.read)))
Next we write the Stan code beginning with the command ReadingLit=” which
provides the name for the string of Stan code (other names can be chosen). We
declare the sample size to be an integer denoted by n with a lower value of zero.
Note that Stan uses // for comments.
4
In fact, when loading RStan, the program bayesplot and other required programs will be
loaded in automatically.
ReadingLit ="
data {
int<lower=0> n; // Declare sample size
vector[n] readscore; // Declare reading outcome
}
Notice that we are declaring the lower bound of the mean of the reading distribu-
tion to be 100. This is because we know that the scores cannot fall below 100 by
virtue of how the reading assessment was scaled.
In the following parameters block we declare the names and dimensions of
the parameters of the reading distribution.
parameters {
real<lower=100> mu;
real<lower=0> sigma;
}
In the model block we write out the probability model (likelihood) for the reading
outcome followed by the specification of the prior distributions for the mean and
standard deviation.
model {
readscore ˜ normal(mu, sigma);
mu ˜ normal(500, 10);
sigma ˜ cauchy(0,6);
}
"
The Stan code ends here and we return to R.

For this example, we are assuming that readscore is generated from a Gaussian
distribution with mean µ (mu) and standard deviation σ (sigma). We are eliciting
a Gaussian prior distribution for mu and a C+ prior distribution for sigma. The C+
distribution for the standard deviation is obtained by specifying a lower bound
of zero for sigma in the parameters block. For this example, note that we are
placing a weakly informative prior on the mean of the reading literacy scale. That
is, we are aware from the 2009 cycle of PISA that for the United States, the reading
assessment yielded a mean of 500. For 2018 we set the prior mean to be 500 but
assign a standard deviation for this prior of 10 suggesting a degree of uncertainty
around the prior mean.
Regarding the standard deviation, the PISA 2009 U.S. results show that the
standard deviation of the reading scale was 97. To elicit a prior for the standard
deviation recall that one choice for a prior on the standard deviation is the C+
distribution. An ad hoc approach to obtaining the scale parameter of the C+
distribution is as follows: First, take one-half of the interquartile range of the
outcome variable, which can be calculated in R by typing IQR(variable name)
and multiply it by 0.5. For the PISA 2009 reading scale, the interquartile range
is 151, which serves as a rough estimate of the scale parameter for the Cauchy
distribution. Because we are dealing with the C+ distribution, it is reasonable
to take the square root of the Cauchy scale parameter and then one-half of that
result to yield the scale parameter for the C+ distribution. For our case, the scale
parameter for the C+ is set to 6. Note that this example uses a large sample size so
most choices for the Cauchy scale are fine.
We now define the information needed for the algorithm. We are requesting
four chains (nChains), 30,000 iterations (nIter) with 10 thinning steps (thinSteps).5
The number of warmup iterations (warmupSteps) will be half the number of iter-
ations. Thus, the results will be based on 6,000 draws from posterior distribution.
nChains = 4
nIter= 30000
thinSteps = 10
warmupSteps = floor(nIter/2)
readscore = data.list$readscore
myfitRead = stan(data=data.list,model_code=ReadingLit,chains=nChains,
iter=nIter,warmup=warmupSteps,thin=thinSteps)
Convergence Diagnostics
As noted earlier, it is a matter of necessity to inspect the diagnostics to ensure
convergence before examining the results. We begin with the trace plots using the
command mcmc trace.
color_scheme_set("gray")
stantraceMu <- mcmc_trace(myfitRead,inc_warmup=T, pars="mu")
stantraceMu + ylim(490,510)
stantraceSigma <- mcmc_trace(myfitRead,inc_warmup=T, pars="sigma")

stantraceSigma + ylim(100,115)
5
For this simple problem considerably fewer number of iterations would be needed, but it
was of interest to show diagnostics in the “best case scenario.”
FIGURE 4.2. Trace plots for the mean and standard deviation.
An inspection of the trace plots in Figure 4.2 above reveal a reasonably tight band
across the history of the chains. We do not see any serious separations among the
chains. We conclude from these plots that there is no evidence of non-convergence.
Next we request the posterior density plots for the mean and standard devia-
tion.
stan_dens(myfitRead, pars=c("mu","sigma"), fill="gray")
FIGURE 4.3. Inverse-gamma prior for variance of Gaussian distribution.
We find that the density plots for mu and sigma in Figure 4.3 above have a relatively
smooth bell shape. Next, we request the autocorrelation plots for mu and sigma.
stanac <- stan_ac(myfitRead, pars=c("mu","sigma"),fill="gray")

stanac + ylab("Autocorrelation")
FIGURE 4.4. Autocorrelation plots for mu and sigma.
Notice that the autocorrelations in Figure 4.4 above decrease very quickly to zero as
desired. As small amount of negative autocorrelation is observed, again indicating
a fast convergence to the posterior distribution.
In this next chunk of code we print the results of this simple example shown
below in Table 4.1.
print(myfitRead,pars=c("mu","sigma"))
TABLE 4.1. Posterior results for reading literacy score
Parameter Mean SEMean SD 95% PPI n eff rhat

mu 500.18 0.02 1.52 497.18 - 503.10 5858 1
sigma 108.46 0.01 1.10 106.36 - 110.63 5667 1
Note that the ratio of the effective sample size (n eff) to the total number of post
warmup iterations is very close to 1.0 for both the mean and standard deviation.
This indicates that a very large percentage of the post warmup draws are inde-
pendent. Note also that with 6,000 draws over 4 chains, the Split R̂ (denoted as
Rhat) was calculated using 3,000 draws over 8 chains. The Rhat value is 1.0 for
both the mean and standard deviation which provides yet another indicator of
convergence of the chains to stationary posterior distributions.
The results in Table 4.1 indicate that under the assumptions made by the
chosen priors, the posterior mean 2018 reading score is estimated to be 500.18 and
the standard deviation is estimated to be 108.46. For each parameter, in addition
to the posterior mean, we also obtain the Monte Carlo standard deviation of
the posterior distribution for that parameter. So, the standard deviation of the
posterior distribution of mu is 1.52 and for sigma it is 1.10. The Monte Carlo
standard error for the estimator is obtained by the posterior standard deviation
√
divided by the square root of the effective sample size, SEMean = SD/ n eff. The
smaller the standard error, the closer the estimate is expected to be to the true
value.
Turning to the posterior probability intervals (PPIs) we find that there is a 0.95
probability that the true reading mean is between 497.18 and 503.10. It may be
interesting to note that the actual 2018 mean reading score for U.S. was 505. Thus
the analysis here yielded a very close estimate of the obtained PISA reading score
for U.S., which most likely is due to the very large sample size.
Highest Posterior Density Intervals

We can output the MCMC draws from Stan into an MCMC list and then read that
list into coda (Plummer, Best, Cowles, & Vines, 2006) to obtain the HPDs and other
information, if desired. The HPDs are shown in Table 4.2 below.
library(coda)
newMyFitRead <- As.mcmc.list(myfitRead)
HPDinterval(newMyFitRead, prob=0.95)
TABLE 4.2. Highest posterior density for each chain
Chain Parameter Lower HPD Upper HPD

1 mu 497.24 502.96
sigma 106.36 110.50
2 mu 497.21 503.20
sigma 106.45 110.69
3 mu 497.13 503.17
sigma 106.28 110.53
4 mu 497.29 502.93
sigma 106.37 110.65
The values of the HPDs for both parameters are quite similar across chains and
also similar to the 95% PPI in Table 4.1. This is expected given the mostly smooth
bell shape of the posterior distributions in Figure 4.3.
Other Interval Summaries

As noted earlier, symmetric intervals such as 95% PPI are not the only intervals
that can be obtained from the posterior distribution, nor necessarily the only
interesting interval. Using the mean and standard deviation in Table 4.1, any
interval of substantive interest can be easily obtained. For example, suppose we
wish to know the percentage of the U.S. sample with reading literacy scores above
the OCED international average of 487. A simple function in R to obtain this result
would be
> 1 - pnorm(487, mean = 500.18, sd = 108.46)

[1] 0.5483602
showing that about 55% of the U.S. sample have reading literacy scores above the
OECD international average. We may also wish to know the percentage of the
U.S. sample that lies between the top performing country in 2009 (Shangai-China
with a reading score of 556) and the U.S. average. This can be obtained as
> pnorm(555, mean = 500.18, sd = 108.46,lower.tail=TRUE)

- pnorm(500.18, mean = 500.18, sd = 108.46,,lower.tail=TRUE)
[1] 0.193375
showing that about 19% of the U.S. sample have reading scores between the U.S.
mean of 500.18 and Shangai-China mean of 555. It’s important to emphasize that
obtaining these types of interval summaries is possible due solely to the fact that
we are working directly with the posterior distribution of the model parameters.
4.8 An Alternative Algorithm: Variational Bayes

With the advent of MCMC methods described in earlier sections, and particularly
the HMC/NUTS algorithm implemented in Stan, Bayesian inference can now be
routinely applied to most problems that a substantive researcher might encounter.
However, it is still the case that for large problems, both with respect to sample
size and the number of parameters, MCMC methods can be quite expensive in
terms of time and storage. Over the last 20 years, advances have been made
primarily in the machine learning and computer science fields to move from
intensive MCMC algorithms, including HMC/NUTS, to methods that attempt
to derive an approximate distribution to take the place of the target posterior
distribution of interest and to then summarize this approximate distribution. This
method is known as variational Bayes (VB), a part of a larger area of work referred
to as variational inference (see e.g. Jordan, Ghahramani, Jaakkola, & Saul, 1999).
This section provides a brief intuitive introduction to classical VB as implemented
in Stan. Excellent tutorials on VB geared toward the statistics community can be
found in and Blei, Kucukelbir, and McAuliffe (2017) and Tran, Nguyen, and Dao
(2021). A nice introduction to VB with applications to item response theory in the
psychometric literature can be found in Ulitzsch and Nestler (2022).
The basic idea behind VB is to use optimization rather than approximation to
obtain the target posterior distribution. In the first step, a family of approximate
densities F over the (possibly vector-valued) parameters θ are proposed. These
distributions are controlled by a set of auxillary parameters that we will denote by γ
that are used in the approximation. For example, the approximating distributions
could be Gaussian with parameters γ. Next, the algorithm attempts to find a
specific distribution, say, fγ ∈ F that minimizes the Kullback-Leibler divergence
(KLD) between fγ (θ | y) and the posterior distribution p(θ | y) defined in Section
2.1. Briefly, the KLD between two distributions, fγ (θ | y) and p(θ | y) can be written
as Z !
p(θ | y)
KLD( f | p) = − fγ (θ | y)log dθ (4.13)
fγ (θ | y)
where, in this case, KLD( f | p) is the information lost when fγ (θ | y) is used to
approximate p(θ | y). The objective is to minimize the divergence from fγ (θ | y) to
p(θ | y).
4.8.1 Evidence Lower Bound (ELBO)

For the purposes of conducting VB, it is convenient to write Equation (4.13) as an
expectation – that is,
" #!
fγ (θ | y)
KLD( f | p) = E f log (4.14a)
p(θ | y)
= log[p(y)] + E f (log[ fγ (θ | y)] − E f (log[p(θ | y)] (4.14b)
R
where p(y) = p(y | θ)p(θ)dθ is the marginal likelihood and often referred to as
evidence. Then, noting that p(y) does not contain the parameters θ, we can rewrite
Equation (4.14b) so as to maximize
ELBO = E f (log[p(θ | y)p(θ)]) + E f (log[ f (θ | γ)] (4.15)
where ELBO stands for the evidence lower bound because it is the lower bound for
the evidence p(y).
The next step in VB is to specify the approximating distributions to p(θ | y).
The VB algorithm, as implemented in Stan, offers two choices: so-called mean field
VB and full rank VB. For mean-field VB, first we assume that f (θ) can be factored
into the product of individual parameters
P
Y
f (θ) = f (θp ) (4.16)
p=1
where, by assumption, the algorithm ignores any posterior dependence among

the parameters. It is important to note that even though Equation (4.16) does
not specify any particular distribution, Stan uses a coordinate ascent variational
inference algorithm to find the optimal solution, which is given by
" #!
∗
f p (θp ) ∝ exp E−p log p(θp | θ−p , y) (4.17)
Notice that Equation (4.17) has a form very similar to Gibbs sampling discussed
in Section 4.3 where the expectation is taken with respect to parameters not under
consideration (Ulitzsch & Nestler, 2022). In contrast, full-rank VB relaxes the
assumption of independent parameters, which, while realistic, becomes somewhat
burdensome for complex high-dimensional models. For this discussion, will focus
our attention on mean-field VB.
4.8.2 Variational Bayes Diagnostics

As with MCMC in general, it is important to have a sense as to whether the
approximating distribution fits the target posterior distribution. Yao, Vehtari,
Simpson, and Gelman (2018b) addressed this problem by noting that there are
many situations in which VB is flawed, such as when the family of approximating
distributions cannot capture the shape of the posterior distribution, or the true
posterior distribution is not symmetric, to name two. A diagnostic is then needed
to warn of these potential problems.
Two types of diagnostics are available in the Stan’s implementation of VB.
The first diagnostic is simply to monitor the relative changes in the ELBO as the
algorithm proceeds through the gradient descent. As noted by Yao et al. (2018b),
this issue is actually not solved by monitoring ELBO changes because a constant
change to the distribution would perhaps lead to good convergence but would be
far from the desired approximation.
The second diagnostic involves determining if f ∗ is close enough to p(θ | y) to
be used in its place. An approach developed and advocated by Yao et al. (2018b)
is to use so-called Pareto-smoothed importance sampling (PSIS) discussed in Vehtari,
Simpson, Gelman, Yao, and Gabry (2021).
Pareto-Smoothed Importance Sampling

A feature of MCMC in general is that it is sometimes necessary to make an adjust-
ment to the algorithm when draws are taken from an incorrect posterior distri-
bution. A common adjustment is to use so-called importance weighting. Consider
taking S samples from the posterior distribution p(θ). Following Vehtari, Simpson,
et al. (2021), let h(θ) be the expectation of a function h, such that
Z S
1X
Ih = E(h) = h(θ)p(θ)dθ ≈ h(θs ) (4.18)
S
s=1
The importance sampling estimate for the function h is based on finding a simpler
auxiliary distribution g(θ) and obtaining
PS
s=1 rs h(θs )
PS (4.19)
s=1 rs
where
p(θs )
rs = r(θs ) = (4.20)
g(θs )
are the importance ratios. The idea behind Equations (4.19) and (4.20) is that the
importance ratios serve as weights to correct for the fact that g(θ) is being used
instead of p(θ) to draw inferences about h(θ).
A difficulty with importance sampling is that the importance weights rs can be
noisy, and this can happen when the proposal distribution g(θ) is quite different
from the target distribution p(θ). To remedy this issue, Yao et al. (2018b) advocate
the use of the generalized Pareto distribution, where the density is defined as
 − 1k −1
y−µ

1
+ , k,0


σ
 1 k σ
p(y | µ, σ, k) = 

(4.21)
y−µ

 σ exp σ , k=0


 1
where k is a shape parameter. The idea is that the generalized Pareto distribution
√
is fit to the L largest importance ratios, where L is set at min(S/5, 3 S), where,
again, S is the total number of samples from the posterior distribution. Then, the
L largest importance ratios are replaced by their expected values under the Pareto
distribution. This defines Pareto-smoothed importance sampling (PSIS).
As pointed out by Yao et al. (2018b), the shape parameter k can be used
as a diagnostic for VB. Specifically, the value of k determines the finite sample
convergence rate for PSIS and signals how well the proposal distribution f (θ)
approximates f (θ | y). Studies by Vehtari, Simpson, et al. (2021) have shown
that k̂ < 0.5 suggests fast convergence such that f (θ) can be safely used as an
approximation to p(θ | y). If 0.5 < k̂ < 0.7, then the approximation may be useful,
but it is not perfect. Finally, if k̂ > 0.7, then convergence will be quite slow and
results from VB should not be trusted. Indeed, Stan will generate a warning that
perhaps MCMC should be used instead. We can summarize the approach in Stan
following the steps outlined in Yao et al. (2018b):
1. Run VB to obtain f (θ);
2. Take samples θs (s = 1 . . . S) from f (θ);
3. Calculate importance ratios rs ;
4. Fit the Pareto distribution to the L largest importance ratios;
5. Note the shape parameter k.
6. If k < 0.7 then,
7. The VI approximation f (θ) is close enough to p(θ | y) to be used in its place,
8. else
9. If 0.5 ≤ k < 0.7 then
10. Conclude that the approximation may be useful though not perfect;
11. else
12. If k > 0.7
13. Conclude that the approximation may not be reliable and the convergence
will be noticeably slow.
14. end if
It should be noted that the VB algorithm as implemented in Stan is somewhat
experimental and should be used with caution. Indeed, a recent paper by Ulitzsch
and Nestler (2022) that focused on item response theory models found via a
simulation study that some, but not all, parameters were severely biased when
compared to marginal maximum likelihood and MCMC, and these results were
further corroborated with a case study using data from PISA 2018. They conclude
that Stan’s implementation of VB was not viable for multidimensional IRT models,
and may be further challenged by more complicated models. Ulitzsch and Nestler
(2022) do call for additional research on VB for complex problems insofar as
MCMC is computationally demanding (though somewhat more stable) in complex
large-scale data scenarios. Indeed, research continues on developing fast and
stable VB estimation (see, e.g., Tomasetti, Forbes, & Panagiotelis, 2022; Dang &
Maestrini, 2022). Nevertheless, given these concerns, and the fact that we will
be demonstrating Bayesian methods primarily with large-scale data, we will not
demonstrate VB with the examples presented throughout this book. However, a
situation where VB seems to work well is in addressing the so-called label-switching
problem in Bayesian latent class analysis, and we will demonstrate an application
of VB as implemented in Stan to this problem in Chapter 8.
4.9 Summary
Markov chain Monte Carlo sampling and Hamiltonian Monte Carlo have revolu-
tionized Bayesian statistical practice by making it possible to accurately estimate
the posterior distribution of model parameters. Three algorithms were reviewed
in this chapter — the Metropolis-Hastings algorithm, the Gibbs sampler, and
Hamiltonian Monte Carlo with No-U-Turn sampling. Convergence diagnostics
were presented along with a simple example using the Stan software language.
Monitoring convergence cannot be overstated insofar as MCMC can be computa-
tionally intensive, especially for complex models. We briefly discuss an alternative
algorithm — variational Bayes — which has considerable advantages in terms of
speed, but should be used with caution when inference is of specific concern.
Part II
BAYESIAN
MODEL
BUILDING
5
Bayesian Linear and Generalized
Models
This chapter focuses on Bayesian linear and generalized linear models and sets
the groundwork for later chapters insofar as many, if not most, of the statistical
methodologies used in the social sciences have, at their core, the linear or gener-
alized linear regression model. We begin with the linear model after which we
will examine generalized linear models, including logistic regression, multinomial
logistic regression, Poisson, and negative binomial regression.
5.1 The Bayesian Linear Regression Model

We start with a very simple model linking student background characteristics
to scores on the PISA reading assessment. To begin, let y be an n-dimensional
vector (y1 , y2 , . . . , yn )′ (i = 1, 2, . . . , n) of scores from n students on the PISA reading
assessment, and let X be an n × Q matrix containing Q (q = 1, 2, . . . , Q) background
measures, such as gender, ethnicity, metacognitive reading strategies, etc. Then,
the Gaussian linear regression model can be written as
y = Xβ + u (5.1)
where β is an Q × 1 vector of regression coefficients and where the first column

of β contains an n-dimensional unit vector to capture the intercept term. We
assume that the PISA reading scores are generated from a Gaussian distribution,
specifically,
y ∼ N(Xβ, σ2 I) (5.2)
where I is an identity matrix. Moreover, we assume that the n-dimensional vector
u are disturbance terms assumed to be independently, identically, and Gaussian
distributed, specifically,
u ∼ N(0, σ2 I) (5.3)
The assumptions in Equations (5.2) and (5.3) give rise to the Gaussian linear
regression model with homoskedastic disturbances (see e.g. Fox, 2008).
73
From standard linear regression theory, the probability of the data X and y
given of the model parameters β and σ2 , can be written as
n 1 o
p(X, y | β, σ2 ) = (2πσ2 )−n/2 exp − 2 (y − Xβ)′ (y − Xβ) (5.4)
2σ
Notice that estimation of β hinges on minimizing the residual sum of squares
(y − Xβ)′ (y − Xβ) in the exponent of Equation (5.4). Expanding the residual sum
of squares, we obtain
RSS = (y − Xβ)′ (y − Xβ)

= y′ y − 2β′ X′ y + β′ X′ Xβ (5.5)
To minimize involves taking the derivative of Equation (5.5) with respect to β as
∂ ′
(y y − 2β′ X′ y + β′ X′ Xβ)
∂β
= −2X′ y + 2X′ Xβ (5.6)
setting to zero, and solving for β we obtain
X′ Xβ = X′ y,
β̂ = (X′ X)−1 X′ y (5.7)
where the solution depends on whether the inverse of X′ X exists. Continuing,

maximizing the likelihood with respect to σ2 yields the estimator
(y − Xβ̂)′ (y − Xβ̂)
σ̂2 = (5.8)
n
We recognize that Equation (5.7) is the same as that obtained under ordinary
least squares. However, Equation (5.8) differs from the least squares estimator
[(y − Xβ̂)′ (y − Xβ̂)]/(n − Q)
Recall from Chapter 2 that the first step in a Bayesian analysis is the specifi-
cation of the prior distributions for all model parameters. Recall also that we can
consider three broad classes of prior distributions: (1) non-informative priors that
reflect no prior knowledge or information about the location or shape of the dis-
tribution of the model parameters, (2) weakly-informative priors that place some
constraints on the model parameters for the purposes of aiding computation, but
otherwise mostly encode little information, and (3) informative-conjugate priors
that specify prior knowledge or information about model parameters. Multiply-
ing the likelihood and the conjugate prior yields a posterior distribution that is in
the same family of distributions as the likelihood.
5.1.1 Non-Informative Priors in the Linear Regression Model

To begin, we might choose to assign non-informative priors to the individual
regression coefficients in the vector β allowing the elements to take on values
Bayesian Linear and Generalized Models 75
over a very large range. A Gaussian distribution with a mean of zero and a
large standard deviation would accomplish this goal. Next, we might choose to
assign a C+ prior to σ. From here, the joint posterior distribution of the model
parameters is obtained by multiplying the prior distributions of β and σ2 by the
likelihood give in Equation (5.4).
Example 5.1: Bayesian Linear Regression Model Using Non-Informative Priors
For this example we use data from PISA 2018 to estimate a model relating read-
ing proficiency to a set of background, attitudinal, and reading strategy variables.
The sample comes from 4,838 PISA-eligible students in the United States. Variables
included in this model are FEMALE (female=1, male = 0); economic, social, and
cultural status of the student (ESCS); an index measuring the awareness of using
summarizing strategies in obtaining information from text (METASUM); an index
on the perceived extent to which the teacher gives feedback on students’ strengths
in reading (PERFEED); an index of student enjoyment of reading (JOYREAD); an
index of perceived teacher’s adaptivity of instruction to student and class learning
needs (ADAPTIVITY); a measure of the extent to which the teacher is interested
in every students’ learning (TEACHINT); a scale of students’ mastery-approach
orientation of achievement goals (MASTGOAL); an index of perceived difficulty
in reading (SCREADDIFF); and an index of perceived competency in reading
(SCREADCOMP). The first plausible value of the reading assessment was used
as the dependent variable (READSCORE).1 For more detail on these scales, see
OECD (2018).
The linear model can be written as
READSCORE = β0 + β1 (FEMALE) + β2 (ESCS) + β3 (METASUM)

+ β4 (PERFEED) + β5 (JOYREAD) + β6 (ADAPTIVITY)
+ β7 (TEACHINT) + β8 (MASTGOAL)
+ β9 (SCREADDIFF) + β10 (SCREADCOMP) + u (5.9)
The regression model with non-informative priors was estimated using Stan.
In the following we load the necessary libraries, read in and recode the data, and
then create a necessary data list.
library(rstan)
library(loo)
library(bayesplot)
library(dplyr)
1
Plausible values were developed as a means of obtaining consistent estimates of population
characteristics in large-scale assessment such as PISA where students are administered too
few items to allow precise ability estimates. Plausible values represent random draws
from an empirical proficiency distributional conditioned on the observed responses to the
assessment items and background variables (Mislevy, 1991).
Next we read in the data and do some data cleaning.
PISA18reg <- read.csv(file.choose(),header=T)
PISA18reg1 <- PISA18reg %>% select(PV1READ, Female, ESCS,

METASUM, PERFEED, JOYREAD, MASTGOAL, ADAPTIVITY, TEACHINT,
SCREADDIFF,SCREADCOMP)
PISA18reg2 <- PISA18reg1 %>% mutate(Female=recode(Female,’Male’=1,

’Female’=0))
data.list <- with(PISA18reg2, list(readscore=PV1READ, Female=Female,

ESCS=ESCS, METASUM=METASUM, PERFEED=PERFEED, JOYREAD=JOYREAD,
ADAPTIVITY=ADAPTIVITY, TEACHINT=TEACHINT,
MASTGOAL=MASTGOAL, SCREADDIFF=SCREADDIFF, SCREADCOMP=SCREADCOMP,
n = nrow(PISA18reg2)))
The Stan code comes next in which we specify the dimensions and types of
variables in the data block.
ReadingReg = "
data {
int<lower=0> n;
vector [n] readscore;
vector [n] Female; vector [n] ESCS;
vector [n] METASUM; vector [n] PERFEED;
vector [n] JOYREAD; vector [n] MASTGOAL;
vector [n] ADAPTIVITY; vector [n] TEACHINT;
vector [n] SCREADDIFF; vector [n] SCREADCOMP;
}
In the following parameter block, we specify each parameter in the model and
provide its scale of measurement. Notice that defining sigma as real< lower = 0 >
will ensure that sigma is on the positive real line.
parameters {
real alpha;
real beta1; real beta6;

}
In the following model block, we write out our regression model, provide priors
for each of the model parameters, and specify the likelihood. For this example,
non-informative N(0, 10) priors were specified for the intercept and all regression
coefficients. The standard deviation of the disturbance term was set to be a C+ (0,
6) prior. By having specified the scale of sigma as being positive in the parameters
block, the cauchy command yields a C+ distribution.
model {
real mu[n];
for (i in 1:n)
mu[i] = alpha + beta1*Female[i] + beta2*ESCS[i]
+ beta3*METASUM[i]
+ beta4*PERFEED[i] + beta5*JOYREAD[i] + beta6*MASTGOAL[i]
+ beta7*ADAPTIVITY[i] + beta8*TEACHINT[i]
+ beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i] ;
// Non-informative Priors
alpha ˜ normal(0, 10);
beta1 ˜ normal(0, 10); beta6 ˜ normal(0, 10);
// Likelihood
}
"
For this analysis we use four chains with 2,500 warmup iterations and 2,500 post-
warmup iterations per chain and a thinning interval of 10. Posterior results are
therefore based on a total of 1,000 iterations. The analysis is run using the following
commands:
nChains = 4
nIter= 10000
thinSteps = 10
myfit = stan(data=data.list,model_code=ReadingReg,
chains=nChains,iter=nIter,warmup=warmupSteps,thin=thinSteps)
Convergence Diagnostics: Non-Informative Priors Case

Using the mcmc trace command, Figure 5.1 below illustrates the trace plots for
the model parameters.
mcmc_trace(myfit2,pars=c("alpha","beta1", "beta2", "beta3",

"beta4", "beta5", "beta6",
"beta7", "beta8", "beta9","beta10",
"sigma"))
FIGURE 5.1. Trace plots for regression example under non-informative priors.
For each parameter, we find a nice tight band for each chain and good mixing.
This is not terribly surprising insofar as the data are well behaved and the model
is not particularly complex. We can combine this information with the split-chain
rhat statistics in Table 5.1 below which is based on eight chains and conclude that
there is good mixing and no evidence of non-stationarity of the chains.
Figure 5.2 below uses the stan dens command to display the posterior density
plots for each of the parameters in the model.
stan_dens(myfit2NonInf, fill="gray",pars=c("alpha","beta1",
"beta2", "beta3","beta4", "beta5","beta6","beta7", "beta8","beta9",
"beta10","sigma"))
FIGURE 5.2. Density plots for regression example under non-informative priors.
Not all density plots exhibit a smooth bell-shaped curve as desired, especially for
beta6. A larger number of iterations might help improve the shape, but as we’ll
see, other diagnostics seem to suggest that the shape of these density plots might
not be much of problem.
Using the stan ac command, Figure 5.3 below displays the auto-correlation
plots.
stan_ac(myfit2NonInf, fill="gray",pars=c("alpha","beta1", "beta2",

"beta3","beta4", "beta5", "beta6","beta7",

"beta8", "beta9","beta10","sigma"))
FIGURE 5.3. ACF plots for regression example under non-informative priors.
The ACF plots show that the algorithm achieves mostly independent samples very
quickly. This information can be combined with the effective sample sizes for each
parameter (Table 5.1) to further gauge the extent of independent samples. We find
that the ratio of the n eff to the total number of iterations of 2,000 is never greater
than 3% (beta9) and never less than 93% (sigma) of the total number of iterations.
This further indicates that the samples achieved independence.
Results of Linear Model Under Non-Informative Priors

With the following commands we obtain the results of the analysis as below shown
in Table 5.1.
print(myfit2NonInf,pars=c("alpha","beta1", "beta2",
"beta3", beta4", "beta5", "beta6","beta7", "beta8",
"beta9","beta10","sigma"),probs = c(0.025, 0.975))
TABLE 5.1. Posterior results for reading literacy score under non-informative priors
Variable Parameter Mean SEMean SD 95% PPI n eff rhat

Intercept alpha 483.26 0.04 1.91 479.59 – 487.07 1859 1
FEMALE beta1 18.66 0.06 2.46 13.84 – 23.48 1752 1
ESCS beta2 23.94 0.03 1.30 21.27 – 26.45 1929 1
METASUM beta3 32.78 0.03 1.28 30.36 – 35.13 2173 1
PERFEED beta4 −5.21 0.03 1.49 −7.99 – −2.32 1950 1
JOYREAD beta5 11.15 0.03 1.33 8.44 – 13.69 2035 1
MASTGOAL beta6 −10.81 0.03 1.29 −13.37 – −8.31 2034 1
ADAPTIVITY beta7 2.05 0.04 1.66 −1.17 – 5.30 1983 1
TEACHINT beta8 13.87 0.03 1.51 10.89 – 16.90 2013 1
SCREADDIFF beta9 −9.20 0.03 1.45 −11.93 – −6.27 1904 1
SCREADCOMP beta10 18.20 0.03 1.59 15.16 – 21.21 2063 1
Residual sigma 88.06 0.02 0.93 86.27 – 89.87 1858 1
The results show that for all but beta7 (the effect of ADAPTIVITY on READSCORE)
the posterior probability intervals do not cover zero. As discussed in Chapter 4, it
is possible to assess the probability that the posterior mean for beta7 is greater than
zero, despite it being in the 95% posterior probability interval. Using the pnorm
function in R we find that the probability that the obtained posterior mean (2.05) is
greater than zero is approximately 0.89. Thus, even though the posterior mean is
in the 95% credible interval, the majority of the probability distribution lies to the
right of zero. It may, however, be of interest to see how close the posterior mean
is from zero. This too can be obtained from the posterior distribution using the
pnorm function in R. The value we obtain is 0.39. Content area expertise would be
needed to decide whether this result is substantively important, but it is crucial
to note that this calculation is not possible in the frequentist framework and high-
lights the nuanced analyses that can be conducted within the Bayesian framework.
Example 5.2: Bayesian Linear Regression Model Using PISA 2018 Data:
Informative Priors
The previous example presented Bayesian linear regression with non-

informative priors – the case in which there is no information that can be brought
to bear on the distributions of the model parameters. In that case, we are, in
essence, quantifying our lack of prior information by choosing a prior that allows
all possible values of the parameters to be more or less equally likely. There may
be some cases, however, in which statistical theory as well as prior information
can be brought to bear on the analysis. In this case, we would want to employ
informative priors.
In this section, we demonstrate the application of informative conjugate priors
for the Bayesian linear regression model. As discussed in Chapter 2, conjugate
priors are those that, when multiplied by the likelihood, yield a posterior distri-
bution in the same family as the prior distribution. The degree to which the prior
distribution is informative depends on the values of the hyperparameters of the
prior distribution, and in particular the value of the precision of the hyperparame-
ters. Informative priors for this example come from an analysis of the same model
using data from PISA 2009.
The most sensible conjugate prior distribution for the individual regression
coefficients in β is the Gaussian prior. The argument for using the Gaussian
distribution as the prior for each β lies in the fact that the asymptotic distribution
of regression coefficients is Gaussian (Fox, 2008). Moreover, the Gaussian prior is
a conjugate distribution for the regression coefficients and will result in a Gaussian
posterior distribution for these model parameters.
The conditional prior distribution of the vector β given σ2 can be written as
1

p(β | σ2 ) = (2π)Q/2 | Σ |1/2 exp − (β − B)′ Σ−1 (β − B) (5.10)
2
where Q is the number of variables, B is the vector of mean hyperparameters
assigned to β and Σ = σ2 I is the diagonal matrix of constant disturbance variances.
The conjugate prior for the variance of the disturbance term σ2 is (from Chapter
3), the inverse-gamma distribution, with hyperparameters a and b. We write the
conjugate prior density for σ2 as
2
p(σ2 ) ∝ (σ2 )−(a+1) e−b/σ (5.11)
With the likelihood L(β, σ2 | X, y) defined in Equation (5.4) as well as the prior
distributions p(β | σ2 ) and p(σ2 ), we have the necessary components to obtain the
joint posterior distribution of the model parameters given the data. Specifically,
the joint posterior distribution of the parameters β and σ2 is given as
p(β, σ2 | y, X) ∝ p(X, y | β, σ2 ) × p(β | σ2 ) × p(σ2 ) (5.12)
which, after some algebra, yields,

h 1
p(β, σ2 | y, X) ∝ σ−n−a exp − 2 σˆ2 (n − k) + (β − β̂)′ X′ X(β − β̂)
2σ
i
+ 2b + (β − B) (β − B
′
(5.13)
which has the form of a multivariate Gaussian distribution.
With regard to the Stan code, we leave all of the code intact with the exception
of the block of code in the model statement that specifies the prior distributions.
Here we place either weakly-informative or informative priors on the model pa-
rameters as follows:
// Informative Priors

Convergence Diagnostics: Informative Priors Case

Using the mcmc trace command, Figures 5.4 - 5.6 below show the trace plots,
density plots, and autocorrelation plots, respectively, for the model parameters.
FIGURE 5.4. Trace plots for regression example under informative priors.
FIGURE 5.5. Density plots for regression example under informative priors.
FIGURE 5.6. ACF plots for regression example under informative priors.
Results of Linear Model Under Informative Priors

The convergence diagnostics for this example are virtually identical to the non-
informative case, suggesting adequate convergence to the posterior distribution.
The results are shown below in Table 5.2.
TABLE 5.2. Posterior results for reading literacy score under informative priors
Variable Parameter Mean SEMean SD 95% PPI n eff Rhat

Intercept alpha 499.44 0.04 1.55 496.37 – 502.39 1814 1.00
FEMALE beta1 4.48 0.04 1.62 1.29 – 7.65 1957 1.00
ESCS beta2 26.52 0.03 1.24 24.16 – 28.94 1923 1.00
METASUM beta3 24.30 0.03 1.08 22.14 – 26.46 1921 1.00
PERFEED beta4 −3.94 0.02 1.13 −6.05 – −1.76 2081 1.00
JOYREAD beta5 16.18 0.02 1.25 13.72 – 18.57 1817 1.00
MASTGOAL beta6 −8.17 0.03 1.11 −10.24 – −5.89 1991 1.00
ADAPTIVITY beta7 2.99 0.03 1.27 0.50 – 5.52 2011 1.00
TEACHINT beta8 8.46 0.03 1.20 6.03 – 10.77 1983 1.00
SCREADDIFF beta9 −9.76 0.03 1.15 −12.05 – −7.48 1980 1.00
SCREADCOMP beta10 10.67 0.03 1.20 8.25 – 13.09 1844 1.00
Residual sigma 88.35 0.02 0.91 87.76 – 90.14 1868 1.00
From a substantive point of view, the results have changed in important ways,
suggesting a degree of sensitivity to choice of priors. For example, for the non-
informative priors case in Table 5.1, we found that the posterior probability interval
for the effect of ADAPTIVITY on reading readscore (beta7) covered zero. In Table
5.2, we find that PPI for beta7 does not cover zero, and moreover, the probability
that the posterior mean estimate of 2.59 is greater than zero is 0.49. Other estimates
exhibited relatively large changes. A direct comparison of these two models in
terms of model fit and model selection is deferred to Chapter 6.
5.2 Bayesian Generalized Linear Models

The previous section examined Bayesian linear regression. The conventional form
of the linear regression model assumes that the outcome variable is continuous
and Gaussian distributed, as specified in Equation (5.2). Note that when these
assumptions hold, the conditional distribution of y given the predictors X will be
Gaussian distributed as well. This leads to asymptotically unbiased, consistent,
and efficient estimates of the model parameters.
For many applications of linear regression in the social sciences, the assump-
tion of a continuous and Gaussian distributed outcome is largely untenable or not
even appropriate for the research question at hand. In the former case, the data
might have been generated from a population in which the assumption of Gaus-
sian distributed data might arguably hold, but that the data in the finite sample are
non-Gaussian. In those cases, transformations of the original variables to achieve
a more Gaussian shape might be justified. In the latter case, the population distri-
bution might not be Gaussian at all. For example, an outcome of interest might be
dichotomous, such as a response to the question “Did you vote in the last national
election? (Yes/No)”. Or, an outcome might be in the form of a count – for example,
a response to a question, such as ”How many times this week did you read to your
child?”
In both of these examples, the use of Gaussian theory-based linear regression
would lead to biased and inefficient estimates, and, moreover, there would be
loss in the richness of interpretation if the Gaussian model were used for these
data. Rather, it is best to apply models that explicitly account for the probability
model generating the data. In the example of the dichotomous outcome, the
appropriate probability model would be based on the binomial distribution, and
in the example of the count outcome, the appropriate probability model would be
based on the Poisson distribution. Both distributions (along with their conjugate
priors) were discussed in Chapter 3. Incorporating alternative probability models
in the context of regression has lead to the generalized linear model.
For this section, we describe the so-called link-function, which provides a con-
venient framework for moving among non-linear and linear models (McCullagh
& Nelder, 1989). We then provide an empirical example utilizing Bayesian logistic
regression.
5.2.1 The Link Function

In considering generalized linear models, the notion of the link function has unified
a large number of models and, in fact, has shown that the linear model discussed
in the previous chapter is also a special case of the generalized linear model.
Specifically, the link function is designed to relate (or “link”) a noncontinuous
and/or nonlinear outcome variable yi (i = 1, 2, . . . n) to its linear predictor x′i β.
Take, as an example, the linear model discussed in the previous chapter. Letting
E(yi ) = µi , we have µi = x′i β. Thus, the link function for the linear model is F(µ) = µ
and is referred to as the identity link (McCullagh & Nelder, 1989).
The idea is to determine link functions for noncontinuous and/or nonlinear
outcome variables such as those we model in the context of logistic regression and
Poisson regression. As we will describe in more detail later, the link functions
for the linear model, logistic model, multinomial model, and Poisson model are
shown in Table 5.3 below:
TABLE 5.3. Various link functions
Link function F(µ) Model

µ Linear model
µ
ln( 1−µ ) Logistic regression
µ
ln( 1−µ ) Multinomial regression
ln(µ) Poisson regression
5.3 Bayesian Logistic Regression

To begin our discussion of the logistic regression model, consider an outcome
variable, yi , that can take on values 1/0. As before, this outcome variable could be
a right/wrong answer on a math test, or agree/disagree on an attitude survey item.
For individual i, the outcome is generated from a binomial distribution, namely,
yi | θ ∼ bin(n, θ) (5.14)
Recall that the link function is the logarithm of the odds ratio, that is,
µ
!
ln (5.15)
1−µ
where µi = β0 + β1 x1 + . . . + βq xQ The first step in a Bayesian analysis is to define the

likelihood. In the case of logistic regression, the likelihood of a single observation,
yi , under the binomial model, can be written as
y
p(yi | β0 . . . βQ ) = θi i (1 − θi )1−yi (5.16)
For n independent observations, the likelihood can be written as

n
Y
p(y1 , . . . , yn | β0 . . . βQ ) = p(yi | β0 . . . βQ ) (5.17)
i=1
As usual, the goal is to find the joint posterior distribution of the model
parameters via Bayes’ theorem, that is,
p(β0 . . . βQ | y1 , . . . , yn ) ∝ p(y1 , . . . , yn | β0 . . . βp ) × p(β0 . . . βQ ) (5.18)
From here, different types of priors could be chosen for the intercept β0 and
the slopes β1 , . . . , βQ reflecting a lesser or greater degree of prior informa-
tion on the model parameters. The resulting joint posterior distribution,
p(β0 . . . βQ | y1 , . . . , yn ), will be Gaussian.
Example 5.3 Bayesian Logistic Regression of Subjective Positive Feelings

Using PISA 2018
This example models students’ report of positive feelings. Specifically, PISA

2018 asked students to report how frequently ( “never,” “rarely,” “sometimes,”
“always”) they feel happy, lively, proud, joyful, cheerful, scared, miserable, afraid,
and sad. Three of these positive feelings — happy, joyful, and cheerful —- were
combined to create an index of positive feelings (SWBP). Positive values in this
index mean that the student reported more positive feelings than the average
student across OECD countries. The scale was dichotomized so that sccores above
zero referred to higher than average SWBP while scores below zero reflected lower
than average SWBP (not to be confused with negative feelings).
The dichotomized SWBP variable was regressed on a dichotomous gender
variable (Female), a measure of a scale of competitiveness achievement motivation
(COMPETE), a measure of students’ mastery-approach orientation of achievement

goals (MASTGOAL), the PISA measure of socioeconomic status (ESCS), a measure
of general fear of failure (GFOFAIL), and a measure of students’ sense of belonging
in the school (BELONG). For more information on the scaling of SWBP and the
other variables used in this analysis, see OECD (2018, Ch. 16).
After reading in the data the usual way, we define the metric and ranges of
the variables in the data block and define the parameters in the parameter block.
Notice that we dichotomized the subjective well-being scale and declared it as an
integer with a lower value of zero and an upper value of one.
ReadLogistic ="
data {
int <lower=0> n;
vector[n] Female; vector[n] ESCS;
vector[n] COMPETE; vector[n] GFOFAIL;
vector[n] MASTGOAL; vector[n] BELONG;
int <lower=0,upper=1> SWBP[n];
}
parameters {
real alpha;
}
Next we define the logistic regression model in the model block by modeling
subjective well-being using the bernoulli logit function in Stan. This function has
been found to be more numerically stable when models for dichotomous outcomes
are parameterized into the logit scale (Stan Development Team, 2021a). Note that
non-informative priors are specified for the model parameters in this example.
model {
for (i in 1:n) {
SWBP[i] ˜ bernoulli_logit(alpha + beta1*Female[i] +
beta2*ESCS[i] + beta3*COMPETE[i] + beta4*GFOFAIL[i]
+ beta5*MASTGOAL[i] + beta6*BELONG[i]);
}
}
}
"
In Figures 5.7 - 5.10 below we show the convergence diagnostics for the logistic
regression model based on 2,000 draws from the posterior distribution (10,000
total draws, four chains, and a thinning interval of 10).
FIGURE 5.7. Trace plots for logistic regression example under informative priors.
The trace plots do exhibit some degree “stickiness” but overall the chains seem to
mix well.
FIGURE 5.8. Density plots for logistic regression example under informative priors.
The density plots exhibit a relatively nice bell shape with the possible exception of
beta3.
FIGURE 5.9. ACF plots for logistic regression example under informative priors.
The ACF plots exhibit an immediate transition to almost independent draws. The
posterior results for Bayesian logistic regression are shown below in Table 5.4.
TABLE 5.4. Posterior results for logistic regression of subjective well-being

Variable Parameter Mean SEMean SD 95% PPI n eff Rhat
Intercept alpha 0.02 0.00 0.05 −0.08 – 0.11 2260 1.00
Female beta1 0.01 0.00 0.06 −0.11 – 0.13 1787 1.00
ESCS beta2 0.07 0.00 0.03 0.01 – 0.13 2190 1.00
COMPETE beta3 0.10 0.00 0.03 0.04 – 0.17 1856 1.00
GFOFAIL beta4 −0.28 0.00 0.03 −0.34 – −0.22 2004 1.00
MASTGOAL beta5 0.41 0.00 0.03 0.34 – 0.47 2020 1.00
BELONG beta6 0.72 0.00 0.04 0.64 – 0.80 2092 1.00
We find that zero is in the 95% PPI for Female only. The pnorm functions discussed
in Chapter 4 can be used to examine the probabilities that any of these effects are
greater than zero.
5.4 Bayesian Multinomial Regression

A natural extension of Bayesian logistic regression occurs when the outcome has
more than two categories.
Example 5.4: Bayesian Multinomial Logistic Regression of Perception of

Reading Competency
This example uses a subsample of 5,000 fourth-grade Canadian students who

participated in the Progress in International Reading Literacy Study (PIRLS). For
this analysis, the research question relates to the prediction of students’ self-
assessment in reading, obtained as categorical responses to the question ”I do
well in reading” (DOWELL), and measured on a four-point scale from 1=strongly
Agree to 4=strongly disagree, which for this example is treated as an ordinal cat-
egorical outcome. Predictors used in this analysis were Female (=1); number of
books in the home (bookHome) categorized as (1) 0-10, (2) 11-25, (3) 26-100, (4)
101-200, (5) more than 200; and student motivation to read (motivRead) measured
on a 4-point scale from strongly agree to strongly disagree.
After reading in the data the usual way, we obtain the sample size n and create
a model formula f using the as.formula function in R that will be needed for the
matrix multiplication in the model block. We also define the number of unique
categories of the outcome as C in the code.
n <- nrow(Canadareg2)
f <- as.formula("DOWELL ˜ Female + booksHome + motivRead")
m <- model.matrix(f,Canadareg2)
data.list <- list(n=nrow(Canadareg2),
C=length(unique(Canadareg2[,1])),
P=ncol(m),x=m, Female=Female, booksHome=booksHome,motivRead=motivRead,
DOWELL=as.numeric(Canadareg2[,1]))
The Stan code comes next in which the data block defines the data matrix which is
of order n × Q, where Q is the number of predictors. Note again that C represents
the number of unique categories of the outcome variable and with lower=2 defines
that this variable has at least two categories.
ReadMultiNom <- "

data {
int<lower = 2> C; // The variable has at least two categories
int<lower = 1> Q; // number of predictors
int<lower = 0> n;
int <lower=1, upper=C> DOWELL[n];
matrix[n, Q] x;
}
We next multiple the x matrix by the beta matrix. Note that x is n × Q and beta is
Q × C, and so the resulting x beta matrix is n × C.
parameters {
matrix[k, C] beta;
}
transformed parameters {
matrix[n, C] x_beta= x * beta;
}
In the following model block, we use the to vector utility in Stan which vectorizes
the matrix beta and allows assigning the same prior to all elements of beta.
model {
to_vector(beta) ˜ normal (0,2);
for (i in 1:n) {
DOWELL[i] ˜ categorical_logit(x_beta[i]’);
}
}
Finally, in the generated quantities block, we obtain posterior predicted values

that we can use for posterior predictive checks and leave-one-out cross-validation
described in more detail in Chaper 6.
generated quantities {
int <lower=1, upper=C> DOWELL_rep[n];
vector[n] log_lik;
for (i in 1:n) {
DOWELL_rep[i] = categorical_logit_rng(x_beta[i]’);
log_lik[i] = categorical_logit_lpmf(DOWELL[i] | x_beta[i]’);
}
}
Results for Multinomial Logistic Regression

Table 5.5 below gives the results for the multinomial logistic regression.
TABLE 5.5. Posterior results for multinomial logistic regression on self-reported reading
competency
Parameter Mean SEMean SD 95% PPI n eff Rhat

beta[1,1] −2.14 0.02 1.04 −4.16– −0.11 1867 1
beta[1,2] 1.09 0.02 1.04 −0.96– 3.16 1836 1
beta[1,3] 0.98 0.02 1.04 −1.08 – 3.09 1883 1
beta[1,4] −0.07 0.02 1.06 −2.11 – 1.98 1865 1
beta[2,1] 0.54 0.03 1.04 −1.46 – 2.51 1665 1
beta[2,2] 0.02 0.03 1.04 −1.96 – 2.02 1706 1
beta[2,3] −0.28 0.03 1.05 −2.28 – 1.73 1700 1
beta[2,4] −0.23 0.03 1.05 −2.24 – 1.78 1670 1
beta[3,1] 0.19 0.02 0.99 −1.72 – 2.11 1880 1
beta[3,2] −0.04 0.02 0.99 −1.99 – 1.88 1873 1
beta[3,3] −0.05 0.02 0.99 −1.96 – 1.88 1896 1
beta[3,4] −0.38 0.02 0.99 −2.30 – 1.51 1865 1
beta[4,1] 0.17 0.02 1.01 −1.80 – 2.15 1774 1
beta[4,2] −0.10 0.02 1.01 −2.06 – 1.88 1767 1
beta[4,3] −0.12 0.02 1.01 −2.12 – 1.85 1770 1
beta[4,4] 0.10 0.02 1.01 −1.85 – 2.06 1773 1
The regression coefficients in Table 5.5 are interpreted as follows. The first element
of each beta (e.g. beta[1,1]) represents the variable in the model and the second
element represents the level of the categorical outcome. So, beta[1,1] is the intercept
for the first category of DOWELL. Similarly, beta[2,1] is the Female effect on the
first category of DOWELL, and so on. As these regression coefficients are in the
logit metric, they can be converted into probabilities to reflect the probability that,
say, a male would answer category c using the formula
eβc
p(y = c) = PC (5.19)
c=1 eβc
where βc is the coefficient for the particular categorical outcome and level of x.
For example, recalling that Females are coded 1 and that the first category of
DOWELL is “strongly agree,” we find that the proportion of females endorsing
this category is e(0.54) /[(e(0.54) + e(0.02) + e(−0.28) + e(−0.23) ] = 0.40, indicating that less
than half of the females in this sample strongly agree that they do well in reading.
In contrast, examining the last category, only 19% of the females in this sample
strongly disagree that they do well in reading. If we calculate these probabilities
for all C = 4 categories for females and add them up, they will sum to 1.0.
5.5 Bayesian Poisson Regression

As discussed in Section 3.2, often in the social sciences we wish to model data in the
form of counts. To reiterate, in cases of this sort, let the random variable k represent
the number of occurrences of the event. This random variable is assumed to follow
a Poisson distribution with parameter θ representing the rate per unit time. The
probability density function for the Poisson distribution is written as
e−θ θk
p(k) = , k = 0, 1, 2, . . . , θ>0 (5.20)
k!
where k is the set of whole numbers representing the counts of the events. The
link function for the Poisson regression in Table 5.3 allows us to model the count
data in terms of chosen predictors.
Example 5.5: Bayesian Poisson Regression of Absentee Data
This analysis utilizes data from a school administrator’s study of the atten-
dance behavior of high school juniors at two schools. The outcome is a count
of the number of days absent and the predictors include gender of the student
and standardized test scores in math and language arts. The source of the data
is unknown, but is widely used as an example of regression for count data in a
number of R programs. The data are available on the accompanying website.
We begin by reading in the data and creating the data list for Stan.
poissonex <- read.csv(file.choose(),header=T)

data.list <- with(poissonex, list(daysabs=daysabs, math=math,
langarts=langarts, male=male, N = nrow(poissonex)))
Next we obtain a histogram plot in Figure 5.10 below.

p<-ggplot(poissonex, aes(x=daysabs)) + geom_histogram(color="black",

fill="gray")
p +theme_classic()
FIGURE 5.10. Histogram of number of days absent from school.
We see that the distribution of number of days absent has the typical look of a
Poisson distribution. Next, we set up the Stan code. Notice that in the data block,
we indicate that the variable daysabs is an integer with a lower bound of zero,
which would be the proper metric for a count.
PoissonModel = "
data {
int<lower=0> n;
vector[N] math;
vector[N] langarts;
vector[N] male;
int<lower=0> daysabs[n];
}
Next, in the parameter block, we specify a vector for the regression coefficients
which is an efficient way to provide the same prior to all of the regression coeffi-
cients.
parameters {
vector[4] beta;
}
In this next transformed parameters block we specify the exponential of the ex-
pected value of days absent,
vector[n] mu = exp(beta[1] + beta[2]*math
+ beta[3]*langarts + beta[4]*male);
}
and in the model block next, we specify Poisson likelihood and place standard
Gaussian priors on the regression coefficients.
model {
daysabs ˜ poisson(mu);
beta ˜ std_normal();
}
Finally, in the following generated quantities block, we obtain the information we

need for posterior predictive checking and loo cross-validation.
int<lower=0> daysabs_pred[N] = poisson_rng(mu);
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = poisson_lpmf(daysabs[i] | mu[i]);
}
}
"
For completeness, the next set of code reads in the relevant information to begin
estimation, summarize the results, obtain necessary diagnostic plots, conduct pos-
terior predictive checking, and obtain loo cross-validation measures to be used
for model comparison, which we will show in the following section on negative
binomial regression. Notice that for this analysis, the total number of draws will
be 2,000.
nChains = 4
nIter= 10000
thinSteps = 10
burnInSteps = floor(nIter/2)
daysabs = data.list$daysabs
PoissRegfit = stan(data=data.list,model_code=PoissonModel,
chains=nChains,iter=nIter,warmup=burnInSteps,
thin=thinSteps)
stan_plot(PoissRegfit, pars=c("beta"))
stan_trace(PoissRegfit,inc_warmup=T,pars=c("beta"))
stan_dens(PoissRegfit,pars=c("beta"))
stan_ac(PoissRegfit, pars=c("beta"))
print(PoissRegfit,pars=c("beta"))
log_lik_poisson <- extract_log_lik(PoissRegfit, merge_chains = FALSE)

loo_poisson <- loo(log_lik_poisson)
print(loo_poisson)
All diagnostics indicated good convergence of the algorithm. Table 5.6 below
presents the results.
TABLE 5.6. Posterior results for Poisson regression of days absent from school
Variable Parameter Mean SD 2.5% PPI 95% PPI n eff Rhat
Intercept alpha 2.75 0.07 2.61 2.89 2063 1.00
Math beta1 −0.01 0.00 −0.01 0.00 2170 1.00
Lang. Arts beta2 0.00 −0.01 0.00 0.03 2047 1.00
Male beta3 −0.35 0.05 −0.45 −0.26 1920 1.00
We find that the expected number of days absent are about 30% percent lower
for males than females: exp(−0.35) = 0.70 and 1−0.70 = 0.30). We also find that
a 10-unit increase in language arts scores results in about a 10% decrease in the
expected number of days absent: exp(−.01 × 10) = 0.90) and 1−0.90 = 0.10.
5.6 Bayesian Negative Binomial Regression

Poisson regression is applicable to the situation in which the variance is roughly
equal to the mean. In the case where the variance of the outcome is much greater
than the mean, it is preferable to model the data using the negative binomial
distribution. In our example of the absentee data, the mean number of days
absent is 2.83 and the variance of the number of days absent is 7.18, and so the
variance is about 2.5 times larger than the mean. In what follows, we model days
absent using Bayesian negative binomial regression and defer a direct comparison
of the Poisson regression results to Chapter 6.
Example 5.6: Bayesian Negative Binomial Regression of Absentee Data
The Stan code for Bayesian negative binomial regression is identical to that for
the Poisson regression in Example 5.5 except for a small change to the parameter
block where we specify the dispersion parameters ϕ (phi), add it to the likelihood
in the model block, and use it in the generated quantities block to obtain posterior
predictive checks and cross-validation measures. Note that in the model block we
give phi a C+ (0,1) prior. Other priors for the dispersion term are also possible.
NegBinomModel = "
data {
int<lower=0> n;
vector[N] math;
vector[N] langarts;
vector[N] male;
int<lower=0> daysabs[n];
}
parameters {
vector[4] beta;
real <lower=0> phi; // Dispersion parameter

}
vector[n] mu = exp(beta[1] + beta[2]*math + beta[3]*langarts
+ beta[4]*male);
}
model {
daysabs ˜ neg_binomial_2(mu,phi);
beta ˜ std_normal();
phi ˜ cauchy(0,1);
}
int<lower=0> daysabs_pred[n] = neg_binomial_2_rng(mu,phi);
vector[n] log_lik;
for (i in 1:n) {
log_lik[i] = neg_binomial_2_lpmf(daysabs[i] | mu[i],phi);
}
}
"
The results for the negative binomial regression are shown below in Table 5.7
TABLE 5.7. Posterior results for negative binomial regression of days absent from school
Variable Parameter Mean SD 2.5% PPI 95% PPI n eff Rhat
Intercept alpha 2.64 0.22 2.49 3.08 2152 1.00
Math beta1 −0.01 0.01 −0.02 0.00 2039 1.00
Lang. Arts beta2 −0.01 0.01 −0.02 0.00 1767 1.00
Male beta3 −0.34 0.14 −0.43 −0.08 1974 1.00
Dispersion phi 0.79 0.08 0.66 0.95 1992 1.00
5.7 Summary
This chapter provided the first complete set of analyses of substantive problems
using Bayesian linear and generalized linear regression. The general approach
follows closely the steps of model building, evaluation, and selection within the
frequentist domain. The key differences between the Bayesian and frequentist
approaches to model building are (1) incorporation of prior knowledge as en-
coded into the prior distribution, and (2) interpretation, particularly of posterior
probability intervals
An example of Bayesian generalized linear modeling was also provided for
logistic regression, Poisson regression, and negative binomial regression. From a
Bayesian perspective, as long as the probability model for the outcome variable is
correctly specified, the issue then centers on the specification of the priors.
This chapter did not look at the specification of interaction terms in the linear
or generalized linear model. The omission of interaction terms in our examples
was for simplicity, but adding interaction terms would not yield any additional
complications. Specifically, as with the conventional frequentist regression mod-
els, it might be necessary to center the variables involved in the interaction. In
doing so, this would involve taking care to specify the hyperparameters of the
prior distributions of the coefficients associated with the interactions so as to rep-
resent the appropriate metrics of the association between the interaction terms and
the outcome.
The next chapter takes up the topic of Bayesian approaches to model evaluation
and model comparison, with an initial discussion of the problem of hypothesis
testing in frequentist framework the Bayesian perspective to the problem.
6
Model Evaluation and Comparison
6.1 The Classical Approach to Hypothesis Testing

and Its Limitations
A critically important component of applied statistics is inference and model
building. Indeed, a considerable amount of time is spent in introductory statis-
tics courses laying the foundation for the frequentist perspective on hypothesis
testing, culminating in the Neyman-Pearson approach which can be considered
the conventional approach to null hypothesis significance testing (NHST) in the
social sciences (Neyman & Pearson, 1928). An interesting aspect of the Neyman-
Pearson approach to hypothesis testing is that students (as well as many seasoned
researchers) appear to have a very difficult time grasping its principles. In a
review of the problem of hypothesis testing in the social sciences, Gigerenzer,
Krauss, and Vitouch (2004) argued that much of the difficulty in grasping frequen-
tist hypothesis testing lies in the conflation of Fisherian hypothesis testing and the
Neyman-Pearson approach to hypothesis testing. For interesting discussions on
this problem, see, e.g., Cohen (1994), Gigerenzer et al. (2004), and the volume by
Harlow, Mulaik, and Steiger (1997).
Briefly, Fisher’s early approach to hypothesis testing required specifying only
the null hypothesis — a hypothesis of interest to be nullified. A conventional
significance level is chosen (usually the 5% level). Once the test is conducted, the
result is either significant (p < .05) or it is not (p > .05). If the resulting test is
significant, then the null hypothesis is rejected. However, if the resulting test is
not significant, then no conclusion can be drawn. As Gigerenzer et al. (2004) has
pointed out, Fisher developed a later version of his ideas wherein one only reports
the exact significance level arising from the test and does not place a “significant”
or “non-significant” value label to the result. In other words, one reports, say,
p = .045, but does not label the result as “significant” (Gigerenzer et al., 2004, p.
399).
In contrast to Fisher’s ideas, the approach advocated by Neyman and Pearson
requires that two hypotheses be specified – the null and alternative hypothesis.
By specifying two hypotheses, one can compute a desired trade-off between two
types of decision errors: Type I errors (the probability of rejecting the null when
it is true, denoted as α) and Type II errors (the probability of not rejecting the null
101
when it is false, denoted as β, where 1 − β denotes the power of the test). As Raftery
(1995) has pointed out, the dimensions of this hypothesis are not relevant – that
is, the problem can be as simple as the difference between a treatment group and
a control group, or as complex as a structural equation model. The point remains
that only two hypotheses are of interest in the conventional practice. Moreover,
as Raftery (1995) also notes, it is often far from the case that only two hypotheses
are of interest. This is particularly true in the early stages of a research program,
when a large number of models might be of interest to explore, with equally large
numbers of variables that can be plausibly entertained as relevant to the problem.
The goal is not, typically, the comparison of any one model taken as “true” against
an alternative model. Rather, it is whether the data provide evidence in support
for one of the competing models.
The conflation of Fisherian and Neyman-Pearson hypothesis testing lies in the
use and interpretation of the p-value. In Fisher’s paradigm, the p-value is a matter
of convention with the resulting outcome being based on the data. In contrast, in
the Neyman-Pearson paradigm, α and β are determined prior to the experiment
being conducted and refer to a consideration of the cost of making one or the other
decision error. Indeed, in the Neyman-Pearson approach, the problem is one of
finding a balance between α, power, and sample size. However, even a casual
perusal of the top journals in the social sciences will reveal that this balance is vir-
tually always ignored and α is taken to be 0.05, with the 0.05 level itself being the
result of Fisher’s experience with small agricultural experiments and never taken
to be a universal standard. The point is that the p-value and α are not the same
thing. This confusion is made worse by the fact that statistical software packages
often report a number of p-values that a researcher can choose from after having
conducted the analysis (e.g., .001, .01, .05). This can lead a researcher to set α ahead
of time (perhaps according to an experimental design), but then communicate a
different level of “significance” after running the test. This different level of signif-
icance would have corresponded to a different effect size, sample size, and power,
all of which were not part of the experimenter’s original design. The conventional
practice is even worse than described as evidenced by nonsensical phrases such as
results “trending toward significance,” or “approaching significance,” or “nearly
significant.”1
One could argue that a poor understanding and questionable practice of NHST
is not sufficient as a criticism against its use. However, it has been argued by
Jeffreys (1961, see also; Wagenmakers, 2007) that the statistical logic of the p-
value underlying NHST is fundamentally flawed on its own terms. Consider
that any test statistic t(y) that is a function of the data y (such as the t-test). The
p-value is obtained by p[t(y) | H0 ], as well as that part of the sampling distribution
t(yrep | H0 ) more extreme than the observed t(y). A p-value is obtained from
the distribution of a test statistic over hypothetical replications (i.e., the sampling
distribution). The p-value is the sum or integral over values of the test statistic
that are at least as extreme as the one that is actually observed. In other words,
1
For a perhaps not-so-humorous account of the number of different phrases that have
been used in the literature, see https://mchankins.wordpress.com/2013/04/21/still
-not-significant-2/.
Model Evaluation and Comparison 103
the p-value is the probability of observing data at least as extreme as the data that
were actually observed, computed under the assumption that the null hypothesis
is true. However, data points more extreme were never actually observed and
thus constitutes a violation of the likelihood principle, a foundational principle in
statistics (Birnbaum, 1962) that states that in drawing inferences or decisions about
a parameter after the data are observed, all relevant observational information is
contained in the likelihood function for the observed data, p(y | θ). This issue was
echoed in Kadane (2011, p. 439) who wrote:
Significance testing violates the Likelihood Principle, which states
that, having observed the data, inference must rely only on what
happened, and not on what might have happened but did not.
Kadane (2011, p. 439) goes on to write:
But the probability statement...is a statment about√X̄n before it is ob-
served. After it is observed, the event | X̄n |> 1.96/ n either happened
or did not happened and hence has probability either one or zero.
To be specific, if we observe an effect, say y = 5, then the significance calculations
involve not just y = 5 but also more extreme values, y > 5. But y > 5 was not
observed and it might not even be possible to observe it in reality! To quote Jeffreys
(1961, p. 385),
I have always considered the arguments for the use of P [sic] absurd.
They amount to saying that a hypothesis that may or may not be
true is rejected because a greater departure from the trial value was
improbable; that is, that it has not predicted something that has not
happened. ...This seems a remarkable procedure
6.2 Model Assessment

In many respects, the frequentist and Bayesian goals of model building are the
same. First, a researcher will specify an initial model relying on a lesser or greater
degree of prior theoretical knowledge. In fact, at this first stage, a number of
different models may be specified according to different theories, with the goal
being to choose the “best” model, in some sense of the word. Second, these
models will be fit to data obtained from a sample from some relevant population.
Third, an evaluation of the quality of the models will be undertaken, examining
where each model might deviate from the data, as well as assessing any possible
model violations. At this point, model respecification may come into play. Finally,
depending on the goals of the research, the “best model” will be chosen for some
purpose.
Despite the similarities between the two approaches with regard to the broad
goals of model building, there are important differences. A major difference
between the Bayesian and frequentist goals of model building lies in the model
specification stage. In particular, because the Bayesian perspective explicitly incor-
porates uncertainty regarding model parameters in the form of prior probability
distributions, the first phase of modeling building will require the specification of
a full probability model for the data and the parameters of the model, where the
latter requires the specification of the prior distribution. The notion of model fit,
therefore, implies that the full probability model fits the data. Lack of model fit
may well be due to incorrect specification of likelihood, the prior distribution, or
both.
Arguably, another difference between the Bayesian and frequentist goals of
model building relates to the justification for choosing a particular model among
a set of competing models. Specifically, model building and model choice in the
frequentist domain is based primarily on choosing the model that best fits the data.
This has certainly been the key motivation for model building, respecification,
and model choice in the context of popular methods in the social sciences such as
structural equation modeling (see, e.g., Kaplan, 2009). In the Bayesian domain,
the choice among a set of competing models is based on which model provides the
best posterior predictions. That is, the choice among a set of competing models
should be based on which model will best predict what actually happened.
In this chapter, we examine the components that might be termed a Bayesian
workflow (see, e.g., Bayesian Workflow, 2020; Schad, Betancourt, & Vasishth, 2019;
Depaoli & van de Schoot, 2017) and includes a series of steps that can be taken
by a researcher before and after estimation of the model. We discuss a possible
Bayesian workflow in more detail in Chapter 12. These steps include (1) prior pre-
dictive checking, which can aid in identifying priors that might be serious conflict
with the distribution of the data, and (2) post-estimation steps including model
assessment, model comparison, and model selection. We begin with a discus-
sion of prior predictive checking, which essentially generates data from the prior
distribution before the data are observed. Next, we discuss posterior predictive
checking as a flexible approach to assessing the overall fit of a model as well as
the fit of the model to specific features of the data. Then, we discuss methods
of model comparison including Bayes factors, the related Bayesian information
criterion (BIC), the deviance information criterion (DIC), the widely applicable
information criterion (WAIC), and the leave-one-out cross-validation information
criterion (LOO-IC).
6.2.1 Prior Predictive Checking

As part of the modeling workflow, it is often useful to have a sense of the ap-
propriateness of a prior distribution before observing the data and performing a
full Bayesian analysis. Such a check is useful in the presence of content knowl-
edge wherein clearly absurd prior distributions can be ruled out. This approach
to examining prior distributions is referred to as prior predictive checking and was
originally discussed by Box (1980, 1983) and advocated by Gelman, Simpson, and
Betancourt (2017, see also; van de Schoot et al., 2021; Depaoli & van de Schoot,
2017).
The prior predictive distribution can be written as
Z
p(y ) =
rep
p(yrep | θ)p(θ)dθ (6.1)
which we see is simply generating replications of the data from the prior
distribution.2
Example 6.1: Prior Predictive Checking
In this example, we conduct prior predictive checks on a very simple analysis

of the distribution of the PISA 2018 reading outcome. Using information from the
2009 cycle of PISA, we have a good idea of the distribution, but here we examine
the case where our priors for 2018 are quite inaccurate and the case where they are
based on more accurate knowledge of the 2009 reading outcome distribution.
Setting up prior predictive checking in Stan simply involves generating
MCMC samples from the prior distribution. In other words, the Stan code sim-
ply excludes the likelihood of the data from the model block and so there is no
posterior distribution that is being sampled. The code is as follows.
library(rstan)
library(loo)
library(bayesplot)
library(ggplot2)
## Read in data ##
PISA18Data <-read.csv("˜/desktop/pisa2018.BayesBook.csv",header=TRUE)
PISA18.read<- subset(PISA18Data, select=c(PV1READ))
data.list <- with(PISA18.read, list(readscore=PV1READ,
n = nrow(PISA18.read)))
## Incorrect priors ###
priorpredIncorrect<-"
data {
int n;
}
parameters {
real<lower=0> mu;
}
model {
// priors
mu ˜ normal(400,1);
sigma ˜ normal(100,5);
2
Gelman, Carlin, et al. (2014, p. 7) point out that the marginal distribution p(y) given in
Equation (2.4) is more accurately referred to as the prior predictive distribution as it is not
conditioned on any prior observation and predictive because it is the distribution of an
observable quantity.
vector[n] prior_rep;
// prior predictive distribution
for(i in 1:n) {
prior_rep[i] = normal_rng(mu,sigma);
}
}
"
priorpredIncorrect <- stan(model_code=priorpredIncorrect,
data=data.list,
chains = 4,
iter = 5000)
prior_rep <- as.matrix(priorpredIncorrect,pars="prior_rep")
plot <- ppd_dens_overlay(ypred=prior_rep[1:100,])
plot + lims(x=c(200,800),y=c(0,.009))
## Accurate priors ###
priorpredCorrect<-"
data {
int n;
}
parameters {
real<lower=0> mu;
}
model {
// priors
mu ˜ normal(500,10); // mean based on 2009 PISA US results
sigma ˜ normal(109,10);
}
vector[n] prior_rep;
// prior predictive distribution
for(i in 1:n) {
prior_rep[i] = normal_rng(mu,sigma);
}
"
priorpredCorrect <- stan(model_code=priorpredCorrect,
data=data.list,
chains = 4,
iter = 5000)
prior_rep <- as.matrix(priorpredCorrect,pars="prior_rep")
plot <- ppd_dens_overlay(ypred=prior_rep[1:100,])
plot + lims(x=c(200,800),y=c(0,.009))
Figure 6.1 below presents the plots of the prior predictive distributions.
FIGURE 6.1. Prior predictive checking plots. The plot on left represents the results of an
elicitation without substantive knowledge of previous results from PISA 2009. These priors
would be somewhat incorrect. The plot on the right is based on hypothetical expert opinion
based on results from PISA 2009.
6.2.2 Posterior Predictive Checking

A very natural way of evaluating the overall quality of a model is to examine
how well the model fits the actual data. Examples of such approaches abound in
frequentist statistics, often based on “badness-of-fit” measures. In the context of
Bayesian statistics, the approach to examining how well a model fits the data is
based on the notion of posterior predictive checking (PPC), and the accompanying
posterior predictive p-value (PPp ). An important philosophical defense of the use of
posterior predictive checks can be found in Gelman and Shalizi (2013).
The general idea behind posterior predictive checking is that there should be
little, if any, discrepancy between data generated by the model and the actual
data itself. In essence, posterior predictive checking is a method for assessing
the specification quality of the model. Any deviation between the data generated
from the model and the actual data implies model misspecification.
In the Bayesian context, the approach to examining model fit and specification
utilizes the posterior predictive distribution of replicated data. Following Gelman,
Carlin, et al. (2014), let yrep be data replicated from our current model. That is,
Z
p(yrep | y) = p(yrep | θ)p(θ | y)dθ (6.2a)
Z
= p(yrep | θ)p(y | θ)p(θ)dθ (6.2b)
Notice that the posterior predictive distribution derives from the fact that the
second term on the right-hand side of Equation (6.2b) is simply the posterior
distribution of the model parameters. In words, Equation (6.2b) states that the
distribution of future observations given the present data, p(yrep | y), is equal
to the probability distribution of the future observations given the parameters,
p(yrep | θ), weighted by the posterior distribution of the model parameters. This is
then integrated (or summed) over the model parameters yielding the distribution
of future observations given the present data. Thus, posterior predictive checking
accounts for the uncertainty in the model parameters and the uncertainty in the
data.
As a means of assessing the fit of the model, posterior predictive checking
implies that the replicated data should match the observed data quite closely
if we are to conclude that the model fits the data. One approach to quantifying
model fit in the context of posterior predictive checking is to calculate the posterior
predictive p-value. Denote by T(y) a test statistic based on the data, and let T(yrep )
be the same test statistics but defined for the replicated data. Then, the PPp is
defined to be
PPp = p[T(yrep ) ≥ T(y) | y] (6.3)
Equation (6.3) measures the proportion of test statistics based on replicated data
that equal or exceed that of the test statistics based on the actual data.
For the examples presented in this book, the interpretation of the posterior
predictive p-value is as follows. First, as noted by Gelman (2013), when the
uncertainty in the model parameters is passed to the test statistic T through
the posterior predictive distribution then the resulting p-values will concentrate
around 0.5, under the assumption that the model is true. Therefore, values
closer to 0 or 1 are indicative of a model with poor posterior predictive qualities.
Because the focus is on assessing the predictive quality of a model, the degree of
deviation from 0.5 that would constitute “poor predictive quality” is a matter of
content area judgment and will depend, in part, on expected uses of the model.
Example 6.2: Posterior Predictive Checking
Let’s return to the linear regression example using PISA 2018 data in Section
5.1.1. The code necessary to obtain posterior predictive checks of the overall
quality of the model requires the use of generated quantities block within the Stan
model string.
vector[n] readscore_rep;
for (i in 1:n) {
readscore_rep[i] = normal_rng(alpha + beta1*Female[i]
+ beta2*ESCS[i] + beta3*METASUM[i]
+ beta4*PERFEED[i] + beta5*JOYREAD[i]
+ beta6*MASTGOAL[i] + beta7*ADAPTIVITY[i]
+ beta8*TEACHINT[i]+ beta9*SCREADDIFF[i]
+ beta10*SCREADCOMP[i], sigma);
}
}
A variety of plots can be obtained to judge the quality of the model via PPC. An
important plot is the overlay of the density of the outcome in comparison with the
randomly generated densities from the above code. Below in Figure 6.2 we display
the overlay density plot based on 1,000 randomly generated reading scores from
the model. We observe a small degree of misfit of the model.
readscore_rep <- as.matrix(myfit2,pars="readscore_rep")

ppc_dens_overlay(readscore,readscore_rep[1:1000,])
FIGURE 6.2. Density overlay plot.

We can also obtain histograms that examine predictions of specific expectations

of the reading distribution. For example, suppose we are interested in knowing
how well the model predicts average reading scores. In this case, we generate
means from the 1,000 posterior replications and compare them to the actual mean
of the reading literacy score. We can also obtain the PPp associated with this
comparison as the proportion of randomly generated means that equal or exceed
the mean reading score. An example of a histogram for the posterior prediction of
the mean reading score is displayed Figure 6.3 below.
plot <- ppc_stat(readscore, readscore_rep, stat = "mean")

p <- mean(apply(readscore_rep, 1, mean) > mean(readscore))
plot + annotate("text", x=497, y=50, label = paste("p =", round(p,3)))
FIGURE 6.3. Histogram for prediction of reading mean.
Here we see very poor fit of the model to the mean of the reading distribution. We
can also evaluate our model with respect to the variance of the posterior reading
distribution, as shown below in Figure 6.4.
plot <- ppc_stat(readscore, readscore_rep, stat = "var")

p <- mean(apply(readscore_rep, 1, var) > var(readscore))
plot + annotate("text", x=12700, y=50,
label = paste("p =", round(p,3)))
FIGURE 6.4. Histogram for prediction of reading variance.
Relative to the mean of the distribution, the model does a somewhat better job of
predicting the variance of the reading distribution, however, some degree of misfit
still remains.
The flexibility of posterior predictive checking should not be underestimated.
In addition to examining the posterior predictive performance of the model in
predicting the mean or variance, any quantile of the distribution can be examined.
For example, suppose an investigator is concerned with the ability of the model
to predict the 25th quantile of the reading distribution as this quantile is of policy
relevance in identifying very poor reading performance. Below in Figure 6.5 we
show the posterior predictive performance of the model in predicting the 25th
quantile.
q25 <- function(readscore) quantile(readscore, 0.25)

plot <- ppc_stat(readscore, readscore_rep, stat = "q25")
p <- mean(apply(readscore_rep, 1, q25) > q25(readscore))
plot + annotate("text", x=440, y=150, label = paste("p =", round(p,3)))
FIGURE 6.5. Histogram for prediction of 25th quantile of the reading distribution.
As with the mean and variance of the reading distribution, this model does not do
a very good job in predicting the 25th quantile, and should probably not be used
for this purpose.
6.3 Model Comparison

6.3.1 Bayes Factors
Moving beyond posterior predictive checking, it is also common for investigators
to test a set of competing models with the goal of choosing one model from the set of
models as being the “best” in some sense of the word. A very simple and intuitive
approach to model selection uses so-called Bayes factors (Kass & Raftery, 1995). An
excellent discussion of Bayes factors and the problem of hypothesis testing from
the Bayesian perspective can be found in Raftery (1995). In essence, the Bayes
factor provides a way to quantify the odds that the data favor one hypothesis over
another. A key benefit of Bayes factors is that the models do not have to be nested,
where in this context, a nested model refers to a regression model that contains a
subset of the predictor variables in another regression model.
To motivate Bayes factors, consider two competing models, denoted as M1 and
M2 , that could be nested within a larger space of alternative models. For example,
these could be two regression models with a different number of variables, or
two structural equation models specifying very different directions of mediating
effects. Further, let θ1 and θ2 be the two parameter vectors associated with these
two models. From Bayes’ theorem, the posterior probability that, say, M1 , is the
correct model can be written as
p(y | M1 )p(M1 )
p(M1 | y) = (6.4)
p(y | M1 )p(M1 ) + p(y | M2 )p(M2 )
Notice that p(y | M1 ) does not contain model parameters θ1 , so to obtain p(y | M1 )
requires integrating over θ1 . That is,
Z
p(y | M1 ) = p(y | θ1 , M1 )p(θ1 | M1 )dθ1 (6.5)
where the terms inside the integral are the likelihood and the prior, respectively.
The quantity p(y | M1 ) is referred to as the marginal likelihood for model M1 (Raftery,
1995). A similar expression can be written for M2 .
With these expressions, we can move to the comparison of our two models,
M1 and M2 . The goal is to develop a quantity that expresses the extent to which
a posteriori, the data support M1 over M2 . One quantity could be the posterior
odds of M1 over M2 , expressed as
" #
p(M1 | y) p(y | M1 ) p(M1 )
= × (6.6)
p(M2 | y) p(y | M2 ) p(M2 )
Notice that the first term on the right hand side of Equation (6.6) is the ratio of two
marginal likelihoods. This ratio is referred to as the Bayes factor (BF) for M1 over M2 ,
denoted here as B12 , which is an odds ratio. In line with Kass and Raftery (1995, p.
776), our prior opinion regarding the odds of M1 over M2 , given by p(M1 )/p(M2 ),
is weighted by our consideration of the data, given by p(y | M1 )/p(y | M2 ). This
weighting gives rise to our updated view of evidence provided by the data for
either hypothesis, denoted as p(M1 | y)/p(M2 | y). If the posterior odds is greater
than 1.0, then the evidence supports Model 1. If the posterior odds are less than
1.0, then the evidence supports Model 2.
In practice, there might not be a prior preference for one model over the other
and, of course, this is the default setting in software programs that produce the
BF. In this case, the prior odds are neutral and p(M1 ) = p(M2 ) = 0.5 and the prior
odds ratio equals 1, in which case the posterior odds is equal to the BF. The inter-
pretation of the BF as the ratio of the integrated likelihoods is straightforward —
it expresses the relative evidence in the data for support of one model over another.
Example 6.3: Comparison of Reading Literacy Models Using Bayes Factors
For this example, we compare four separate models of reading literacy to a

baseline, intercept-only model. The five models under comparison are
• Model 0: Baseline intercept-only
• Model 1: FEMALE, ESCS, HOMEPOS, ICTRES
• Model 2: JOYREAD, PISADIFF, SCREADCOMP, SCREADDIF
• Model 3: METASUM, GFOFAIL, MASTGOAL, SWBP, WORKMAST,
ADAPTIVITY, COMPETE
• Model 4: PERFEED,TEACHINT, BELONG

Our analyses use the R software program bayestestR (Makowski, Ben-Shachar, &
Lüdecke, 2019). The code for this example is given as
PISAdata <- read.csv(file.choose(),header=T)

PISAdata <- subset(PISAdata, select=c(PV1READ, Female, ESCS,
HOMEPOS,ICTRES,JOYREAD,PISADIFF,
SCREADDIFF,SCREADCOMP,METASUM,
GFOFAIL,MASTGOAL,SWBP,WORKMAST,
ADAPTVITY,COMPETE,PERFEED,TEACHINT,BELONG))
PISAdata$Female <- ifelse(PISAdata$Female=="Female",1,0)
PISAdata<-na.omit(PISAdata)
m0 <- brm(PV1READ˜ 1, # intercept only model

data = PISAdata,save_all_pars=TRUE)
m1 <- brm(PV1READ ˜ Female+ESCS+HOMEPOS+ICTRES,
data = PISAdata, save_all_pars=TRUE)
m2 <- brm(PV1READ ˜ JOYREAD+PISADIFF+SCREADCOMP+SCREADDIFF,
m3 <- brm(PV1READ ˜ METASUM+GFOFAIL+MASTGOAL+SWBP
+WORKMAST+ADAPTIVITY+COMPETE,
m4 <- brm(PV1READ ˜ PERFEED+TEACHINT+BELONG,
BFmodCompare <- bayesfactor_models(m1, m2, m3, m4, denominator = m0)

BFmodCompare
The results of this analysis are given below in Table 6.1.
TABLE 6.1. Bayes factor results for reading literacy example
Model BF
1 Female + ESCS 3.269e+146
2 JOYREAD + SCREADCOMP + SCREADDIFF 4.168e+208
3 METASUM + MASTGOAL + ADAPTIVITY 2.486e+212
4 PERFEED + TEACHINT 2.946e+42
Recalling that the intercept-only model is the baseline for comparison, we find
that the Model 3, which involve meta-cognitive strategies that a student might use
when reading a passage, is preferred by the data.
An Aside: The Bayesian Information Criterion (BIC)

A popular measure for model selection used in both frequentist and Bayesian
applications is based on an approximation of the log of the marginal likelihood
and is referred to as the Bayesian information criterion (BIC), also referred to as the
Schwarz criterion (Schwarz, 1978). A detailed mathematical derivation for the BIC
can be found in Raftery (1995), who also examines generalizations of the BIC to a
broad class of statistical models.
Consider again our two models, M1 and M2 , with M2 nested in M1 . Here
again, M1 could represent a set of predictors in a regression model and M2 could
be a subset of those predictors. Or, M1 could be an initially specified structural
equation model and M2 could be the same model with one path deleted. Under
conditions where there is little prior information, Raftery (1995) has shown that
an approximation of the Bayes factor can be written as
2 log B12 = χ212 − d f12 log n (6.7)

where χ212 is the conventional likelihood ratio chi-square obtained from testing M1
against M2 and d f12 is the difference in the degrees of freedom associated with
each test. From here, Raftery (1995) shows that the BIC can be obtained as
BIC = −2 log(θ̂ | y) + Q log(n) (6.8)

where −2 log θ̂ | y describes model fit while Q log(n) is a penalty for model com-
plexity, where Q represents the number of variables in the model and n is the
sample size. Notice that the priors are not specified in the calculation of BIC in
Equation (6.8).
As with Bayes factors, the BIC is used for model comparison. Specifically, the
difference between two BIC measures comparing, say M1 to M2 , can be written as
∆(BIC12 ) = BIC(M1 ) − BIC(M2 )

1
= log(θ̂1 | y) − log(θ̂2 | y) − (Q1 − Q2 ) log(n) (6.9)
2
Following Raftery (1995, p. 777), rules of thumb have been developed to assess the
quality of the evidence in the data favoring one model over another using Bayes
factors and the comparison of BIC values from two competing models. These are
displayed below in Table 6.2
TABLE 6.2. Rules of thumb for the BIC and Bayes factors with M1 as the reference model
BIC difference Bayes factor Evidence against M2

0 to 2 1 to 3 Weak
2 to 6 3 to 20 Positive
6 to 10 20 to 150 Strong
> 10 > 150 Very strong
As with all rules of thumb, those in Table 6.2 should be used with caution and not
without content area knowledge to support decisions about model selection.
6.3.2 Criticisms of Bayes Factors and the BIC

Our discussion of Bayes factors and the BIC would not be complete without ac-
knowledging some concerns that have been raised in the literature about these two
methods. It is beyond the scope of this section to detail all of the concerns regarding
Bayes factors and the BIC, but a few specific concerns should be provided.
Regarding Bayes factors, an important critique was put forward early on by
Gelman and Rubin (1995) in response to a lead paper by Raftery (1995) on the
problem of model selection. Two concerns were raised by Gelman and Rubin
(1995) on the general problem of model selection via Bayes factors. First, they
argued that it is not possible to select a model for some purpose without stating
what that purpose might be. To be specific, the Bayes factor only provides a
selection of a model that has more evidence in its favor (as measured by its
likelihood) compared to some other model. But, the chosen model could not
only fit the data rather badly, particularly in terms of predictive performance, but
could easily be supplanted by another model that fits the data quite well from a
predictive point of view. Second, posterior model probabilities are blind regarding
the intended purposes of the model. This, Gelman and Rubin (1995) argued,
deprives the analyst from using other Bayesian methods to deeply examine the
performance of a model in light of its purpose – advocating instead the use of
posterior predictive distributions through posterior predictive checking that we
discussed in Section 6.2.2. In anticipation of Chapter 11, it is important to note
that Gelman and Rubin’s (1995) critique of posterior model probabilities for model
selection extends to Bayesian model averaging as well.
An alternative view of the use of Bayes factors, though not in complete dis-
agreement with Gelman and Rubin (1995), was put forth by Morey, Romeijn, and
Rouder (2016) who focused their discussion of Bayes factors from the viewpoint
of the meaning of evidence. According to Morey et al. (2016), evidence should be
viewed as “something that should change the credibility of a claim in a reasonable
person’s mind” (p. 6). From this point of view, data serve the purpose of mediat-
ing the change in the probability assessments of our hypotheses before (prior) and
after (posterior) seeing the data. Morey et al. (2016) argued that the Bayes factor
indeed serves this purpose.
In light of the general critique by Gelman and Rubin (1995) regarding Bayes
factors for model selection, Morey et al. (2016) do not necessarily disagree. They
point out that the Bayes factor is optimally used for the purposes of quantifying
changes in belief when prior belief encounters evidence in the form of data. How
one uses the evidence, they argued, is a separate issue. Indeed, they agree with
Gelman and Rubin (1995) that using Bayes factors for model selection is prob-
lematic when the model’s intended use is not specified. Rather, they advocate
Bayes factors solely for model comparison acknowledging the tentative nature of
scientific conclusions.
Turning our attention to the BIC, it is important to note that the BIC is, ar-
guably, neither Bayesian nor an information criterion. First, the BIC does not
involve draws from a posterior distribution, but rather requires the evaluation of
a likelihood, and thus is not classically Bayesian. Second, almost all information
criteria are derived from the Kullback-Leibler divergence and have their roots in
information theory. The BIC is not derived from the Kullback-Leibler divergence
(see Section 4.8).
Aside from the misnomer, there are more important criticisms of the BIC
when used for model selection (see Weakliem, 1999). First, recall that the BIC
is an approximation of the Bayes factor defined in Equation (6.6), and note that
the Bayes factor requires that the prior odds of, say, M1 against M2 , written as
p(M1 )/p(M2 ), be specified. In practice, and by default, this ratio is set to 1.0, which
itself might not be what a researcher truly believes about the prior odds of the
models. For example, consider the case where M1 is a model that has had good
empirical support in the past, and M2 is a competitor model. Equal prior odds in
this case might not make sense. If, however, we wish to account for differences
in our knowledge and experience with these models, then each model will also
have different prior distributions on the model parameters. In large samples, these
priors might not have much of an effect alone, but the Bayes factor can be quite
sensitive to these prior distributions.
The importance of this point is that the BIC implies specific (and identical)
distributions for each model’s set of parameters and these priors might not be what
a researcher truly believes about the two models taken separately. Specifically,
the BIC assumes so-called unit information priors (UIP) for the model parameters
(Raftery, 1995). We take up unit information priors in Chapter 11, but suffice to
say that the UIP is a data-dependent prior that is Gaussian with a mean set at the
maximum likelihood estimate and precision equal to the information provided by
one observation. Thus, although there can be an infinite number of Bayes factors
corresponding to an infinite number of prior beliefs, there is only one Bayes factor
implied by the use of the BIC, and this might not accurately describe a researcher’s
prior belief. At the very least, researchers who use of the BIC for model selection
should interpret their findings with caution.
6.3.3 The Deviance Information Criterion (DIC)

The Bayes factor discussed above is simply used to choose a model among a set
of models that has the highest likelihood of having generated the data. It does
not, however, provide an evaluation of the predictive fit of the model — that is to
say, how well the model would do if applied to another comparable set of data.
The evaluation of a model based on predictive fit is critical if the model is used for
forecasts, but it is also useful when simply comparing or selecting models.3
A popular measure of predictive accuracy is the deviance information criterion
(DIC) (Spiegelhalter, Best, Carlin, & van der Linde, 2002), which is the Bayesian
counterpart of Akaike information criterion (AIC) (Akaike, 1973) often used in fre-
quentist model selection. The AIC can be calculated as
AIC = −2 log p(y | θ̂mle ) − 2P (6.10)

3
Arguably, it might be better not to simply select a model, but rather to either develop and
critique a model along the lines of its specific purposes using posterior predictive checks
(see, e.g., Gelman & Rubin, 1995; Gelman & Shalizi, 2013) or average across models (Kaplan,
2021). We discuss model averaging in Chapter 11.
where P is the number of parameters and where 2P serves as a penalty term for
overfitting. Note, however, that the AIC is based on a plug-in point estimate
of θ̂ obtained from maximum likelihood estimation of the model and not from
posterior samples. An advantage of the DIC over the AIC is that the DIC is
based on posterior samples and is designed to choose a model that gives the
smallest expected Kullback-Leibler divergence between the true data generating
process (DGP) and a predictive distribution (see Section 4.8 for a discussion of the
Kullback-Leibler divergence).
The DIC can be defined as
DIC = log p(y | θ̂Bayes ) − PDIC (6.11)
where θ̂Bayes is the posterior mean E(θ | y) derived from the MCMC draws, and
PDIC is the effective number of parameters obtained as

PDIC = 2 log p(y | θ̂Bayes ) − E log p(y | θ) (6.12)
where E(log (p(y | θ)) is obtained over the T draws from the posterior distribution.
For calculation purposes, Equation (6.12) can be written as
T
1X
DIC = 2 log p(y | θ̂Bayes ) −
Pd log (p(y | θt ) (6.13)
T
t=1
Notice that Equations (6.12) and (6.13) are essentially expressions of variation
such that the more uncertainty there is in the parameters the greater the overall
penalization against the fit as measured by log p(y | θ̂Bayes ), particularly when
compared to the penalty term for the AIC.
The DIC in Equation (6.11) is presented here in terms of log predictive density,
but it is often expressed in terms of deviance as
DIC = −2 log p(y | θ̂Bayes ) + 2Pd

DIC (6.14)
Finally, the maximum posterior density will be obtained when the posterior mean
coincides with the maximum a posterior (MAP) value. The DIC is not available in
Stan but is available in the R program rjags (Plummer, 2022)
6.3.4 Widely Applicable Information Criterion (WAIC)

Although the DIC is based on the posterior distribution, it is still not fully Bayesian
insofar as it conditions on a plug-in point estimate θ̂Bayes (albeit Bayesian) rather
than averaging over the posterior distribution to account for uncertainty in the pa-
rameter estimates. Recall that this is true for the AIC as well. As noted by Gelman,
Hwang, and Vehtari (2014), the AIC and DIC are actually gauging the predictive
performance of the plug-in estimators θ̂mle and θ̂Bayes , respectively, rather than the
actual predictive density. What is needed therefore is a measure based on the
pointwise predictive density for each observation in the dataset as these are the
quantities of actual concern in prediction. This leads us to the widely applicable
information criterion.
The widely applicable information criterion (WAIC) was developed by

Watanabe (2010) as a fully Bayesian counterpart to the AIC. To motivate the WAIC
we need to introduce some terminology. Following Gelman, Hwang, and Vehtari
(2014) and Vehtari, Gelman, and Gabry (2017), note that the ideal situation would
be to obtain a measure of out-of-sample predictive performance for new data.
Denote by y the data in hand, ỹ the new data, and a new data point as ỹi . Then,
the expected log predictive density (elpd) for the new dataset of n observations
can be defined as
Xn Z
elpdWAIC = pt ( ỹi )log p( ỹi | y)d ỹi (6.15)
i=1
where pt ( ỹi ) represents the distribution of the true but unknown data-generating
process for the predicted values ỹi . For computation purposes we need the log
pointwise predictive density, defined as
n
X n
X Z
lpdWAIC = log p(yi | y) = log p(yi | θ)p(θ | y)dθ (6.16)
i=1 i=1
To calculate this predictive density, we use draws from the posterior distribution,
yielding
n S !
X 1X
lpd
d
WAIC = log p(yi | θT
) (6.17)
S t=s
i=1
With these definitions in hand, the WAIC uses Equation (6.17) and, as with the
DIC, adds a term to correct for overfitting. Specifically, this correction factor can
be written as
Xn
!
PWAIC = 2 log E p(yi | θ) − E log p(yi | θ) (6.18)
i=1
which can also be computed from the posterior draws as

n S ! S !
X 1X 1X
WAIC = 2 p(yi | θ ) − log p(yi | θ )
s S
P[ log (6.19)
S S
i=1 s=1 s=1
As with the AIC and DIC, the WAIC can be written in deviance form as
WAIC = −2(lpdWAIC ) + 2(PWAIC ) (6.20)
6.3.5 Leave-One-Out Cross-Validation

Notice that the AIC, DIC, and WAIC use the same data to assess the predictive
performance of the model as was used to estimate the model and thus could lead to
overstating the predictive quality of the model. Ideally, we would like to use actual
out-of-sample data in order to assess predictive performance. A method that has
been around since at least the mid-1970s and is uniquely suited to the question of
out-of-sample predictive performance is the so-called leave-one-out cross-validation
(LOO-CV)4 (Allen, 1974; Stone, 1974, 1977). Leave-one-out cross-validation is a

special case of k-fold cross-validation (k-fold CV) wherein a sample is split into k
groups (folds) and each fold is taken to be the validation set with the remaining
k − 1 folds serving as the training set. For LOO-CV, each observation serves as the
validation set with the remaining n − 1 observations serving as the training set. In
other words, k = n.
The LOO-CV follows closely the development of the WAIC described earlier,
however, the elpd for LOO (denoted as elpdLOO is defined as
n
X
elpdLOO = log p(yi | y−i ) (6.21)
i=1
where Z
p(yi | y−i ) = p(yi | θ)p(θ | y−i )dθ (6.22)
is the LOO predictive density given the data with the ith data point left out (Vehtari
et al., 2017).
It is useful to note that an information criterion based on LOO, referred to as
the LOO-IC, can be easily derived as
LOO-IC = −2 × elpd
d
LOO (6.23)
which places the LOO-IC on the deviance scale. Among a set of competing models,
the one with the smallest LOO-IC is considered best from an out-of-sample point-
wise predictive point of view. In addition, it may also be interesting to note that
under maximum likelihood estimation, LOO-CV is asymptotically equivalent to
the AIC (Stone, 1977, see also; Yao, Vehtari, Simpson, & Gelman, 2018a).
As pointed out by Vehtari et al. (2017), LOO is asymptotically equivalent to
the WAIC, but in the case of finite samples with weak priors and/or influential
observations, a more robust method for calculating the LOO-CV might be desired.
To this end, Vehtari et al. (2017) developed a fast and stable approach to obtaining
LOO-CV using Pareto-smoothed importance sampling, which we discussed in
Section 4.8.2. As applied to LOO, the importance ratios presented in Equation
(4.20) are obtained as
p(θt | y−i )
rti = (6.24)
p(θt | y)
From here, the importance sampling LOO predictive distribution is obtained as
PT
rti p( ỹi | θt )
t=1
p( ỹi | y−i ) ≈ PT t (6.25)
t=1 ri
4
A distinction is sometimes made between LOO-CV and Bayesian LOO-CV. The former
implies that LOO can be applied in any cross-validation context, whereas Bayesian LOO-
CV deals explicitly with posterior predictive distributions. For this book, we use LOO-CV
to mean Bayesian LOO-CV.
The density of the held-out data point is obtained from the T posterior samples as
1
p(yi | y−i ) ≈ 1 PT 1
. (6.26)
T t=1 p(yi |θT )
As in the case of importance sampling for variational Bayes, the importance

weights in Equation (6.24) can have very large (and perhaps infinite) variances.
To handle this, Vehtari et al. (2017) propose the use of the generalized Pareto
distribution as discussed in Section 4.8 and which provides the same diagnostic
based on the shape parameter k of the Pareto distribution. The same rules of
thumb as discussed in Section 4.8 apply. The PSIS approach is implemented in loo
(Vehtari et al., 2019), and often referred to as PSIS-LOO. The PSIS-LOO measure
is available in the R software program loo (Vehtari et al., 2019) and is installed
as a required package for RStan (Stan Development Team, 2020), brms (Bürkner,
2021), and rstanarm (Goodrich et al., 2022). See Vehtari et al. (2017) for more
details on the implementation of the LOO-IC in loo.
6.3.6 A Comparison of the WAIC and LOO

Returning to the PISA 2018 regression model in Chapter 5, The Stan code below can
be used to implement PSIS-LOO. The readscore rep[i] line is used for conducting
posterior predictive checks, as before. The log lik[i] is used to obtain log-normal
probability density function need for WAIC and PSIS-LOO.
vector[n] log_lik;
for (i in 1:n) {
readscore_rep[i] = normal_rng(alpha + beta1*Female[i] +
beta2*ESCS[i] + beta3*METASUM[i]
+ beta4*PERFEED[i] + beta5*JOYREAD[i] + beta6*MASTGOAL[i]+
beta7*ADAPTIVITY[i] + beta8*TEACHINT[i]
+ beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i], sigma);
log_lik[i] = normal_lpdf(readscore[i] | alpha + beta1*Female[i]

+ beta2*ESCS[i] + beta3*METASUM[i]
+ beta4*PERFEED[i] + beta5*JOYREAD[i] + beta6*MASTGOAL[i]+
beta7*ADAPTIVITY[i] + beta8*TEACHINT[i]
+ beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i], sigma);
}
}
"
Then, outside of the modelString we add the lines for the model with non-
informative priors.
log_lik_noninf <- extract_log_lik(myfit2, merge_chains = FALSE)

loo_noninf <- loo(log_lik_noninf)
waic_noninf <- waic(log_lik_noninf)
print(loo_noninf)
print(waic_noninf)
The results are displayed directly from the output of RStan.
> print(loo_noninf)
Computed from 2000 by 4838 log-likelihood matrix
Estimate SE
elpd_loo -28537.8 48.3
p_loo 12.2 0.3
looic 57075.6 96.6
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).

See help(’pareto-k-diagnostic’) for details.
> print(waic_inf)
Computed from 2000 by 4838 log-likelihood matrix
Estimate SE
elpd_waic -28537.8 48.3
p_waic 12.1 0.3
waic 57075.6 96.6
The results for the LOO and WAIC are identical because of the large sample size
and also identical to the results for the informative priors case, but the LOO does
provide diagnostic information based on Pareto k that the results can be trusted.
Comparison of Poisson and Negative Binomial Regressions

For completeness, we return to the results of the Poisson and negative binomial
regression models in Chapter 5. With the addition of the dispersion parameter
in the negative binomial model, it may be of interest to compare the LOO values
for both models. The LOO value for the Poisson regression model is 2991.9.5 The
LOO value for the negative binomial regression was found to be 1752.0 which is
substantially lower than the value found for the Poisson regression indicating that
the negative binomial regression model shows much better out-of-sample point-
wise prediction of number of days absent from school compared to the Poisson
regression model.
6.4 Summary
In this chapter we covered Bayesian methods for model evaluation and com-
parison. The chapter began with an overview of the Bayesian critique of null
hypothesis significance testing. We argue that over and above the consistent mis-
use of the NHST in the social and behavioral sciences, the method itself may be
fundamentally flawed insofar as it appears to violate the likelihood principle. We
then discussed the Bayesian alternative of model assessment through posterior
predictive checks. We agree with Gelman and Shalizi (2013) that posterior pre-
dictive checks represent a powerful approach to fully probing the adequacy of a
model around its intended use, and we advocate its routine application in Bayesian
practice. We next covered issues in model comparison and provided a review of
Bayes factors (and an aside on the BIC), the DIC, WAIC, and LOO-CV/LOO-IC.
Following Gelman and Rubin, we agree that while model comparison might be
useful, in the case of Bayes factors and BIC, they probably should not be used
for model selection insofar as the purposes of the model are not embedded as
part of the decision on which to select a model. The DIC, WAIC, and LOO-IC,
on the other hand, are an improvement insofar as they are directed toward model
selection based on predictive criteria. Also, although the WAIC and LOO-CV have
been shown to be asymptotically equivalent (Watanabe, 2010), the implementation
of LOO-CV in the loo package is more robust in finite samples with weak priors
or influential observations (Vehtari et al., 2017) with the LOO-IC perhaps having
the most solid underlying motivation if it is to be used for model selection. In
summary, for model evaluation and comparison/selection, we advocate substan-
tively guided posterior predictive checks for model evaluation and the LOO-IC for
model selection. These will be presented across the examples used in this book.
5
Note that the initial calculation of the LOO led to a number of ”bad” or ”very bad” Pareto-k
values. On inspection, this was due to one outlier observation which was removed, after
which all Pareto-k values were below 0.7. A discussion of the Pareto-k diagnostic can be
found in Sections 4.8.2 and 7.3.4.
7
Bayesian Multilevel Modeling
A common feature of data collection in the social sciences is that units of anal-
ysis (e.g., students or employees) are often nested in higher level organizational
units (e.g., schools or companies, respectively). Indeed, in many instances, the
substantive problem concerns specifically an understanding of the role that units
at both levels play in explaining or predicting outcomes of interest. For example,
the OECD/PISA study deliberately samples schools (within a country) and then
takes an age-based sample of 15 year olds within sampled schools. Such data
collection plans are generically referred to as clustered sampling designs. Data from
clustered sampling designs are then collected at both levels for the purposes of un-
derstanding each level separately, but also to understand the inputs and processes
of student and school level variables as they predict both school and student level
outcomes. Higher levels of nesting are, of course, possible, e.g., students nested in
schools, which in turn are nested in local educational authorities, such as school
districts.
It is probably without exaggeration to say that one of the most important con-
tributions to the empirical analysis of data arising from such data collection efforts
has been the development of so-called multilevel models. Original contributions to
the theory of multilevel modeling for the social sciences can be found in Burstein
(1980), Goldstein (2011), and Raudenbush and Bryk (2002), among others.
The purpose of this chapter is to highlight the fact that multilevel models can
be conceptualized as Bayesian hierarchical models. Apart from the advantages
gained from being able to incorporate priors directly into a multilevel model,
the Bayesian conception of multilevel modeling has another advantage – namely
it clears up a great deal of confusion in the presentation of multilevel models.
Specifically, the literature on multilevel modeling attempts to make a distinction
between so-called “fixed effects” and “random effects.” Indeed, Gelman and Hill
(2003) have recognized this issue and present five different definitions of fixed and
random effects. Moreover, there are differences in the presentation of multilevel
models. For example, Raudenbush and Bryk (2002) provide a pedagogically useful
representation of multilevel modeling as one of modeling different organizational
levels. Others (e.g., Pinheiro & Bates, 2000) represent multilevel modeling as
a single-level “mixed effects” model. Although these two representations are
mathematically equivalent, such differences in presentation and the varying uses
of terminology can be confusing.
125
The advantage of the Bayesian approach to multilevel modeling is that no

distinction needs to be made between fixed and random effects. That is, all
parameters are assumed to be random, characterized by prior distributions that
are, in turn, characterized by hyperparameters. Thus, when conceived of as a
Bayesian hierarchical model, much of the confusion surrounding terminology
disappears. However, the fact that the Bayesian approach to multilevel modeling
clears up terminology is not sufficient to warrant its use. Rather, the advantage
of the Bayesian approach to multilevel modeling lies in the incorporation of prior
information on all model parameters, resulting in posterior distributions that
provide rich descriptions of the effects of interest.
The organization of this chapter is as follows. First, we revisit the concept
of exchangeability and, in particular, discuss conditional exchangeability as a
concept that sets the stage for multilevel modeling. Then, we outline the simple
random effects analysis-of-variance model as a Bayesian hierarchical model. This
is followed by extensions of the random effects analysis-of-variance model to
include random intercepts as outcomes and both random intercepts and slopes as
outcomes models.
We recognize that the topic of multilevel model would not be complete without
a discussion of longitudinal data where repeated measures are nested within
individuals. Indeed, a major benefit of the multilevel modeling perspective is the
flexibility with which longitudinal data can be handled (see, e.g., Raudenbush &
Bryk, 2002). However, our preference is to represent longitudinal modeling as
a latent variable model (Bollen & Curran, 2006; Kaplan, 2009) and we refer the
reader to Depaoli (2021) for a Bayesian treatment of latent variable growth curve
modeling.
7.1 Revisiting Exchangeability

In Section 2.3 we discussed the role of exchangeability and its importance in pro-
viding a firm mathematical basis for the existence of prior distributions. To reiter-
ate, exchangeability implies that the joint distribution of the data, p(y1 , y2 , . . . yn ) is
invariant to permutations of the subscripts. For example, in a sequence of binary
outcomes, the proportion of 1s and 0s matter but not their position in the vector.
In other words, we can exchange the positions of the responses without changing
our belief about the probability model that generated the data. Given that the or-
der does not matter, then the probability model governed by a parameter θ exists
prior to observing the data. The fact that we can describe θ without reference to a
particular set of data is, in fact, what is implied by the idea of a prior distribution.
We also briefly alluded to the fact that one condition where exchangeability
might not hold is when observations are clustered in higher-level units, such as our
example at the beginning of this chapter. Here, observations are not exchangeable
across schools, but may be conditionally exchangeable on knowledge about the
schools. For example, imagine considering the problem of predicting reading
competency and assume that we have no information to differentiate one school
from another (e.g., whether the school is public or private). In this case we can
assume that exchangeability holds for both the student level data vector, denoted
Bayesian Multilevel Modeling 127
as yig , and the school means, denoted as β0g . Assuming exchangeable schools
allows us to assign the same prior probability for the parameters β0g . In other
words, lacking any information about the G schools, exchangeability of the β0g is
a reasonable assumption.
Now, however, consider the situation in which we learn that some subset of
the G schools are public schools and the remainder are private schools. Given this
knowledge, it might not be appropriate to specify the same prior distribution for
these different types of schools. Instead, we might be able to argue that conditional
on school type, the β0g s are exchangeable – that is, we might feel comfortable
assigning the same prior distribution within public and private schools. For this
to be reasonable, we would need to directly add school type to our random effects
model, yielding a more general multilevel model.
Yet another implication of exchangeability in the context of multilevel models
concerns the notion of borrowing strength (see, e.g., Jackman, 2009, p. 307). The cen-
tral idea is that inferences regarding the school means β0g come from two sources.
The first source is information coming from school g itself. However, under ex-
changeability, another source of information arises from the remaining schools via
the prior distribution on β0g . Specifically, given that the prior distribution on the
random school means is generated from the hyperparameters mean µ and variance
σ2 , and given that these parameters are unknown, in essence the data coming from
school g is being used to update the priors µ and σ2 . As Jackman (2009) points
out, the phenomenon of borrowing strength is (1) a consequence of hierarchical
modeling and partial pooling, discussed in Section 2.2, and (2) possible only under
exchangeability of β0g .
7.2 Bayesian Random Effects Analysis of Variance

Perhaps the most basic multilevel model is the random effects analysis of variance.
As a simple example consider whether there are differences among G schools
(g = 1, 2, . . . , G) on reading performance obtained from n students (i = 1, 2, . . . , n)
nested in the G schools. In this example, it is assumed that the G schools are a
random sample from a population of schools.1 The model can be written as a two–
level linear model as follows: Introducing notation that will be used throughout
this chapter, let
readscoreig = β0g + ri j (7.1)
where readscoreig is the reading performance score for student i in school g, β0g
is the school random effect, and rig is a disturbance term with homoskedastic
variance σ2read . The model for the school random effect can be written as
β0g = γ00 + u0g (7.2)

1
In many large-scale studies of schooling, the schools themselves may be obtained from
a complex sampling scheme. However, we will stay with the simple example of random
sampling.
where γ00 is a grand mean and u0g is a homoskedastic error term with variance σ2β0
that picks up the school effect over and above the grand mean. Inserting Equation
(7.2) into Equation (7.1) yields
readscoreig = γ00 + u0g + rig (7.3)
indicating that the reading score for student i in school g can be decomposed into
an overall grand mean γ00 , a component due to the school effects u0g , and a random
error component rig .
An important measure used to evaluate the necessity of multilevel modeling is
the so-called intra-class correlation (ICC). The ICC yields the proportion of variance
in the outcome that can be attributed to differences among schools and is defined
as
σ2β0
ICC = 2 (7.4)
σβ0 + σ2read
What constitutes a large ICC is, of course, a matter of substantive judgment, but a
benefit of Bayesian multilevel modeling is that the ICC is obtained from the draws
of the posterior distribution of the variance terms and thus not only encodes
uncertainty, but allows one to explore credible intervals for the ICC.
Recall that a fully Bayesian perspective requires specifying the prior distribu-
tions on all model parameters. For the model in Equation (7.3), we first specify the
distribution of the reading score given the school effect u g and the within school
variance σ2 . Specifically,
readscoreig ∼ N(u0g , σ2β0 ) (7.5)
Prior distributions on the remaining model parameters can be specified as
β0g ∼ N(µ, σ2β0 ) (7.6)

µ ∼ N(µ0 , σ20 ) (7.7)
+
σ2read ∼ C (0, 1) (7.8)
+
σ2β0 ∼ C (0, 1) (7.9)
where µ0 and τ0 are the mean and standard deviation hyperparameters on µ that
are assumed to be fixed and known. Other choices for prior distributions, especially
for the level 1 and level 2 variance terms, are possible (see Gelman, 2006)
To see how this specification fits into a Bayesian hierarchical model, note that
we can arrange all of the parameters of the random effects ANOVA model into a
vector θ and write the prior distribution as
p(θ) = p(u1 , u2 , . . . , uG , µ, σ2 , ω2 ) (7.10)
where, under the assumption of exchangeability of the school effects u g , we obtain

(see, e.g., Jackman, 2009)
G
Y
p(θ) = p(u g | µ, ω2 )p(µ)p(σ2 )p(ω2 ) (7.11)
g=1
Example 7.1: Bayesian Random Effects ANOVA
For this example we run a simple Bayesian random effects ANOVA on reading
literacy for the United States sample of PISA 2018 (n = 4, 838). The analysis simply
examines whether there are school differences in the average reading performance
of students within schools.
To begin, we read in the data and select the reading score along with the school
identification code which is necessary for the multilevel analysis.
library(rstan)
library(loo)
library(bayesplot)
library(dplyr)
PISA18m0 <- read.csv(file.choose(),header=TRUE)

PISA18m0$SchoolID <- as.numeric(as.factor(PISA18m0$SchoolID))
PISA18m0 <- PISA18m0 %>% select(PV1READ,SchoolID)
attach(PISA18m0)
data.list <- with(PISA18m0, list(readscore=PV1READ,SchoolID=SchoolID,
G = length(unique(PISA18m0$SchoolID)), n = nrow(PISA18m0)))
In the following section is the Stan code for the random effects ANOVA model.
In the data block we read in the sample size, the number of groups, a school
identification number for each student, and the reading score.
RandomEffectsAnova = "
data {
int<lower=0> n; // number of observations
int<lower=0> G; // number of groups
int<lower=1,upper=G> SchoolID[n]; // discrete group indicators
vector[n] readscore; // real valued observations
}
In the following parameter block we define the parameters in the model. This
is followed by a transformed parameters block that allows us to obtain components
for calculating the ICC. Recall that Stan works with standard deviations, and so
these elements must be squared to obtain variances.
parameters {
real mu0; // school means
real<lower=0> sigma_beta0; // level 2 std.
real<lower=0> sigma_read; // level 1 std.
vector[G] mu; // Overall mean
}
real <lower=0> sigma2_read;
real <lower=0> sigma2_beta0;
real <lower=0> ICC;
sigma2_read= sigma_readˆ2;
sigma2_beta0 = sigma_beta0ˆ2;
ICC = sigma2_beta0/(sigma2_read + sigma2_beta0);
}
Next we specify the priors and the likelihood in the model block. In this example
we know that a non-informative prior on the intercept mu would be quite incorrect
given that the international reading mean in PISA 2018 is not zero. Thus we give
a highly diffused prior around a more sensible mean. Finally, notice that the
expression in the likelihood mu[SchoolID] signals the program to obtain the mean
reading score for each school as indexed by the school identification number.
model {
mu ˜ normal(400, 100); // Prior based on PISA international scale
sigma_read ˜ cauchy(0, 1); // Weakly informative prior
sigma_l2 ˜ cauchy(0, 1); // Weakly informative prior
mu0 ˜ normal(mu,sigma_l2);
readscore ˜ normal(mu[SchoolID], sigma_read); // Likelihood
}
In the generated quantities block below, we specify the information needed to

produce the posterior predictive checks and the LOO cross-validation.
vector[n] log_lik_ANOVA;
for(i in 1:n) {
readscore_rep[i] = normal_rng(mu[SchoolID[i]], sigma);

log_lik_ANOVA[i] = normal_lpdf(readscore[i] | mu[SchoolID[i]]
, sigma);
}
}
"
We next specify the information needed for the the analysis, plots, results, posterior
predictive checks, and cross-validation assessment.
nChains = 4
nIter= 10000
thinSteps = 10
RFAnova = stan(data=data.list,model_code=RandomEffectsAnova,
chains=nChains,iter=nIter,warmup=burnInSteps,
thin=thinSteps)
stan_plot(RFAnova,pars=c("mu0","sigma","tau","ICC"))
stan_trace(RFAnova,pars=c("mu0","sigma","tau","ICC"))
stan_dens(RFAnova,fill="gray",pars=c("mu0","sigma","tau","ICC"))
stan_ac(RFAnova,fill="gray",pars=c("mu0","sigma","tau","ICC"))
RFAnovaResults <- print(RFAnova,pars=c("mu0","sigma","tau",

"ICC"))
# Posterior predictive checks
readscore_rep <- as.matrix(RFAnova,pars="readscore_rep")

ppc_dens_overlay(readscore,readscore_rep[1:1000,])
plot <- ppc_stat(readscore, readscore_rep, stat = "mean")

p <- mean(apply(readscore_rep, 1, mean) > mean(readscore))
plot + annotate("text", x=505, y=150,
label = paste("p =", round(p,3)))
# loo predictive accuracy

log_lik_ANOVA <- extract_log_lik(RFAnova, parameter_name=

"log_lik_ANOVA",merge_chains = FALSE)
looANOVA <- loo(log_lik_ANOVA)
print(looANOVA)
Below are the convergence diagnostics along with the results in Table 7.1 below.
As seen in Figures 7.1 - 7.3 below, the diagnostic information suggests that for each
parameter, the algorithm converged adequately to the posterior distribution.
FIGURE 7.1. Trace plots for random effects regression example under informative priors.
FIGURE 7.2. ACF plots for random effects regression example under informative priors.
FIGURE 7.3. Density plots for random effects regression example under informative pri-
ors.
TABLE 7.1. Results of one-way ANOVA

Parameter EAP SD 2.5% 50% 97.5% n eff Rhat
mu0 500.76 3.72 493.62 500.71 508.14 1829 1
sigma 99.32 1.04 97.25 99.36 101.30 2065 1
tau 43.04 2.80 38.03 42.87 49.10 2002 1
ICC 0.16 0.02 0.13 0.16 0.20 1983 1
Note. EAP = Expected A Posteriori. SD = Standard Deviation.
The important piece of information in this analysis is the intra-class correlation.

We see that the posterior mean of the ICC is 0.16, meaning that 16% of the variation
in reading scores occurs between schools and there is a 0.95 probability that the
true ICC is between 13% and 20%. It becomes a matter of substantive judgment
as to whether the ICC is substantial enough to warrant concern about treating the
outcome scores as independent. In practice, it would be prudent to proceed with
multilevel modeling insofar as the resulting estimates would either be impacted
by the clustering in a meaningful way, or not.
Posterior Predictive Checks

Figure 7.4 below displays the posterior predictive density and histogram plots
along with the posterior p-value.
FIGURE 7.4. Density Plots for Random Effects Regression Example Under Informative
Priors
We find that there is a small degree of misfit as judged by the posterior density
plot. However, the posterior prediction of the reading mean under the specific
priors in this model is quite good, with a posterior p-value of 0.507. Finally
the LOO-IC value for this model is 58358.8 which we will use as a means of
comparison with more complex models next.
7.3 Bayesian Intercepts as Outcomes Model

In Example 7.1 we learned that 16% of the variance in the reading scores could
be explained by differences between schools. Assuming that this is substantively
large, the next step is to attempt to model that variation in terms of school level
predictors. For this next example, we examine whether the variation in the school
reading means can be explained by a measure of the shortage of teaching staff in
the school.
Example 7.2: Intercepts as Outcomes Model
Here we provide only the Stan code for the varying intercept model. The
surrounding R code for reading in the data and summarizing the results are
virtually the same as the in Example 7.1. The complete code is available on the
book’s companion website.
InterceptOutcome = "
data {
int<lower=1> n; // number of students
int<lower=1> G; // number of schools
int SchoolID[n]; // school indices
vector[n] readscore; // reading outcome variable
vector[n] STAFFSHORT; // Measure of staff shortage in the school
}
parameters {
vector[G] beta0;
vector[G] beta1;
real gamma00;
real gamma01;
real<lower=0> sigma_read;
real<lower=0> sigma_beta0;
In the following transformed parameters block we define and calculate the intra-
class correlation.
real <lower=0> sigma2;
real <lower=0> tau2;
real <lower=0> ICC;
sigma2 = sigmaˆ2;
tau2 = tauˆ2;
ICC = tau2/(sigma2 + tau2);
}
In the following model block notice that we now specify the model relating the
school means beta0[g]to the staff shortage variable, denoted as STAFFSHORT.
model {
gamma00 ˜ normal(400,100); // Informative prior
gamma01 ˜ normal(0, 2);
sigma_read ˜ cauchy(0,1);
sigma_beta0 ˜ cauchy(0,1);
for(i in 1:n) {
readscore[i] ˜ normal(beta0[SchoolID[i]], sigma_read);
}
for(g in 1:G) {
beta0[g] ˜ normal(gamma00 + gamma01*STAFFSHORT[g], sigma_beta0);
}
}
Finally, in the following generated quantities block, we obtain the information we

need for posterior predictive checking and cross-validation.
real readscore_rep[n];
vector[n] log_lik_M1;
real beta0_rep[G];
for(i in 1:n) {
readscore_rep[i] = normal_rng(beta0[SchoolID[i]],sigma_read);
log_lik_M1[i] = normal_lpdf(readscore[i] | beta0[SchoolID[i]],
sigma_read);
}
for(g in 1:G) {
beta0_rep[g] = normal_rng(gamma00 + gamma01 * STAFFSHORT[g],
sigma_beta0);
}
}
"
The results of the varying intercept model are displayed below in Table 7.2.
TABLE 7.2. Results of varying intercept model

Parameter EAP SD 2.5% 50% 97.5% n eff Rhat
gamma00 500.44 3.82 493.40 500.36 508.21 1924 1.00
gamma01 0.61 1.86 −3.19 0.65 4.13 1996 1.00
sigmaread 99.22 1.06 97.12 99.22 101.32 1947 1.00
sigmabeta0 44.97 2.92 39.51 44.81 50.98 2010 1.00
For this example, we do not show the trace plots, density plots, or ACF plots.
However, inspecting the n eff and Rhat, we find that the model shows good
convergence. The posterior p-value is 0.48 indicating excellent predictive fit of
the model to the mean reading score. The LOO-IC for this model is 58355.6,
indicating that the varying intercept model yields slightly better out-of-sample
predictive accuracy compared to the random effects ANOVA model. The mean
effect of staff shortage (0.61) is in the 95% posterior probability interval, however,
the probability of the effect being greater than zero is 0.63. This suggests that zero
lies relatively close to the mean and 63% of the distribution lies to the right of 0. To
check this, we can use the full posterior distribution to determine the percentage
of the distribution that lies between zero and the mean value of 0.61. This value
turns out to be 0.13, which, indeed, is relatively close to 0. Here again, such a
nuanced analysis would not be possible in a frequentist setting, but, still, content
area judgment is needed to warrant the importance of the effect.
7.4 Bayesian Intercepts and Slopes as Outcomes

Model
Suppose again that interest centers on reading proficiency among 15-year-old
students in the United States. We now wish to model readscore as a function of the
PISA measure of student socioeconomic status (ESCS):
readscoreig = β0ig + β1ig (ESCS) + rig (7.12)
where β jig ( j = 1, 2, . . . , P + 1) are the intercept and regression coefficients (slopes)

that are allowed to vary over the G groups. Raudenbush and Bryk (2002) refer to
the model in Equation (7.12) as the “level 1” model.
Interest in multilevel regression models stems from the fact that we can model
the intercepts and slopes as a function of school-level predictors, which we will
denote as z g . For the following example, we again use the single school-level
predictor of principal-reported teacher shortage in the school (STAFFSHORT). Al-
lowing the intercept β0g to be modeled as a function of STAFFSHORT and allowing
the slope of readscore on ESCS to be modeled as function of STAFFSHORT, we
can write the model
β0g = γ00 + γ01 (STAFFSHORT) + u0g (7.13a)

β1g = γ10 + γ11 (STAFFSHORT) + u1g (7.13b)
where γ’s are the coefficients relating β jg to the school-level predictors. From
Raudenbush and Bryk (2002), the model in Equations (7.13a) - (7.13b) is referred to
as the “level 2” model. The coefficient γ00 captures the school average reading score
for schools with no staff shortage; γ01 captures the relationship between school staff
shortage and school-level reading performance; γ10 captures the school average
ESCS-reading score relationship for schools with no staff shortage; and γ11 captures
the moderating effect of staff shortage on the school average ESCS-reading score
relationship.
As with the random effects ANOVA model, we can substitute Equations (7.13a)
- (7.13b) into Equation (7.12) and rearrange to yield the full model:
readscoreig = γ00 + γ01 (STAFFSHORT) + γ10 (ESCS)

+ γ11 (STAFFSHORT)(ESCS)
+ u0g + u1g (ESCS) + rig (7.14)
When expressed in terms of a Bayesian hierarchical model, we can write the

intercepts and slopes as outcomes model as
readscoreig ∼ N(xig β g , σ2g ) (7.15)

βg ∼ N(z g γ j , σ2j ) (7.16)
γ ∼ N(γ0 , σ2γ ) (7.17)
+
σ2read ∼ C (0, 1) (7.18)
where prior distributions would have to be chosen for σ2g , γ g , and σ2j .
Example 7.3: Intercepts and Slopes as Outcomes Model
The modelString statement IntSlopeOutcome is the same as before only now

we add the student-level variable ESCS.
IntSlopeOutcome = "
data {
int<lower=1> n; // number of students
int<lower=1> G; // number of schools
int SchoolID[n]; // school indices

vector[n] readscore; // reading outcome variable
vector[n] ESCS; // PISA measure of SES
vector[n] STAFFSHORT; // Measure of staff shortage in the school
}
The parameters block adds the coefficient gamma01 for the main effect of ESCS,
and gamma11 for the interaction of ESCS with STAFFSHORT.
parameters {
vector[G] beta0; vector[G] beta1;
real gamma00; real gamma01;
real gamma10; real gamma11;
real<lower=0> sigma_read; real<lower=0>sigma_beta0;
real<lower=0> sigma_beta1;
}
The following model block simply adds non-informative priors to the newly added
parameters from the previous block and specifies the model shown in
model {
gamma00 ˜ normal(400,100);
sigma_read ˜ cauchy(0,1);
for(i in 1:n) {
readscore[i] ˜ normal(beta0[SchoolID[i]] +
beta1[SchoolID[i]]*ESCS[i], sigma_read);
}
for(g in 1:G) {
beta0[g] ˜ normal(gamma00 + gamma01*STAFFSHORT[g],
sigma_beta0);
beta1[g] ˜ normal(gamma10 + gamma11*STAFFSHORT[g],
sigma_beta1);
}
}
Finally, the generated quantities block sets up the code necessary to obtain the
posterior predictive checks and LOO-IC information.
real readscore_rep[n];
vector[n] log_lik_M2;
real beta0_rep[G];
real beta1_rep[G];
for(i in 1:n) {
readscore_rep[i] = normal_rng(beta0[SchoolID[i]]
+ beta1[SchoolID[i]]*ESCS[i],sigma_read);
log_lik_M2[i] = normal_lpdf(readscore[i] | beta0[SchoolID[i]]

+ beta1[SchoolID[i]]*ESCS[i],sigma_read);
}
for(g in 1:G) {
sigma_beta0);
sigma_beta1);
}
}
"
An inspection of the n eff and Rhat (as well as diagnostic plots not shown)
suggest adequate convergence of the algorithm. The results of the intercepts and
slopes as outcomes model are displayed below in Table 7.3.
TABLE 7.3. Results of intercepts and slopes as outcomes model

Parameter EAP SD 2.5% 50% 97.5% n eff Rhat
gamma00 497.69 3.27 491.512 497.74 503.97 1916 1.00
gamma01 0.56 1.79 −3.91 0.60 3.90 2049 1.00
gamma10 13.31 1.20 9.61 13.38 16.70 1673 1.00
gamma11 3.33 1.49 0.32 3.35 6.31 1817 1.00
sigmaread 95.93 1.02 93.93 95.94 97.96 1991 1.00
sigmabeta0 36.89 2.78 31.97 36.77 42.80 2140 1.00
sigmabeta1 17.66 2.76 12.27 17.66 23.07 1499 1.00
The results in Table 7.3 reveal a rather small effect of staff shortage on reading.
Here, however, we are interested in two important predictors of reading perfor-
mance: (1) the impact of ESCS and (2) the moderating effect of staff shortage on
the relationship between ESCS and reading. From Table 7.3 we find a posterior
effect of ESCS of gamma01 = 13.31, with a standard deviation of 1.20. The proba-
bility that this effect is greater than zero is approximately one. The impact of the
moderating effect of staff shortage on the relationship between ESCS and reading
is represented by the coefficient gamma11 = 3.33 with a standard deviation of
1.49. Working with the full posterior distribution, we find that the probability
that 3.33 is greater than 0 is approximately 0.98. In both cases, the percentage of
the distribution that lies between 0 and the obtained estimates are also reasonably
large (0.50 and 0.49, respectively). It appears that ESCS impacts not only average
school reading performance but also interacts with school-level staff shortages in
impacting reading performance. It should also be noted that the LOO-IC for this
model is 58093.8 which is substantially lower than the varying intercepts model,
indicating substantial improvement in the prediction of reading when accounting
for student socioeconomic status, a finding that is not surprising.
7.5 Summary
Multilevel modeling has become an extremely important and powerful tool in
the array of methodologies for the social sciences, by virtue of the fact that many
research studies in the social sciences result in data with some sort of clustering.
The conventional approach to multilevel modeling is based on some variant of
the mixed effects model. A pedagogically useful approach conceives of multi-
level models in terms of levels, as in the work of Raudenbush and Bryk (2002)
and colleagues. The Bayesian perspective of multilevel modeling is to treat the
problem as one of a hierarchy of parameters treated as unknown and where our
uncertainty about the parameters is described by probability distributions. This
chapter attempted to maintain the discussion of multilevel models as levels but
also to show that they are essentially Bayesian hierarchical models. We also point
out that the assumption of exchangeability requires careful consideration in the
context of Bayesian hierarchical models.
8
Bayesian Latent Variable Modeling
As noted in the Preface, a recent book by Sarah Depaoli (2021) provides a detailed
overview of Bayesian structural equation modeling. Depaoli primarily covers
Bayesian structural equation modeling using Mplus (Muthén & Muthén, 1998–
2017), BUGS (Lunn et al., 2009), and blavaan (Merkle et al., 2021) but does not
cover the Stan software package which has been the primary software package
for the examples in this book. Thus, for this chapter, we will examine two simple
but popular latent variable models: (1) confirmatory factor analysis (CFA) and (2)
latent class analysis (LCA), and provide examples that utilize interesting features
of the Stan programming language.
Our focus specifically on CFA and not the the full structural equation model
(SEM) stems from the fact that (recursive) SEM models can be shown to be a
special case of CFA (Kaplan, 2009), and Bayesian estimation of CFA can be readily
translated to the SEM context. We focus on LCA insofar as Bayesian estimation
in this case introduces some interesting problems that lead to new computational
solutions.
8.1 Bayesian Estimation for the CFA

Following the general notation originally provided by Jöreskog (1967), we write
the factor analysis model as
y = α + Λη + ϵ (8.1)
where y is a vector of manifest variables, α is a vector of measurement intercepts,
Λ is a factor loading matrix, η is a vector of latent variables, and ϵ is a vector of
uniquenesses with covariance matrix Ψ, typically specified to be diagonal. Under
conventional assumptions (see, e.g., Kaplan, 2009), we obtain the model expressed
in terms of the population covariance matrix Σ as
Σ = ΛΦΛ′ + Ξ (8.2)
where Φ is the covariance matrix of the common factors, and Ξ is a diagonal

matrix of unique variances.
For this chapter we focus on the confirmatory factor analysis model. Unlike the
exploratory factor analysis model, the confirmatory factor model results from the
143
a priori number and location of (typically) zero values in the factor loading matrix
Λ. In the conventional approach to factor analysis, the additional restrictions
placed on Λ preclude rotation to simple structure (Lawley & Maxwell, 1971).
8.1.1 Priors for CFA Model Parameters

A Bayesian approach to confirmatory factor analysis requires the specification of
priors on all model parameters. To specify the prior distributions, it is notationally
convenient to arrange the model parameters as sets of common conjugate distri-
butions (see Kaplan & Depaoli, 2012). For this model, let θnorm = {α, Λ} be the set
of free model parameters that are assumed to follow a normal distribution and let
θIW = {Φ, Ψ} be the set of free model parameters that are assumed to follow an
inverse-Wishart distribution. Thus,
θnorm ∼ N(µ, Ω) (8.3)
where µ and Ω are the mean and variance hyperparameters, respectively, of the
normal prior. The covariance matrix of the common factors, Φ, and the unique-
ness covariance matrix, Ξ, are assumed to follow an inverse-Wishart distribution.
Specifically,
θIW ∼ IW(Ψ, ν) (8.4)
where Ψ is a positive definite matrix, and ν > P − 1, where P is the number of
observed variables. Different choices for Ψ and ν will yield different degrees
of “informativeness” for the inverse-Wishart distribution. The inverse-Wishart
distribution was discussed in Chapter 3. Note that other prior distributions
for θIW can be chosen. For example, Φ can be standardized to a correlation
matrix and the LKJ(η) prior can be applied. Also, if the uniqueness covariance
matrix Ξ is assumed to be a diagonal matrix of unique variances (which is
typically the case), then the elements of Ξ can be given IG(α, β) priors or
C+ (0,β) priors, where α and β are shape and scale parameters, respectively, for
the C+ distribution, and the location x0 is set to zero by definition (see Section 3.1.3).
Example 8.1: Bayesian Confirmatory Factor Analysis
This example is based on a reanalysis of a confirmatory factor analysis

described in the OECD technical report (OECD, 2010). The PISA background
questionnaire distinguishes two forms of motivation to learn mathematics: (1)
students may learn mathematics because they enjoy it and find it interesting, or
(2) because they perceive learning mathematics as useful. These two constructs
are central in self-determination theory (Ryan & Deci, 2009) and expectancy-value
theory (Wigfield, Tonks, & Klauda, 2009). Data come from the U.S. sample of
PISA 2012 students who were asked the following:
Thinking about your views on mathematics: to what extent do you

agree with the following statements? (Please tick only one box in each
row.) Strongly agree (1)/ Agree (2)/ Disagree (3)/ Strongly disagree
Bayesian Latent Variable Modeling 145
(4) a) I enjoy reading about mathematics (enjoyread), b) Making an

effort in mathematics is worth it because it will help me in the work
that I want to do later on (effort), c) I look forward to my mathematics
lessons (lookforward), d) I do mathematics because I enjoy it (enjoy),
e) Learning mathematics is worthwhile for me because it will
improve my career (career), f) I am interested in the things I learn in
mathematics (interest), g) Mathematics is an important subject for me
because I need it for what I want to study later on (important), and h) I
will learn many things in mathematics that will help me get a job (job).
The confirmatory factor model in this example was specified to have two fac-
tors. The first factor is labeled IntrinsicMotiv measured by enjoyread, lookforward,
enjoy, and interest. The second factor is labeled ExtrinsicMotiv and is measured by
effort, career, important, and job. The Stan code follows.
To begin, we read in and select the data. Missing data are deleted listwise.
library(rstan)
library(loo)
library(bayesplot)
library(dplyr)
# Read in data
PISAcfa <-read.csv(file.choose(),header=TRUE) # browse to select data
PISAcfa <- subset(PISAcfa, select=c(enjoyread,effort,lookforward,
enjoy,career,interest,important,job))
PISAcfa[PISAcfa==999]=NA
PISAcfa <- na.omit(PISAcfa)
The following Stan code was drawn from DeWitt (2018). We next create a list
called patternMat that will be used to assign items to factors.
BayesCFA = "
data {
int<lower=1> n; //sample size
int<lower=1> k; //number of items
int<lower=1> n_fac; // number of factors
matrix[n,k] y; // matrix of outcomes
int<lower=1, upper=n_fac> patternMat[k];
}
Next we standardize the data.

transformed data {
matrix[n,k] scaled_y;
for (j in 1:k){
scaled_y[,j] = (y[,j] - mean(y[,j]))/sd(y[,j]);
}
}
In the following parameters block we first define a matrix with rows equal
to the number of factors and columns equal to the sample size which we name
N01prior, to be used to assign an N(0, 1) prior to the factor loadings. This is
followed by a command in Stan labeled cholesky factor corr which will provide
the Cholesky decomposition to be used later to obtain the factor correlations.
parameters {
matrix[n_fac,n] N01prior;
cholesky_factor_corr[n_fac] fac_cor_helper;
vector[p] scaled_y_means;
vector<lower=0>[k] scaled_y_unique;
vector<lower=0>[k] lambda;
}
In the following transformed parameters block, we specify a matrix that will

contain the factor scores, called FacScores in the code. Then, FacScores is specified
to be the transpose of the
matrix[n,n_fac] FacScores;
FacScores = transpose(fac_cor_helper * N01prior);
In the model block next, we use Stan’s to vector function to specify an N(0, 1)
prior to N0prior. We specify fac cor helper to have a non-informative LKJ prior
via the function lkj corr cholesky(1) (see Chapter 3, Section 3.6). The remain-
ing non-informative/weakly informative priors are then assigned to the means,
uniqunesses, and factor loadings. The final line in this section specifies the likeli-
hood of the data.
model {
to_vector(N01prior) ˜ normal(0,1);
fac_cor_helper ˜ lkj_corr_cholesky(1);
scaled_y_means ˜ normal(0,1);
scaled_y_unique ˜ cauchy(0,1);
lambda ˜ normal(0,1);
// Likelihood
for (j in 1:k) {
scaled_y[, j] ˜ normal(scaled_y_means[j] +
FacScores[,patternMat[j]] * lambda[j],
scaled_y_unique[j]);
}
}
corr_matrix[n_fac] fac_cor ;
vector[p] y_means;
vector[p] y_unique;
fac_cor = multiply_lower_tri_self_transpose(fac_cor_helper);
for(j in 1:k){
y_means[j] = scaled_y_means[j]*sd(y[,j]) + mean(y[,j]);
y_unique[j] = scaled_y_unique[j]*sd(y[,j]);
}
}
"
Finally, we provide the information necessary to begin the estimation of the model.
nChains = 4
nIter= 10000
thinSteps = 10
data.list <- list(n = nrow(PISAcfa), p=ncol(PISAcfa),

y = as.matrix(PISAcfa),n_fac=2,
patternMat = c(1,1,1,1,2,2,2,2))
BayesCFAfit = stan(data=data.list,model_code=BayesCFA,chains=nChains,
iter=nIter,warmup=warmupSteps,thin=thinSteps)
Figures 8.1 - 8.3 below provide the trace plots, density plots, and autocorrela-
tion plots, respectively, for the Bayesian CFA analysis. We see that in each case,
there is reasonable evidence of convergence with the chains mixing well.
FIGURE 8.1. Trace plots for Bayesian CFA.

FIGURE 8.2. Density plots for Bayesian CFA.
FIGURE 8.3. Autocorrelation plots for Bayesian CFA.
The results of the Bayesian CFA can be seen below in Table 8.1.
TABLE 8.1. Parameter estimates for Bayesian CFA model
Parameter Mean SD 95% PPI n eff Rhat

lambda[1] 0.75 0.02 0.72– 0.78 1924 1
lambda[2] 0.63 0.02 0.59–0.66 2007 1
lambda[3] 0.87 0.01 0.85–0.90 1632 1
lambda[4] 0.86 0.01 0.83–0.89 1730 1
lambda[5] 0.82 0.01 0.79–0.85 1784 1
lambda[6] 0.73 0.02 0.70–0.76 1762 1
lambda[7] 0.84 0.01 0.81–0.87 1887 1
lambda[8] 0.83 0.02 0.80–0.86 1784 1
fac cor[2,1] 0.78 0.01 0.76–0.80 1769 1
Along with the results, we also see additional evidence of convergence through
inspection of the n eff, which are close to 2000, and Rhat values, which are all 1.0.
8.2 Bayesian Latent Class Analysis

The use of continuous latent variables arguably dominates most applications of
latent variable modeling. However, it is often useful to hypothesize the existence
of categorical latent variables. Such categorical latent variables can be employed
to explain response frequencies among dichotomous or ordered categorical man-
ifest variables. Methods that use categorical latent variables to explain response
frequencies underlying categorical manifest variables include latent class analysis
and, in the longitudinal context, latent Markov modeling. However, categorical
latent variables can also be used to model continuous observed variables as well,
such as in the context of latent profile analysis. In all of these cases, a common
estimation framework involves mixture modeling.
To motivate Bayesian latent class analysis consider a set of I items (q = 1, . . . , I)
measured on n children (i = 1 . . . n) that represent different levels of reading com-
petency. In our example below, these items represent mastery (=1) or non-mastery
(=0) of items that measure various levels of reading competency. The marginal
response probabilities of an observation yi , (i = 1, 2, . . . n) can be written as
C
X I
Y
p(yi ) = αc p(y jq | zi = c) (8.5)
c=1 q=1
where zi is the class that individual i belongs to, and αc is the probability of being
in class c.
We focus on binary responses (mastery/non-mastery) and therefore we specify
yiq ∼ Bernoulli(pcq ) where pcq is the probability that an individual in class c masters
skill level q. Note that zi is the latent class indicator and so requires a probability
distribution. We write the Bayesian hierarchical latent class model as
zi ∼ Categorical(αc ) (8.6a)
yi | zi = c ∼ Bernoulli(pc ) (8.6b)
PC
where α is a simplex (α1 . . . αC )’ and c=1 = 1 and pc is the class-specific parameter
vector.
Example 8.2 : Bayesian Latent Class Analysis of Reading Competency
For this example, we focus on Bayesian latent class analysis. The data come
from the Early Childhood Longitudinal Study – Kindergarten Cohort of 1998
(ECLS-K:1998-99) (NCES, 2018), which provide a nationally representative sample
of children attending kindergarten in 1998–99 who were periodically assessed until
they reached 8th grade.
We focus on the performance of students in the third wave of the study cor-
responding to students entering the first grade. Five components of reading were
assessed: letter recognition, beginning sounds, ending sounds, word recognition,
and reading in context. For each component, a binary master score is assigned
if the student gets three out of four items in the set correct. Less than that, and
the student is deemed not to have mastered that particular skill. On the basis of
previous research (Kaplan, 2008), a three-class model was deemed to fit the data
well, and we will demonstrate Bayesian LCA with a three-class model.
The Stan code is as follows. First, we read in the data and create a list that
specifies the sample size, the number of variables, and the hypothesized number
of classes. Here we specify a three-class model.
options(mc.cores = 4)
reading <- read.csv(file.choose(),header=T)
read3rd <- subset(reading,select=c(letterrec3,beginning3,ending3,
words3,reading3))
lca_data <- list(y = read3rd,

N = nrow(reading),
I = 5,
C = 3)
In the following data block we specify the dimensions of the items, respondents,
number of latent classes, and the response matrix.
ReadLCA <- "

data {
int<lower=1> I; // number of items

int<lower=1> n; // number of respondents
int<lower=1> C; // number of classes
int y[n,I]; // response matrix
}
Next in the parameter block we specify a unit-simplex labeled alpha of dimension

equal to the number of latent classes C using the function simplex[C]. This defines
the class proportions for each class as the vector alpha and the C × I matrix of
mastery probabilities p for I items and C classes. The unit simplex forces the
values of the latent classes to be greater than zero and that the sum of the values
equal unity.
parameters {
simplex[C] alpha; // probabilities of being in one latent class
real <lower = 0, upper = 1> p[C, I];
}
In the model block we define the vector lmix[C] which will contain the contri-
butions to the marginal probabilities from each latent class. We then use the
log sum exp function in Stan to compute the log of the exponentiated elements
from lmix[C]. Finally, we use the function target += to increment the resulting log
posterior up to an additive constant.1
model {
real lmix[C];
for (i in 1:N){
for (c in 1: C){
lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[i, ] | p[c,]);
}
target += log_sum_exp(lmix);
}
}
1
For more information about the target += function in Stan, see Stan Development Team
(2021a).
Prediction of Latent Class Membership

An important aspect of latent class analysis is the prediction of class membership
for each observation. Specifically, we are interested in p(zi = c | y), the posterior
probability of belonging to latent class c given the observed response vector. It
may be interesting to note that the frequentist approach to the prediction of class
membership utilizes p(zi = c | yi , θ̂), where θ̂ is the maximum likelihood estimate
of the parameters and thus does not account for the uncertainty in the parameters.
The Bayesian approach to prediction of latent class membership utilizes posterior
probabilities and thus accounts for uncertainty in the model.
Below in the generated quantities block we specify an N dimensional vector
labeled pred class dis[N] that will provide the posterior prediction for respondent
i in latent class c. Then we specify a vector of length N labeled pred class[N] that
will contain the posterior probabilities of respondent i in latent class c and which
must have the form of a simplex such that the probabilities must be greater than
or equal to zero and must sum to one. We then define the log mixture class lmix
as the log of
int<lower = 1> pred_class_dis[N];
simplex[C] pred_class[N];
real lmix[C];
for (i in 1:N){
for (c in 1: C){
lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[i, ] | p[c,]);
}
for (c in 1: C){
pred_class[i][c] = exp((lmix[c])-log_sum_exp(lmix));
}
pred_class_dis[i] = categorical_rng(pred_class[i]);
}
}
"
Finally, we specify the necessary code to run the analysis and produce diagnostic
plots.
nChains = 4
nIter= 5000
thinSteps = 10
burnInSteps = floor(nIter/2)
BayesLCA = stan(data=lca_data,model_code=ReadLCA, chains=nChains,

iter=nIter,warmup=burnInSteps,thin=thinSteps)
print(BayesLCA, c("alpha", "p","prod_p"))

testtrace <- mcmc_trace(BayesLCA, pars=c("alpha","prod_p"))
The results are presented below in Table 8.2.
TABLE 8.2. Parameter estimates for Bayesian LCA model
Parameter Mean SD PPI n eff Rhat

alpha[1] 0.31 0.19 0.10-0.64 2 14.47
alpha[2] 0.31 0.19 0.10- 0.65 2 14.58
alpha[3] 0.37 0.25 0.10-0.63 2 18.05
p[1,1] 0.95 0.08 0.77-1.00 2 7.11
p[1,2] 0.78 0.33 0.14-1.00 2 11.93
p[1,3] 0.69 0.38 0.02-0.99 2 21.78
p[1,4] 0.51 0.48 0.00-1.00 2 50.15
p[1,5] 0.21 0.21 0.00-.46 2 14.04
p[2,1] 0.95 0.08 0.77-1.00 2 6.70
p[2,2] 0.76 0.30 0.16-0.95 2 10.82
p[2,3] 0.57 0.30 0.02-0.77 2 15.10
p[2,4] 0.04 0.02 0.00-0.08 3 1.62
p[2,5] 0.00 0.00 0.00-0.01 3 1.05
p[3,1] 0.90 0.10 0.76-1.00 2 6.08
p[3,2] 0.61 0.38 0.14-0.99 2 10.04
p[3,3] 0.52 0.46 0.02-0.99 2 20.76
p[3,4] 0.50 0.49 0.00-1.00 2 67.56
p[3,5] 0.21 0.21 0.00-0.46 2 13.42
8.2.1 The Problem of Label-Switching and a Possible Solution

It is clear from an inspection of the n eff and Rhat values in Table 8.2 that there is
a problem with the LCA results. The problem is one of so-called label-switching, a
common problem in Bayesian mixture models generally. The problem is due to the
fact that a mixture model is identified up to a permutation of the class labels. To be
specific, for any given set of true mixture model parameters, there will be another
alternative set that will give rise to the same likelihood, and thus the model is not
identified. For example, if we have a set of true parameters, say,
α1 = .4
α2 = .6
p1 = (.2, .4, .3, .1)
p2 = (.6, .1, .1, .2)
this will give rise to the same likelihood as
α1 = .6
α2 = .4
p1 = (.6, .1, .1, .2)
p2 = (.2, .4, .3, .1)
In Figure 8.4 below, we can clearly see the label-switching issue in the trace plots,
where the different chains separate for different parameters of the model.
FIGURE 8.4. Trace plots for Bayesian LCA demonstrating the label-switching problem.
A Solution to Label-Switching Using Variational Bayes

One solution to the problem of label-switching is to try to avoid sampling the
parameters of the model in the first place and instead try to approximate the
posterior distribution of the model parameters with other distributions via the
method of variational Bayes, which we introduced in Section 4.8 of Chapter 4.
In what follows, we present the Stan code for conducting variational Bayesian
estimation. The stan vb command reads in the LCA code from a separate file
but it is the same as the above example. The elbo samples command controls the
number of Monte Carlo samples to take for computing the evidence lower bound.
The algorithm command indicates that we are using the meanfield algorithm
which is the default in Stan. The output samples command controls the number
of samples that will be saved for summaries, and the tol rel obj command controls
the convergence tolerance.
stan_vb <- stan_model(file = "˜/desktop/LCA.stan")
vb_fit <-
vb(
stan_vb,
data = lca_data,
iter = 40000,
elbo_samples = 1000,
algorithm = c("meanfield"),
output_samples = 1000,
tol_rel_obj = 0.000001,
Followed by printing and plotting the results as
print(vb_fit, c("alpha", "p","p_prod"))

stan_trace(vb_fit,pars=c("alpha", "p","p_prod")) +
scale_colour_grey(start=.6)
We see from Figure 8.5 below that although the trace plots are not perfect, the
variational Bayes procedure eliminates the label-switching problem.
FIGURE 8.5. Trace plots for variational Bayes LCA demonstrating the removal of the
label-switching problem.
The results shown in Table 8.3 below indicate that latent class 1 is composed of
approximately 12% of the sample and is made up of children who have more or less
mastered letter recognition but have not yet quite mastered the remaining skills.
Latent class 2 is composed of approximately 26% of the sample and is made up
of children who have mastered almost all skills except fully reading. Latent class
3, constituting 63% of the sample, is made up of children who have more or less
mastered letter recognition, beginning sounds, and ending sounds. It is important
to point out that these results are not optimal but may be practically useful insofar
as the Pareto-k values, denoted as khat in the table, were all approximately 0.7.
Recall from Chapter 4 that the Pareto-k values for variational Bayes reflect the
quality of the approximation to p(θ | y) (Yao et al., 2018b). Also, again note that, as
of this writing, the implementation of variational Bayes in Stan is still experimental
and the results should be treated with caution.
TABLE 8.3. Parameter estimates for the variational Bayes LCA model
Parameter Mean SD PPI khat

alpha[1] 0.12 0.01 0.10-0.13 0.66
alpha[2] 0.26 0.01 0.24-0.28 0.66
alpha[3] 0.63 0.01 0.61-0.65 0.67
p[1,1] 0.80 0.02 0.75-0.84 0.66
p[1,2] 0.22 0.04 0.16-0.29 0.67
p[1,3] 0.05 0.03 0.02-0.12 0.66
p[1,4] 0.01 0.01 0.00-0.04 0.66
p[1,5] 0.00 0.01 0.00-0.02 0.66
p[2,1] 1.00 0.00 0.99-1.00 0.66
p[2,2] 0.99 0.01 0.97-0.99 0.67
p[2,3] 0.98 0.01 0.96-0.99 0.66
p[2,4] 0.98 0.01 0.97-0.99 0.66
p[2,5] 0.42 0.02 0.38-0.46 0.66
p[3,1] 1.00 0.00 0.99-1.00 0.66
p[3,2] 0.93 0.01 0.91-0.94 0.66
p[3,3] 0.74 0.01 0.72-0.76 0.66
p[3,4] 0.05 0.01 0.01-0.07 0.66
p[3,5] 0.00 0.03 0.00-0.00 0.66
The results in Table 8.3 can be interpreted as follows. Class 1 is comprised

of 12% of the sample and can be characterized as having students with a high
degree of mastery of letter recognition (80%) but relatively low levels of mastery
of remaining skills. Class 2 is comprised of 26% of the sample and is characterized
by students who have a very high degrees of mastery of letter recognition (100%),
beginning sounds (99%), ending sounds (98%), reading words on site (98%), but
lower probability of mastering the reading words in context (42%). Finally, Class
3 is comprised of 63% of the sample and is characterized by students with high
degrees of master of letter recognition (100%) and beginning sounds (93%), slightly
lower probabilities of mastery of ending sounds (74%) and low probabilities of
master of reading words on site (5%), and full reading in context (0%). Keeping in
mind that these are first-grade students, Class 1 could be labeled “Beginning Skill
Readers,” Class 2 could be labeled “Advanced Skill Readers,” and Class 3 could
be labeled ”Intermediate Skill Readers.”
8.2.2 Comparison of VB to the EM alogorithm

Finally, we compare the results of VB to the same data analyzed via maximum like-
lihood using the expectation-maximization (EM) and Newton-Raphson algorithms
as implemented in the R program poLCA (Linzer & Lewis, 2011). The poLCA code
is as follows:
install.packages("poLCA")
library(poLCA)
reading <- read.csv("˜/desktop/reading.csv",header=T)
readvars <- subset(reading,select=c(letterrec3,
beginning3,ending3,words3,reading3))
First, we need to recode so that values are 1 and 2 instead of 0 and 1. Note that 2
represents mastery.
readvars <- readvars + 1
Next, we specify the formula for the model with no predictors as
readtime1 <- cbind(letterrec3,beginning3,ending3,words3,reading3) ˜ 1
This provides the code for a simple latent class analysis without regressing latent
class membership onto a predictor. Finally, we estimate the 3-class model and the
results are shown below in Table 8.4.
model3 <- poLCA(readtime1,readvars,nclass=3,nrep=20,maxiter=500)

TABLE 8.4. Parameter estimates for maximum likelihood LCA model using poLCA
Variable/class Prob(1) Prob(2)
letterrec3
class 1: 0.20 0.80
class 2: 0.00 1.00
class 3: 0.00 1.00
beginning3
class 1: 0.78 0.22
class 2: 0.01 0.99
class 3: 0.07 0.93
ending3
class 1: 0.96 0.04
class 2: 0.02 0.98
class 3: 0.26 0.74
words3
class 1: 1.00 0.00
class 2: 0.02 0.98
class 3: 0.95 0.05
reading3
class 1: 1.00 0.00
class 2: 0.57 0.43
class 3: 1.00 0.00
Estimated class population shares:
0.11 0.26 0.63
Predicted class memberships (by modal posterior prob.):
0.10 0.27 0.63
We find that the results are very close to the findings using variational Bayes.
Although there are changes to class enumeration (which is trivial), we find that
the estimated class population shares, as well as the predicted class membership
by modal posterior probability, are very close to those found using variational
Bayes.
8.3 Summary
This chapter discussed Bayesian approaches to latent variable modeling with spe-
cial focus on confirmatory factor analysis and latent class analysis. Our example
of confirmatory factor analysis involved some unique Stan coding, but it should
be pointed out that the R program blavaan (Merkle et al., 2021) provides a very
simple interface for confirmatory factor analysis (and structural equation model-
ing generally) with Stan running in the background. For examples of how to run
Bayesian structural equation models, including CFA, using blavaan, see Depaoli,
Kaplan, and Winter (2023)
Regarding latent class analysis, we demonstrated the common problem of
label switching and then applied variational Bayes to address the problem. The
results from variational Bayes were, in this instance, quite close to the results
obtained from frequentist LCA using the EM algorithm. As useful as variational

Bayes appears, at least in this case, it should be noted that the method is somewhat
experimental and a considerable amount of research remains to warrant its general
use over MCMC approaches.
Part III
ADVANCED
TOPICS AND
METHODS
9
Missing Data from a Bayesian
Perspective
Before beginning our treatment of Bayesian model building methods, it is nec-

essary to address a problem that transcends both basic and advanced methods
namely, the problem of missing data. For the purposes of this chapter, we will
focus first on how the problem of missing data has been conceptualized, relying
heavily on the seminal work of Little and Rubin (2020). We will then provide
an overview of some general methods for handling missing data. Finally, we
will examine Bayesian approaches to the analysis of missing data. It should be
noted that this chapter is not meant to provide a complete description of the lit-
erature on missing data theory. Important discussions of missing data can be
found in Little and Rubin (2020), Enders (2022), and Schafer (1997), and the reader
is referred to these works for greater detail. Also, in the interest of space, the
software for this chapter is not presented here but has been made available in
the companion resources for the first edition of this book, which can be found at
http://bise.wceruw.org/publications.html.
9.1 A Nomenclature for Missing Data

To motivate our discussion of missing data, consider as an example, responses
to three items on a parent questionnaire.1 These items ask for the parent’s age,
education level, and household income. For a sample of parents, we might notice
complete data for the age question, somewhat less data are observed for the
education level question, and even more data are missing for the income question.
In considering how to handle the missing data, we need to first consider the
mechanisms that would lead to omitted responses to these items.
In a series of seminal papers culminating in an important book, Little and
Rubin (2020) considered three possible scenarios that would give rise to missing
(and observed) data in this example. Following Little and Rubin (2020, see also;
Enders, 2022), let M be a missing data indicator, taking the value of 1 if the data
1
An example of a parent questionnaire might be the international optional parent survey
provided by the OECD/PISA project.
165
are observed, and 0 if the data are missing. Further, let y be the complete data,
yobs represent observed data, and ymiss represent missing data. Finally, let ϕ be
the scalar or vector-valued parameter describing the process that generates the
missing data. In the first instance, the missing data on education and income
might be unrelated to the age, education, or income level of the participants. In
this instance, we say that the missing data are missing completely at random or
MCAR. More formally, MCAR implies that
f (M | y) = f (M | ϕ) (9.1)
which is to say that the missing data indicator is unrelated to the data, missing
or observed, and only related to some unknown missing data-generating mech-
anism. Conditions in which the missing data might be MCAR include random
coding errors, instances of missing by design, such as occurs with balanced in-
complete block spiraling designs (Kaplan, 1995; Kaplan & Su, 2016), or statistical
matching/data fusion (see e.g. Rässler, 2002). It has been recognized that MCAR
is a fairly unrealistic assumption in most social science data.
In the second instance, missing data on, say education, may be due to the age
or income of the respondents. Similarly, missing data on income may be due to
the age or education of the respondents. So, for example, a parent may not reveal
their income level based on their age and/or education, regardless of their income.
In this case, we say that the missing data are missing at random or MAR. Again, in
terms of our notation, MAR implies that
f (M | y) = f (M | yobs , ϕ) (9.2)
which states that the missing data mechanism is unrelated to variables that are
missing, but could be related to other observed variables in the analysis. Generally,
MAR is a more realistic assumption than MCAR.
Finally, missing data on, say income, might be related to the income of the
respondents and not necessarily on their age or education level. That is, individ-
uals choose to omit their response on the income question because of their level
of income, regardless of their age or education level. In this case, we say that the
missing data are not missing at random or NMAR. More formally,
f (M | y) = f (M | yobs , ymiss , ϕ) (9.3)
meaning that the missing data are related to the variable on which there is missing
data as well as, possibly, the observed data. It has been argued that NMAR is
probably the most realistic scenario of why omitted responses are occurring.
9.2 Ad Hoc Deletion Methods for Handling Missing

Data
Given the formal definitions in the last section, we next examine traditional meth-
ods that have been used in the past to address the problem of missing data. In
this section, we briefly review six traditional methods for addressing missing data
Missing Data from a Bayesian Perspective 167
before turning to more advanced methods, including Bayesian approaches. These

are (1) listwise deletion, (2) pairwise deletion, (3) mean imputation, (4) regression
imputation, (5) stochastic regression imputation, and (6) hot deck imputation.
9.2.1 Listwise Deletion

Simply put, listwise deletion (also referred to as casewise deletion) involves omit-
ting any respondent for which there are any missing data on any variable to be
used in an analysis. Listwise deletion is used as a convenient method for handling
missing data, but suffers from two critical flaws. First, from the frequentist frame-
work, listwise deletion can result in an unacceptable loss of data, hence reducing
statistical power. More importantly, perhaps, is that listwise deletion assumes
that the missing data are MCAR, and as we noted, this is probably a highly un-
realistic assumption in most practical settings. Put another way, the missing data
mechanism presumed under listwise deletion is misspecified.
9.2.2 Pairwise Deletion

Pairwise deletion is essentially the use of listwise deletion when statistical meth-
ods require using pairs of variables in the estimation. The simplest example is
the calculation of a covariance matrix among a set of variables. Using the age,
education, and income example from above, it may be of interest to calculate the
covariance matrix among these three variables. In calculating the covariance be-
tween age and education, the sample would be listwise deleted, and hence the
sample size for education would be used for that calculation. Similarly, for the
covariance between age and income, the sample would again be listwise deleted
and the sample size for income would be used, and similarly, for the education
and income covariance calculation.
There are three fundamental flaws with the use of pairwise deletion. First,
as with listwise deletion, pairwise deletion rests on the somewhat unrealistic
assumption that the missing data are MCAR. Second, from the standpoint of the
frequentist framework, it will result in a loss of power due to the decrease in
sample size. However, and perhaps more seriously, the decrease in the sample
size is not uniform across covariance calculations. In other words, the sample sizes
on which the covariances are calculated are different for different elements of the
covariance matrix. The result of the covariance matrix being based on different
element sample sizes is that it no longer will follow a Wishart distribution and, in
some cases, might not be positive definite. Thus, methods such as regression and
structural equation modeling, which can use the covariance matrix as a sufficient
statistic for estimation, could result in bias or estimation problems. Generally
speaking, pairwise deletion should be avoided.
9.3 Single Imputation Methods

The methods of listwise deletion and pairwise deletion result in a loss of data. To
get around the loss of data, it might be preferable to impute values for the missing
data. Two types of imputations are possible. The first, discussed next, involves
imputing one single value for each missing data point. Methods to be discussed
next are mean imputation, regression imputation, and stochastic regression imputation.
This will be followed by a brief discussion of multiple imputation, which will
set the stage for Bayesian methods. It should be noted that this section does not
present an exhaustive review of single imputation methods.
9.3.1 Mean Imputation

One approach is mean imputation. Simply put, mean imputation for a given variable
yk , (k = 1, 2, . . . , K), requires that we calculate the mean ȳk and insert that value for
all cases that are missing on yk .
Mean imputation suffers from two problems. First, mean imputation assumes
that the missing data are MCAR, which, again, might be unrealistic. Second, and
perhaps more critically, a constant ȳk is imputed for all occurrences of missing data
on yk . In the situation with a large amount of missing data, mean imputation will
result in a loss of variation on that variable.
9.3.2 Regression Imputation

Regression imputation represents an improvement on the previous procedures.
Again, consider the case of missing data on parent age, education, and income.
Regression imputation begins be specifying a regression equation to predict the
missing data on each variable. Using listwise deletion to start, a regression equa-
tion is formed for the missing data on income and the missing data on education.
We can write the models for parent i (i = 1, 2, . . . n) as
educi = β0 + β1 (agei ) + β2 (incomei ) + ei,educ (9.4)
and
incomei = β0 + β1 (agei ) + β2 (educi ) + ei,income (9.5)
From here, predicted values of education and income are obtained as
[i = βˆ0 + βˆ1 (agei ) + βˆ2 (incomei )

educ (9.6)
and
[ i = βˆ0 + βˆ1 (agei ) + βˆ2 (educi )
income (9.7)
and these predicted values are imputed for the corresponding missing data point.
Although single regression imputation is an improvement upon mean im-
putation and the ad hoc deletion methods, it suffers from one major drawback.
Specifically, the predicted values based on the regression in Equations (9.6) and
(9.7) will, by definition, lie exactly on the regression line. This implies that among
the subset of observations for which there are missing data, the correlations among
the variables of interest will be 1.0. As a result, the overall R2 value will be overes-
timated. Second, as with mean imputation, it is presumed that the imputed values
would be the ones observed had there been no missing data. For this to be true,
the regression model would have to have been correctly specified.
9.3.3 Stochastic Regression Imputation

Recall that the major problem of regression imputation is that for the subset of
observations with missing values, the imputed values will like exactly on the
regression line, thus overestimating the strength of association. To address this
problem, we can add a random term to Equations (9.6) and (9.7) that serve to perturb
the imputed values from the predicted line. We would then write Equations (9.6)
and (9.7) as
[i = βˆ0 + βˆ1 (agei ) + βˆ2 (incomei ) + ri,educ

educ (9.8)
and
[ i = βˆ0 + βˆ1 (agei ) + βˆ2 (educi ) + ri,inc
income (9.9)
where ri,educ and ri,inc are values drawn from a normal distribution with mean 0
and variance equal to the residual variance obtained estimation of Equation (9.4).
A distinct advantage to stochastic regression imputation is that it preserves the
variability in the data that is lost with mean imputation and regression imputation.
As a result, it has been shown that stochastic regression imputation yields unbiased
estimates under the somewhat more realistic assumption of MAR and performs
quite similarly to multiple imputation (Enders, 2022). However, it should be
noted that stochastic regression imputation only applies one draw from the normal
distribution to be used for residual terms in Equations (9.8) and (9.9).
9.3.4 Hot Deck Imputation

A rather different form of single imputation involves replacing the missing value
on a variable from some individual by a value from a “similar” individual who, in
fact, did respond to the variable. This method is referred to as hot deck imputation.
The method of hot deck imputation is rather generic with differences in methods
based on how “similarity” is defined. In the simplest case, a sample of respondents
would be classified across a variety of selected demographic characteristics. Then,
a missing value would be replaced by a draw from the subsample of individuals
who are as similar as possible on the selected demographic characteristics as the
individual who omitted the response (Enders, 2022).
The method just described is not limited to categorical variables such as most
background demographic characteristics. Continuous variable can also be used
for hot deck matching, and algorithms such as nearest neighbor matching can be em-
ployed. Following Little and Rubin (2020), the basic idea behind nearest neighbor
hot deck matching is as follows. Let, xi = (xi1 , xi2 , . . . , xiK )′ be a set of K covariates
k = 1, 2, . . . , K measured on individual i, and let yi by a variable that is missing for
individual i. Then, an indicator variable d(i, j) can be formed as

0, i, j in same cell

d(i, j) = 

1, i, j in different cells

This approach would result in partitioning the sample into cells that can be used
for conventional hot deck matching described above. However, other metrics can
be defined as well. For example, the metric of maximum deviation can be defined
as
Maximum deviation: d(i, j) = maxk | xik − x jk | (9.10)

and the Mahalanobis distance metric can be defined as
Mahalanobis: d(i, j) = (xi − x j )′ S−1

xx (xi − x j ) (9.11)
where Sxx is the sample covariance matrix of xi .
9.3.5 Predictive Mean Matching

Regression imputation and hot deck matching sets the ground work for so-called
predictive mean matching introduced by Rubin (1986). The essential idea is that
a missing value is imputed by matching its predicted value based on regression
imputation to the predicted values of the observed data on the basis a predictive
mean metric.
Predictive mean: d(i, j) = [ ŷ(xi ) − ŷ(x j )]2 (9.12)

Once a match is found, the procedure uses the actual observed value for the
imputation. That is, for each regression, there is a predicted value for the missing
data and also a predicted value for the observed data. The predicted value for the
observed data is then matched to a predicted value of the missing data using, say,
a nearest neighbor distance metric. Once the match is found, the actual observed
value (rather than the predicted value) replaces the missing value. In this sense,
predictive mean matching operates much like hot deck matching.
9.4 Bayesian Methods of Multiple Imputation

In the previous section, we discussed single imputation models — including mean
imputation, regression imputation, stochastic regression imputation, hot deck, and
predictive mean matching. In each case, a single value is imputed for the missing
data point. Among the methods discussed in the previous section, stochastic
regression imputation returns some of the variability in the data by imputing a
missing value with some error that allows the missing data point to deviate from
the regression line. Nevertheless, it is still the case that a single value is being
imputed and treated as though that is the value that would have been obtained if
the missing data point had, in fact, been observed. In other words, the imputed
missing data point based on a single imputation method does not account for
uncertainty about the missing data point itself. Thus, rather than imputing a single
value, it may be theoretically better justified to draw multiple plausible missing
data values. This idea forms the basis of multiple imputation, which fundamentally
rests on Bayesian theory.
The central reason for adopting a Bayesian perspective to missing data prob-
lems is that by viewing parameters probabilistically and specifying a prior distri-
bution on the parameters of interest, the imputation method is Bayesianly proper
(Rubin, 1987) insofar as the imputations reflect uncertainty about the missing
data as well as uncertainty about the unknown model parameters. Moreover, the
Bayesian view of statistical inference allows for the incorporation of prior knowl-
edge, which can further reduce uncertainty in model parameters. It is important to
point out that although the method of stochastic regression imputation described
above has a Bayesian flavor, it is not Bayesianly proper insofar as it does not
account for parameter uncertainty, but rather only uncertainty in the predicted
missing data values.
In this section, we will discuss Bayesian approaches to multiple imputation.
To begin, we first consider the data augmentation algorithm of Tanner and Wong
(1987), which is a Bayesian approach to addressing missing data problems and
which is similar to Gibbs sampling (see Section 4.3). We then discuss an approach
to multiple imputation using the chained equation algorithm of van Buuren (2012).
From there, we consider two more modern approaches to multiple imputation.
The first of these is based on the EM algorithm, and the second is based on
a combination of the so-called Bayesian bootstrap and predictive mean matching
discussed earlier.
9.4.1 Data Augmentation

In this section we introduce a Bayesianly proper approach to handling missing
data, which is quite similar to the EM algorithm (described in more detail in Section
9.4.3) but includes notions of Bayesian simulation. This approach is referred to as
data augmentation (DA), developed by Tanner and Wong (1987), and can be useful
for handling missing data problems in the presence of small sample sizes.
Following Little and Rubin (2020), the Bayesian core of the DA algorithm recog-
nizes that under the ignorable missing data assumption, the posterior distribution
of the model parameters given observed data can be written as
p(θ | yobs ) ∝ p(yobs | θ)p(θ) (9.13)

where, as before, p(θ) is the prior distribution. As with the EM algorithm, the DA
algorithm is composed of two steps – referred to as the I(mputation) step and the
P(osterior) step corresponding, respectively, to the E-step and M-step of the EM
algorithm. The I-step begins with an initial value of θ, denoted as θs . Then
(s+1)
I -step: Draw ymiss from p(ymiss | yobs , θs )
In other words, we use the current value θ(s) and the observed data yobs to generate
a value for the missing data from the predictive distribution of the missing data
p(ymiss | yobs , θs ). The I-step is followed by the P-step which draws a new value of
θ, namely θ(s+1) , from the posterior distribution of θ given the observed data yobs
and simulated missing data from the previous step, ys+1 miss
. Formally,
(s+1)
P-step: Draw θ(s+1) from p(θ | yobs , ymiss )
As the number of iterations goes to infinity, the DA algorithm converges to a draw

from the joint posterior distribution p(θ, ymiss | yobs ) (Little & Rubin, 2020, p. 201).
Results of the comparative study for multiple imputation under data augmen-
tation are given below in Table 9.1. The analysis uses the R program norm (Shafer,
2012).
TABLE 9.1. Multiple imputation using data augmentation

Parameter Non-informative prior Informative prior Frequentist
EAP SD EAP SD Coeff. SE
INTERCEPT 487.66 3.25 482.37 2.49 487.67 3.28
READING on GENDER 5.98 2.29 10.76 1.78 5.98 2.29
READING on NATIVE −6.04 3.79 −5.74 3.00 −6.05 3.79
READING on SLNAG 7.55 4.42 10.93 3.23 7.54 4.46
READING on ESCS 31.24 1.29 33.02 1.00 31.25 1.29
READING on JOYREAD 28.82 1.25 25.41 0.99 28.81 1.24
READING on DIVREAD −4.57 1.16 −1.87 0.91 −4.58 1.17
READING on MEMOR −19.06 1.30 −18.79 1.06 −19.05 1.29
READING on ELAB −14.92 1.23 −14.33 1.03 −14.92 1.23
READING on CSTRAT 28.13 1.43 27.12 1.14 28.13 1.44
9.4.2 Chained Equations

In this section we concentrate our discussion on another Bayesianly proper form
of multiple imputation using the method of chained equations. The method of
chained equations recognizes that, in many instances, it might be better to engage
in a series of single univariate imputations along with diagnostic checking rather
than an omnibus multivariate model for imputation that might be sensitive to
specification issues. An overview of previous work on chained equations can be
found in van Buuren (2012).
The essence of the chained equations approach is that univariate regression
models consistent with the scale of the variable with missing data are used to
provide predicted values of the missing data given the observed data. Thus, if
a variable with missing data is continuous, then a normal model is used. If a
variable is a count, then a Poisson model would be appropriate. This is a major
advantage over other Bayesianly proper methods such as data augmentation that
assume a common distribution for all of the variables. Once a variable of interest
is “filled-in,” that variable, along with the variables for which there is complete
data, is used in the sequence to fill in another variable. In general, the order of the
sequence is determined by the amount of missing data, where the variable with
the least amount of missing data is imputed first, and so on.
Once the sequence is completed for all variables with missing data, the pos-
terior distribution of the regression parameters are obtained and the process is
started again. Specifically, the filled-in data from the previous cycle, along with
complete data are used for the second and subsequent cycles (Enders, 2022). The
Gibbs sampler (see Chapter 4) is used to generate the sequence of iterations. Fi-
nally, running the Gibbs sampler with m chains provides m imputed data sets.
Results of the comparative study for multiple imputation using chained equa-
tions are given in Table 9.2 below. The analysis uses the R program mi (Su, Gelman,
Hill, & Yajima, 2011)
TABLE 9.2. Multiple imputation using chained equations: Non-informative and

informative priors
Full Model
INTERCEPT 487.92 3.27 482.59 2.50 487.93 3.31
READING on SLANG 7.68 4.46 10.96 3.24 7.66 4.50
READING on ESCS 31.27 1.31 33.05 1.00 31.29 1.30
READING on MEMOR −19.36 1.32 −18.98 1.07 −19.35 1.31
READING on ELAB −15.05 1.26 −14.40 1.05 −15.04 1.26
9.4.3 EM Bootstrap: A Hybrid Bayesian/Frequentist Method

In this section we examine an approach that combines Bayesian imputation con-
cepts with the frequentist idea of bootstrap sampling. Essentially, bootstrapping is
a data-based simulation method which relies on drawing repeated samples from
the data to estimate the sample distribution of almost any statistic, and it was
developed as a simplified alternative to inferences derived from statistical theory
(Efron & Tibshirani, 1993). Specifically, this section considers the implementation
of the EM algorithm with bootstrapping.
Briefly, EM stands for expectation-maximization and is an algorithm that is
widely used to obtain maximum likelihood estimates of model parameters in
the context of missing data problems (Dempster, Laird, & Rubin, 1977).
The essence of the EM algorithm proceeds as follows. Using a set of starting
values for the means and the covariance matrix of the data (perhaps obtained from
listwise deletion), the E-step of the EM algorithm creates the sufficient statistics
necessary to obtain regression equations that yield the predictions of the missing
data given the observed data and the initial set of model parameters. The next step
is to use the “filled-in” data to obtain new estimates of model parameters via the M-
step, which is simply the use of straightforward equations to obtain new estimates
of the vector of means and the covariance matrix of the data. The algorithm then
iterates back to the E-step to obtain new regression equations. The algorithm
cycles between the E-step and the M-step until a convergence criterion has been
met, at which point the maximum likelihood estimates have been obtained. The
E-step and M-step are the likelihood counterparts of the Bayesian I-step and P-step
in data augmentation.
Following Little and Rubin (2020), the basic idea behind the EM algorithm is
as follows. We recognize that the missing data ymiss contains information relevant
to estimating a parameter θ, and that given an estimate of θ, we can obtain
information regarding ymiss . Thus, a sensible approach would be to start with
an initial value of θ, say θ(0) , estimate ymiss based on that value, and then with
the “filled-in” data, re-estimate θ via maximum likelihood, referring to this new
estimate as θ(1) . The process then continues until the s iterations (s = 0, 1, 2, . . .)
converge.
More formally, the EM algorithm has two steps: the (E)xpectation-step and
the (M)aximzation-step. The E-step begins with an initial value of the parameter,
θs , treating it as θ and obtains the expected complete data log likelihood:
Z
Q(θ | θ ) =
(s)
l(θ | y)p(ymiss | yobs , θ(s) )dymiss (9.14)
The M-step then obtains θ(s+1) via maximum likelihood estimation of the expected
complete data log likelihood in Equation (9.14). Dempster et al. (1977, see also;
Schafer, 1997) showed that θ(s+1) is a better estimate than θ(s) insofar as the observed
data log likelihood under θ(s+1) is at least as large as that obtained under θ(s) – that
is,
l(θ(s+1) | yobs ) ≥ l(θ(s) | yobs ) (9.15)

because θ(s+1) was chosen so that
Q(θ(s+1) | θ(s) ) ≥ Q(θ | θ(s) ), for all θ (9.16)
The EM algorithm has been extended to handle the problem of multiple im-
putation without the need for computationally intensive draws from the posterior
distribution, as with the data augmentation approach. The idea is to extend the EM
algorithm using a bootstrap approach. This approach is labeled EMB (Honaker
& King, 2010) and implemented in the R program Amelia (Honaker, King, &
Blackwell, 2011), which we will use in our analyses below.
Following Honaker and King (2010) and Honaker (personal communcation,
June 2011), the first step is to bootstrap the dataset to create m versions of the
incomplete data, where m ranges typically from three to five as in other multiple
imputation approaches. Bootstrap resampling involves taking a sample of size n
with replacement from the original dataset. Second, the EM algorithm is run and it
is here that Honaker and King (2010) allow for the inclusion of prior distributions
on the model parameters estimated via the EM algorithm. Notice that because m
boostrapped samples are obtained, and that each EM run on these samples may
contain priors, then once the EM algorithm has run, the model parameters will be
different. Indeed, with priors, the final results are the maximum a posteriori (MAP)
estimates, which is the Bayesian counterpart of the maximum likelihood estimates.
Finally, missing values are imputed based on the final converged estimates for each
of the m datasets. These m versions can then be used in subsequent analyses.
Results of the comparative study for multiple imputation under the EM boot-
strap are given below in Table 9.3. The analysis uses the R program Amelia
(Honaker et al., 2011)
TABLE 9.3. Multiple imputation using the EM bootstrap

Full Model
INTERCEPT 488.10 3.26 482.61 2.50 488.11 3.29
READING on SLANG 7.08 4.46 10.68 3.24 7.06 4.50
READING on ESCS 31.41 1.29 33.12 1.00 31.42 1.29
READING on MEMOR −19.01 1.31 −18.76 1.07 −19.00 1.30
READING on ELAB −15.15 1.25 −14.48 1.04 −15.14 1.24
9.4.4 Bayesian Bootstrap Predictive Mean Matching

Multiple imputation via data augmentation is inherently a parametric method.
That is, in estimating a Bayesian linear regression, the posterior distributions are
obtained via Bayes’ theorem which requires parametric assumptions. It may be
desirable, however, to relax assumptions regarding the posterior distributions of
the model parameters, and to do this requires a replacement of the step that draws
the conditional predictive distribution of the missing data given the observed
data. A hybrid of predictive mean matching, referred to as posterior predictive
mean matching, proceeds first by obtaining parameter draws using classical MI
approaches. However, the final step then uses those values to obtain predicted
values of the data followed by conventional predictive mean matching.
Posterior predictive mean matching sets the groundwork for Bayesian bootstrap
predictive mean matching (BBPMM). The goal of BBPMM is to further relax the
distributional assumptions associated with draws from the posterior distributions
of the model parameters. The algorithm begins by forming a Bayesian bootstrap
of the observations (Rubin, 1981). The Bayesian bootstrap (BB) is quite similar to
conventional frequentist bootstrap (Efron, 1979), except that it provides a method
for simulating the posterior distribution of the parameter(s) of interest rather than
the sampling distribution of parameter(s) of interest, and as such, is more robust to
violations of distributional assumptions associated with the posterior distribution.
For specific details, see Rubin (1981). Next, BBPMM obtains estimates of the
regression parameters from the BB sample. This is followed by the calculation
of predicted values of the observed and missing data based on the regression
parameters from the BB sample. Then, predictive mean matching is performed as
described earlier. As with conventional MI, these steps can be carried out m ≥ 1
times to create m multiply imputed datasets.
Results of the comparative study for multiple imputation under the Bayesian
bootstrap predictive mean matching are given below in Table 9.4. The analysis
uses the R program BaBooN (Meinfelder, 2011)
TABLE 9.4. Multiple imputation using Bayesian bootstrap predictive mean matching
Full Model
INTERCEPT 487.71 3.32 482.44 2.52 487.72 3.36
READING on SLANG 7.79 4.55 11.03 3.27 7.77 4.59
READING on ESCS 31.34 1.31 33.09 1.00 31.35 1.30
READING on MEMOR −19.10 1.31 −18.82 1.07 −19.09 1.30
READING on ELAB −15.12 1.27 −14.47 1.05 −15.11 1.26
An inspection of Tables 9.1 – 9.4 reveal similar results across methods of impu-
tation for both the non-informative and frequentist cases, but sizable differences
when comparing the informative case to the non-informative and frequentist cases,
particularly for standard deviations, as expected. Of course, it is difficult to draw
generalizations about these methods when based on real data, but the results do
serve as a caution that important differences can occur depending on whether and
how missing data are handled.
9.4.5 Accounting for Imputation Model Uncertainty

A recent paper by Kaplan and Yavuz (2019) argued that Bayesianly proper ap-
proaches to multiple imputation, although correctly accounting for uncertainty
in imputation model parameters, ignore the uncertainty in the imputation model
itself. Kaplan and Yavuz (2019) addressed imputation model uncertainty by adopt-
ing a predictive point of view regarding imputation and implementing Bayesian
model averaging as part of the imputation process.
We will discuss Bayesian model averaging in more detail in Chapter 11, but
suffice it to say here that Bayesian model averaging accounts for both model and
parameter uncertainty, and thus is fully Bayesianly proper. The essential idea
underlying Kaplan and Yavuz’s (2019) approach to multiple imputation is that
one simply adds a Bayesian model averaging component to each cycle of the
chained equations approach. That is, as each variable takes its turn as the target
variable for imputation, Bayesian model averaging is applied to the imputation
model for that target variable. In doing so, imputation model uncertainty (as well
as parameter uncertainty) is accounted for across all variables and iterations.
Kaplan and Yavuz (2019) applied Bayesian model averaging to multiple impu-
tation under the fully conditional specification approach. An extensive simulation
study was conducted comparing their Bayesian model averaging approach against
normal theory-based Bayesian imputation not accounting for model uncertainty.
Across almost all conditions of their simulation study, the results revealed the
extent of model uncertainty in multiple imputation and a consistent advantage
to our Bayesian model averaging approach over normal-theory multiple impu-

tation under missing-at-random and missing-completely-at random in terms of
Kullback-Liebler divergence and mean squared prediction error. A case study
using data from PISA 2015 (OECD, 2017) revealed the extent of imputation model
uncertainty. Specifically, for a model containing nine predictor variables and 20
fully chained equation imputations, the posterior model probabilities ranged from
a low of 0.181 to a high of 1.0 (see Table 6 in Kaplan & Yavuz, 2019). This extensive
model uncertainty is ignored in standard multiple imputation. An R program
miBMA that was used to conduct the analyses in Kaplan and Yavuz (2019) can be
found at https://github.com/Sinan-Yavuz/miBMA
9.5 Summary
This chapter presented an overview of advanced methods for handling problems
of missing data. Given theoretical developments discussed in Little and Rubin
(2020), extended by Schafer (1997), Rässler (2002), and van Buuren (2012), among
others, and summarized in Enders (2022), there is no defensible reason to resort to
ad hoc methods such as listwise and pairwise deletion. The central idea of multiple
imputation originated by Rubin (1987) is essentially Bayesian, and the various
algorithms described in this chapter, such as data augmentation and chained
equations now allow for a full Bayesian approach to addressing uncertainty in
missing data via chained equations and to analyze multiple imputed datasets
using fully Bayesian methods. The next chapter introduces Bayesian methods to
the linear and generalized linear model.
10
Bayesian Variable Selection and
Sparsity
10.1 Introduction
Over the past three decades a great deal of attention has been paid to the problem
of variable selection. Specifically, in considering a relatively long list of predictors
such as shown in the linear regression example in Chapter 5, concern focuses on
the trade-off between the bias that could occur if important variables are omitted
from the model and the variance that could occur from overfitting the model
with variables that do not play a very important role in the prediction of the
outcome. Variable selection methods are designed to yield so-called sparse models
that contain, more or less, the important predictors of the outcome.
This chapter concentrates on Bayesian methods for variable selection, although
the two methods discussed here can be implemented in a frequentist framework
and the results are often comparable. However, as pointed out by van Erp (2020),
there are a number of important benefits in adopting a Bayesian framework for
variable selection. First, as we will see, variable selection can be easily imple-
mented through the priors placed on model parameters, and these are generically
referred to as shrinkage priors or sparsity-inducing priors. Shrinkage priors can
be specified to shrink small coefficients toward zero while allowing large coeffi-
cients to remain large. Sparsity is induced by specifying certain hyperparameters
within the priors set on the model parameters. These hyperparameters are defined
through their own hyperprior distributions. The hyperpriors can be manipulated
to increase or decrease the amount of shrinkage in the estimated effects.
The second benefit of adopting a Bayesian perspective to variable selection is
that the penalty term is estimated in the same step as the other model parameters.
In other words, the penalty term is built into the model estimation process because
it is incorporated directly into the model via a prior. In turn, that prior can be
specified in a flexible manner through different settings, controlling for the degree
of shrinkage as the researcher sees fit.
Finally, the third benefit of estimating Bayesian penalty terms is that many dif-
ferent forms of penalties can be implemented. There are frequentist-based penalty
techniques, such as the ridge and lasso methods described, which have their
179
Bayesian counterparts. In addition, there are methods that are strictly Bayesian
such as the spike-and-slab prior and the horseshoe prior (see van de Schoot et al.,
2021, for more information on these priors.).
In this chapter, we focus on Bayesian variable selection methods in the context
of linear models and consider four methods for variable selection: (1) the ridge
prior (A. E. Hoerl & Kennard, 1970; Hsiang, 1975), (2) lasso prior, (Park & Casella,
2008; Tibshirani, 1996), (3) horseshoe prior (Carvalho, Polson, & Scott, 2010), and
(4) regularized horseshoe prior (Piironen & Vehtari, 2017). The first two can also
be implemented in a frequentist setting, but we will concentrate on their Bayesian
counterparts. Although there are many more that could be considered (see, e.g.,
Hastie, Tibshirani, & Friedman, 2009), these methods are chosen to highlight the
issues of variable selection and lead naturally into our discussion of Bayesian
model averaging in Chapter 11. A representation of the different shrinkage prior
distributions is given below in Figure 10.1, and a comparison of the performance
of these priors will be given in Section 10.6 below.
FIGURE 10.1. Four types of shrinkage priors. Top row left: Ridge prior N(0,1); top row
right: Laplace prior with location=0, scale=4; bottom row left: Horseshoe prior with λp ∼
C+ (0,1) and τ ∼ C+ (0,1); bottom row right: Regularized horseshoe prior.
Bayesian Variable Selection and Sparsity 181
10.2 The Ridge Prior

As a regularization method, ridge regression (A. E. Hoerl & Kennard, 1970;
R. W. Hoerl, 1985) aims to yield a parsimonious regularized regression model
in the presence of highly correlated variables. The frequentist ridge estimator of
β, denoted as βridge is obtained by solving the minimization
P
X
βridge = argmin y′ y − β′ x′ x + λ β2p (10.1)
β p=1
where λ ≥ 0 is a tuning parameter that controls the degree of regularization and

the term λ Pp=1 β2p ) is referred to as an L2 − norm. When λ = 0, we have ordinary
P
least squares, and when λ = ∞, we obtain βridge = 0. With ridge regression it can
be seen that a large value of λ can lead to very heavy penalization.
Hsiang (1975) showed that if β has a mean of zero and covariance matrix
Σ = (σ2 /λ)I, and if ϵ ∼ N(0, σ2ϵ I), then the posterior mean of β is (x′ x + λI)−1 x′ y,
which is an alternative specification of the ridge estimator. Hsiang (1975) also
notes that if weakly informative or informative priors are placed on βp , then the
interpretation of the posterior mean of β as the ridge estimate is no longer valid.
In Bayesian ridge regression, the penalty term (λ) is captured through normally
distributed independent priors placed on the regression slope parameters. These
normal priors have mean hyperparameter values fixed at zero in order to control
shrinkage toward zero. The variance hyperparameter is typically rescaled to be
in standard deviation form and is set to define the degree of spread that the
distribution exhibits. Note that we specify a C+ (0,1) distribution for the residual
standard deviation, but other priors could be specified as well. A representation
of the ridge prior is given in the top left of Figure 10.1.
The specification of the ridge prior for the example that follows can be written
as
σ2
!
β j |λ, σ ∼ N 0,
2
, for j = 1, ..., p (10.2)
λ
where, for the following example, we assume σ2 = 1 and we set λ = 1, inducing
an N(0, 1) prior on each regression coefficient. Note, however, that larger values
of λ induce greater penalties insofar as the variance of the regression coefficients
become smaller.
Example 10.1: Bayesian Ridge Regression
Before beginning, it is necessary when using any of the sparsity-inducing pri-

ors that the data be standardized beforehand. Standardizing the data beforehand
provides a constant value that all parameters can be shrunk toward, namely, zero.
For our example of Bayesian ridge regression, we return to the Bayesian linear
regression model in Chapter 5, using data from PISA 2018 to estimate a model
of reading proficiency. For this example, we take a random sample of 100 ob-
servations to demonstrate the differences in the amount of shrinkage across the
methods. Preliminary analyses with the full sample reveal virtually no differences
among the methods, as would be expected.1
In what follows, only the data and parameter blocks are provided insofar as
the remaining code is the same as that in Example 5.1 and also across all other
methods. For the ridge priors, we give an N(0, 1) prior to the regression coefficients
and a C+ (0,1) distribution to the standard deviation of the residuals. The likelihood
follows the specification of the priors.
RidgeString = "
data {
int<lower=0> n;
}
parameters {
real alpha;
}
model {
real mu[n];
for (i in 1:n)
mu[i] = alpha + beta1*Female[i] + beta2*ESCS[i] + beta3*METASUM[i]
// Priors
1
We do not sample students within schools, thus this example should not be taken as a
serious model of reading proficiency.
// Likelihood
The important points to note about this code is that, first, the data should be
standardized before estimation. Second, note that the specification of the N(0, 1)
priors induces the ridge shrinkage in the sense that regression coefficients that
are close to zero will be shrunk toward the prior mean of zero, whereas large
coefficients should be relatively unaffected by the prior. Again, as noted above,
the extent of the shrinkage is determined by the value of λ.
10.3 The Lasso Prior

A drawback of ridge regression is that it does not improve parsimony in the sense
that all of the variables still remain in the model after penalization (Zou & Hastie,
2005). A method that appears similar to ridge regression but is principally different
in terms as yielding a parsimonious model is the least absolute shrinkage and selection
operator or LASSO. The frequentist lasso involves solving the expression
P
X
βlasso = argmin y′ y − β′ x′ x + λ |βp | (10.3)
β p=1
The term λ Pp=1 |βp | is referred to as an L1 − norm penalty, which allows less
P
important coefficients to be set to zero, and thus the lasso provides for both
shrinkage and variable selection.
Bayesian lasso penalization uses a different shrinkage prior as compared to the
Bayesian ridge approach. Specifically, Tibshirani (1996) showed that |βp | is propor-
tional to minus the log-density of the double exponential (Laplace) distribution.
That is, the lasso estimate of the posterior mode of βp can be obtained by using the
prior !
1 |βp |
p(βp ) = exp − (10.4)
2τ τ
where τ = 1/λ.
The top right of Figure 10.1 shows the double exponential distribution. We see
that the double exponential distribution is ideal because it peaks at zero, which
shrinks small coefficients toward zero. However, the double exponential can be
set to have thick tails (in both directions), allowing the larger coefficients to remain
large. Given that the distribution is centered at zero to control shrinkage toward
zero, the mean hyperparameter setting is fixed to zero. The scale, or dispersion,
of the double exponential distribution is the hyperparameter that researchers can
alter when implementing the shrinkage. This defines the amount of spread and
the thickness of the tails, which controls the degree of shrinkage in coefficients.
Again, a C+ (0,1) prior can be specified on the standard deviation of the residuals,
if desired.
Although the ridge and lasso approaches are similarly implemented in
the Bayesian framework, these techniques can produce different amounts of
shrinkage depending on the hyperparameter settings. That is, the lasso approach
can result in more shrinkage for the small estimates, but less shrinkage for the
large estimates. This result is a function of the double exponential distribution
implemented in the lasso approach. The double exponential distribution is more
peaked around zero and it has heavier tails compared to the normal distribution
used in the ridge approach. Regardless of the approach implemented, Bayesian
penalization can be a useful tool when attempting to avoid overfitting a complex
model to small samples. Indeed, the lasso is simultaneously a shrinkage and
variable selection method. In addition, these approaches further highlight
the modeling flexibility that Bayesian methods provide through the flexible
implementation of priors. Next follows the specification for the lasso priors.
Example 10.2: Bayesian Lasso Regression
Below we show the Stan code for the lasso prior. Note in the parameter block
the use of the double exponential(0,1) distribution to induce the lasso. Also, notice
that we do not attempt to induce as much shrinkage in the intercept alpha.
modelString = "
data {
int<lower=0> n;
}
model {
real mu[n];
for (i in 1:n)
mu[i] = alpha + beta1*Female[i] + beta2*ESCS[i] +
beta3*METASUM[i]
// Priors
beta1 ˜ double_exponential(0, 1); beta6 ˜ double_exponential(0, 1);

// Likelihood
The lasso is not without limitations (see van Erp, Oberski, & Mulder, 2019).
First, when the number of variables p are greater than the sample size n (which
we might encounter in “big data” problems), the model selection algorithm will
stop at n because the model will no longer be identified. Second, if there are
groups of variables that are highly pairwise correlated, the lasso will select only
one of the variables from that group rather arbitrarily. Third, when n > p, which
is the motivating case in this chapter, and when variables are highly correlated,
it has been shown that ridge regression will outperform the lasso with respect to
predictive performance.
10.4 The Horseshoe Prior

An alternative to the lasso prior which has gained popularity in the Bayesian
literature is the so-called horseshoe prior. The horseshoe prior belongs to a class
of so-called global-local shrinkage priors.2 Following the notation in Betancourt
(2018a), the horseshoe prior can be specified as follows:
βp ∼ N(0, τλp ) (10.5a)

+
λp ∼ C (0, 1) (10.5b)
+
τ ∼ C (0, τ0 ) (10.5c)
where τ0 is a hyperparameter that controls the behavior of the global shrinkage

prior τ. The intuition behind the horseshoe prior is that the global parameter τ
shrinks all of the coefficients toward zero while the local parameter λp allows
some large coefficients to bypass the shrinkage. The horseshoe prior can be seen
in the bottom row left of Figure 10.1
Example 10.3: The Horseshoe Prior
2
The horseshoe prior gets its name from the fact that under certain conditions, the proba-
bility distribution of the shrinkage parameter associated with horseshoe prior reduces to a
Beta( 21 , 12 ) distribution, which has the shape of a horseshoe.
For this example, we specify λp as the local prior for each of the p regression
coefficients and τ as the global prior in the Stan parameter block, where we set
τ0 = 1. Note that in the Stan model block, the regression coefficients have mean
zero and a scale mixture τλp .
Horseshoe = "
data {
int<lower=1> n; // Number of data
int<lower=1> p; // Number of covariates
matrix[n,p] X;
real readscore[n];
}
parameters {
vector[p] beta;
vector<lower=0>[p] lambda; // Local prior
real<lower=0> tau; // Global prior
real alpha;
}
model {
beta ˜ normal(0, tau * lambda); // Scale mixture
tau ˜ cauchy(0, 1);
lambda ˜ cauchy(0, 1);
sigma ˜ cauchy(0, 1);
readscore ˜ normal(alpha + X * beta, sigma);

}
// For posterior predictive checking and loo cross-validation
vector[n] log_lik;
for (i in 1:n) {
readscore_rep[i] = normal_rng(alpha + X[i,:] * beta, sigma);
log_lik[i] = normal_lpdf(readscore[i] | alpha + X[i,:]
* beta, sigma);
}
}
"
10.5 Regularized Horseshoe Prior

A limitation of the conventional horseshoe prior relates to the regularization of the
large coefficients. Specifically, it is still the case that large coefficients can transcend
the global scale set by τ0 , with the impact being that the posteriors of these large
coefficients can become quite diffused, particularly in the case of weakly-identified
coefficients (Betancourt, 2018a). To remedy this issue, Piironen and Vehtari (2017)
proposed a regularized version of the horseshoe prior (also known as the Finnish
horseshoe prior). Following the notation used in Betancourt (2018a),
βp ∼ N(0, τλ̃p ) (10.6a)

cλp
λ̃p = q (10.6b)
c2 + τ2 λ2p
λp ∼ C+ (0, 1), (10.6c)
ν ν

c2 ∼ inv-gamma , s2 (10.6d)
2 2
τ ∼ C+ (0, τ0 ) (10.6e)
where s2 is the variance for each of the p predictor variables. As pointed out by
Piironen and Vehtari (2017), those variables that have large variances would be
considered more relevant a priori, and while it is possible to provide predictor
specific values for s2 , generally we scale the variables ahead of time so that s2 =
1. Finally, c2 is the slab width which controls the size of the large regression
coefficients.
To gain an intuition of the regularized horseshoe, first note that the form
of Equation (10.6a) is quite similar to the horseshoe prior, however λ̃p places a
control on the size of the coefficients by introducing a slab width c2 in Equation
(10.6b). Following Piironen and Vehtari (2017), notice that if τ2 λ2p ≪ c2 , then this
means that βp is close to zero and λ̃p → λp , which is the original horseshoe in
Section 10.4. However, if τ2 λ2p ≫ c2 , then λ̃p → c2 /τ2 and the prior begins to
approach the N(0, c2 ), where, again, the choice of c2 controls the size of the large
coefficients. Because c2 is a slab width that might not be well known, it follows
that it should be given a prior distribution, and Piironen and Vehtari (2017)
recommend the inverse-gamma distribution in Equation (10.6d), which induces a
relatively non-informative Student’s - t slab when coefficients are far from zero.
Example: 10.4: The Regularized Horseshoe Prior
In setting up Stan first recall that as with all of the methods for sparsity, the
data are first standardized to have a mean of zero and standard deviation of
one. Also, recall that Stan works with standard deviations and not variances or
precisions. To start, for the regularized horseshoe we first need to indicate our
belief regarding the number of large coefficients. This is required because the
global scale parameter τ0 inside the transformed parameter block is a function of
the number of large coefficients assumed by the researcher ahead of analyzing the
data. In the transformed data block, this is indicated by the line real p0=5;.
PISA18sampleScale <- read.csv(file.choose(),header=T)
n <- nrow(PISA18sampleScale)
X <- PISA18sampleScale[,2:11]
readscore <- PISA18sampleScale[,1]
p <- ncol(X)
data.list <- list(n=n, p=p, X=X, readscore=readscore)
# Stan code adapted and modified from from Betacourt 2018 #
modelString = "
data {
int <lower=1> n; // number of observations
int <lower=1> p; // number of predictors
real readscore[n]; // outcome
matrix[n,p] X; // inputs
}
transformed data {
real p0 = 5;
}
Next, in the parameters block, we define the parameters of the regularized horse-
shoe given in Equations (10.6a) - (10.6e).
parameters {
vector[p] beta;
vector<lower=0>[p] lambda;
real<lower=0> c2;
real<lower=0> tau;
real alpha;
In the transformed parameters we specify tau0 in line with Betancourt (2018a) and
we write λ̃ as in Equation (10.6d).
real tau0 = (p0 / (p - p0)) * (sigma / sqrt(1.0 * n));
vector[p] lambda_tilde =
sqrt(c2) * lambda ./ sqrt(c2 + square(tau) * square(lambda));
}
We now put everything together in the model block.
model {
beta ˜ normal(0, tau * lambda_tilde);
lambda ˜ cauchy(0, 1);
c2 ˜ inv_gamma(2,8);
tau ˜ cauchy(0, tau0);

sigma ˜ cauchy(0, 1);
readscore ˜ normal(X * beta + alpha, sigma);
}
// For posterior predictive checking and loo cross-validation
vector[n] log_lik;
for (i in 1:n) {
readscore_rep[i] = normal_rng(alpha + X[i,:] * beta, sigma);
log_lik[i] = normal_lpdf(readscore[i] | alpha + X[i,:]
* beta, sigma);
}
}
"
10.6 Comparison of Regularization Methods

It may be of interest to run a side-by-side comparison of the regularization methods
in terms of their effects on parameter estimates and standard deviations and LOO
cross-validation. The comparison is displayed below in Table 10.1. Bayesian linear
regression with non-informative priors using the standardized data is given under
the BLR column for comparison purposes.
TABLE 10.1. Comparison of posterior results based on different regularization

methods
Ridge Lasso Horseshoea Reg. horseshoeb
Variable Parameter Mean (SD) Mean (SD) Mean (SD) Mean (SD)
Intercept alpha −0.04 (0.12) −0.05 (0.12) −0.01 (0.10) −0.01 (0.10)
FEMALE beta1 0.08 (0.16) 0.08 (0.17) 0.02 (0.10) 0.02 (0.10)
ESCS beta2 0.27 (0.09) 0.28 (0.09) 0.24 (0.10) 0.23 (0.10)
METASUM beta3 0.37 (0.08) 0.36 (0.08) 0.33 (0.09) 0.33 (0.09)
PERFEED beta4 −0.10 (0.10) −0.10 (0.10) −0.05 (0.08) −0.05 (0.07)
JOYREAD beta5 0.10 (0.09) 0.09 (0.10) 0.07 (0.08) 0.06 (0.08)
MASTGOAL beta6 −0.20 (0.08) −0.19 (0.09) −0.14 (0.09) −0.13 (0.09)
ADAPTIVITY beta7 −0.05 (0.11) −0.04 (0.11) −0.02 (0.07) −0.02 (0.06)
TEACHINT beta8 0.03 (0.09) 0.02 (0.09) 0.00 (0.06) 0.00 (0.06)
SCREADDIFF beta9 −0.13 (0.10) −0.13 (0.10) −0.11 (0.09) −0.10 (0.09)
SCREADCOMP beta10 0.18 (0.10) 0.17 (0.10) 0.11 (0.09) 0.13 (0.10)
Residual sigma 0.83 (0.06) 0.83 (0.06) 0.83 (0.06) 0.83 (0.06)
LOO-ICc 258.1 (12.4) 258.0 (12.4) 256.9 (12.1) 257.1 (11.9)
a
262 divergent transitions generated after warmup.
b
29 divergent transitions generated after warmup.
c
Value in parentheses are LOO-IC standard errors.
First, note that the horseshoe and regularized horseshoe methods generated a
warning of divergent transitions after warmup. This message needs to be taken
seriously and implies that the complexity of the model is such that the HMC/NUTS
algorithm cannot pick up small changes in the curvature of the log posterior. As
such, the estimates may be biased. A possible solution to this problem is to adjust
the alpha delta setting to beyond the default of 0.99 and max treedepth setting
to beyond the default value of 12, and of course to check the model and priors.
For this example, we set adapt delta=.9999 and max treedepth=20 and still had
divergent transitions. Generally speaking, however, if other diagnostics such as
n eff and Rhat look good, then one can proceed to interpret the results, albeit
with caution. For more information on Stan program warnings, see https://
mc-stan.org/misc/warnings.html.
With this caveat in mind, a visual inspection of the results in Table 10.1 in-
dicates that the ridge and lasso priors provide results that are somewhat similar
to Bayesian linear regression with non-informative priors that we found in Table
5.1 (when standardized). On the other hand, the original horseshoe prior and
regularized horseshoe achieve slightly more shrinkage in the posterior estimates
and standard deviations with the regularized horseshoe yielding the most shrink-
age, and indeed shrinking some of the larger coefficients (e.g., beta2 and beta3), as
expected. In terms of cross-validation, however, we find that the horseshoe prior
yields the lowest value of the LOO-IC followed closely by the regularized horse-
shoe. A comparative analysis of this kind might be worthwhile in practice if the
goal of the analysis is not only variable selection but also comparative predictive
performance.
10.6.1 An Aside: The Spike-and-Slab Prior

In this chapter, we did not demonstrate the so-called spike-and-slab prior, which
has been considered the ”gold standard” for sparsity for quite some time (Mitchell
& Beauchamp, 1988; E. I. George & McCulloch, 1993). However, in the interest of
completeness, we should say a brief word about it.
The spike-and-slab prior gets its name because the prior distribution on the in-
dividual regression coefficients come from a two-component mixture of Gaussian
distributions and can be written as
βp | λp , c, ϵ ∼ λp N(0, c2 ) + (1 − λp )N(0, ϵ2 ) (10.7a)

λp ∼ Bernoulli(π) (10.7b)
where λp ∈ {0, 1} is an indicator variable that determines whether the coefficient is

close to zero, in which case it comes from the spike (λp = 0), or nonzero, in which
case it comes from the slab (λp = 1). To create a spike, it is common to set ϵ = 0.
The slab width c and the inclusion probability π of the Bernoulli random variable
is set by the user. Notice that with ϵ = 0 the spike and slab prior can be rewritten
as
βp | λp , c ∼ λp N(0, c2 λ2 ) (10.8a)
λp ∼ Bernoulli(π) (10.8b)
The result of this setup is that λ is a discrete parameter that only takes on two
values (λp = 0, 1).
It is necessary to note that Stan cannot incorporate discrete parameters. How-
ever, studies have shown the similarity in performance between the spike-and-slab
prior and the horseshoe prior (see, e.g., Carvalho et al., 2010; Polson & Scott, 2011).
Finally, the spike-and-slab prior is similar to the regularized horseshoe prior when
the slab width c < ∞, thus providing some regularization on large coefficients.
10.7 Summary
This chapter considered the problem of Bayesian variable selection and spar-
sity. Many variable selection methods can be implemented in the frequentist and
Bayesian framework, and some are explicitly Bayesian. However, both simulation
studies and real data analyses seem to point to the original horseshoe prior or
regularized horseshoe prior as the preferred methods for inducing sparsity, par-
ticularly with respect to out-of-sample predictive performance. As usual, in the
case of large sample sizes, application of sparsity-inducing priors will likely lead
to similar conclusions. Nevertheless, it may be prudent to examine results using
different priors and choose the model that yields desirable shrinkage along with
acceptable out-of-sample predictive performance.
In the end, however, a single model is selected for interpretation, and although
the predictive performance of Bayesian shrinkage methods is often better than
regression modeling without inducing sparsity, these methods do not account
for the uncertainty that underlies the choice of a single model. An approach
to addressing the problem of model selection is simply not to choose a single

model but to carefully average over the space of possible models that could have
generated the data. The next chapter takes up the problem of model uncertainty.
11
Model Uncertainty1
11.1 Introduction
In the previous chapter, we discussed Bayesian approaches to model regularization
that have the effect of balancing the bias-variance trade-off by shrinking regression
coefficients close to, or equal, to zero and allowing large coefficients to remain large.
In the end, regardless of whether one is using variable selection methods or not,
typically a final model is selected, and this model is often discussed as though
it was the model chosen ahead of time. The Bayesian framework recognizes,
however, that model selection is conducted under pervasive uncertainty insofar
as a particular model is typically chosen among a set of competing models that
could also have generated the data. This problem has been recognized for over 40
years. Early on, Leamer (1978, p. 91) noted that
...ambiguity over a model should dilute information about the re-
gression coefficients, since part of the evidence is spent to specify the
model.
Similar observations were made later by Draper et al. (1987, p. iii) who stated
This [model selection] tends to underestimate Your actual uncer-
tainty, with the result that Your actions both inferentially in science
and predictively in decision-making, are not sufficiently conserva-
tive.[Capitalization authors’.]
Furthermore, Hoeting, Madigan, Raftery, and Volinsky (1999) wrote
Standard statistical practice ignores model uncertainty. Data analysts
typically select a model from some class of models and then proceed
as if the selected model had generated the data. This approach ig-
nores the uncertainty in model selection, leading to over-confident
inferences and decisions that are more risky than one thinks they are.
(p. 382)
As the quotes by Leamer (1978), Draper et al. (1987), and Hoeting et al. (1999)
suggest, it is risky to settle on a single model for predictive or explanatory pur-
poses. Rather, it may be prudent to draw predictive strength through combining
1
Portions of this chapter are based on Kaplan (2021).
193
models. The purpose of this chapter is to provide an overview and some examples
of methods to address the problem of model uncertainty. First, we will discuss
the elements of predictive modeling that set the foundation for our discussion of
model uncertainty. We then turn to the method of Bayesian model averaging (BMA)
as a classical approach to addressing model uncertainty. We then point out that
Bayesian model averaging rests on a strong assumption regarding the existence
of a true model among those to be averaged, and so we will discuss the notion of
true models in the general context of M-frameworks (Bernardo & Smith, 2000).
Relaxing this assumption will lead us to a discussion and example of Bayesian
stacking.
The organization of this chapter is as follows. In the next section, we discuss
Bayesian predictive modeling as embedded in Bayesian decision theory. Here
we discuss the concepts of expected utility and expected loss, and frame these
ideas within the use of information-theoretic methods for judging decisions. We
show that the action which optimizes the expected utility is the BMA solution.
Then, we discuss the statistical elements of BMA, including connections to Bayes
factors, computation considerations, and the problem parameter and model priors.
This is followed by a simple example of BMA in linear regression modeling and
a comparison of results based on different parameter and model prior settings.
The next section is a presentation of methods for evaluating the quality of a
solution based on BMA, including the use of scoring rules and how they tie back
to the information-theoretic concepts discussed earlier in the chapter. Finally, we
discuss the main problem associated with conventional BMA — namely, that BMA
assumes that the true data-generating model is contained in the set of models that
are being averaged and demonstrate the method of Bayesian stacking that directly
deals with this assumption. A simple example of Bayesian stacking is provided.
11.2 Elements of Predictive Modeling

The foundations of BMA are situated in the Bayesian predictivist framework dis-
cussed in Bernardo and Smith (2000). An excellent review of Bayesian predictive
modeling can be also found in Vehtari and Ojanen (2012), and we will borrow
notation from their paper.
Arguably, the overarching goal of statistics is prediction. In other words, a key
characteristic of statistics is the development of accurate predictive models, and
all other things being equal, a given model is to be preferred over other competing
models if it provides better predictions of what actually occurred (Dawid, 1984).
Indeed, it is hard feel confident about inferences drawn from a model that does a
poor job of predicting the extant data. The problem, however, is how to develop
accurate predictive models, and how to evaluate their accuracy. The approach
used to develop accurate predictive models discussed in this chapter is BMA, and
the evaluation of BMA-based analyses is best situated in the context of Bayesian
decision theory.
Bayesian decision theory (see, e.g., Good, 1952; Lindley, 1991; Berger, 2013)
provides a natural and intuitive approach to evaluating Bayesian predictive mod-
els generally, and BMA in particular. Specifically, as will be expanded on below,
Model Uncertainty 195
Bayesian decision theory casts the problem of predictive evaluation in the context
of maximizing the expected utility of a model – that is, the benefit that is accrued
from using a particular model to predict future observations. The greater the ex-
pected utility, the better the model is at predictive performance in comparison to
other models.
11.2.1 Fixing Notation and Concepts

Let D = {yi , xi }ni=1 be a set of data assumed to be fixed in the Bayesian sense, where
yi is an outcome of interest and xi is a (possibly vector-valued) set of predictors.
Further, let ( ỹ, x̃) be future observations of the outcome and the set of predictors,
respectively. Further, let M = {Mk }Kk=1 represent a set of models specified to provide
a prediction of the outcome ỹ, and let Mk represent a specific chosen model.
The elements of Bayesian decision theory that we adopt in this chapter have
been described by Bernardo and Smith (2000) and Vehtari and Ojanen (2012)
among many others. These elements consist of (1) an unknown state of the world
denoted as ω ∈ Ω, (2) an action a ∈ A, where A is the action space, (3) a utility
function u(a, ω) : A × Ω → R that rewards an action a when the state of the world
is realized as ω, and (4) p(ω | D) representing one’s current belief about the state
of world conditional on observing the data, D.
To provide a context for these ideas, and in anticipation of our empirical
example, we return to the problem of predicting reading performance measured
on 15 year-old students in the United States using data from PISA (OECD, 2018).
Following Bernardo and Smith (2000), Lindley (1991), Vehtari and Ojanen (2012),
and Berger (2013) and the notation given previously, (1) the states of the world
ω ∈ Ω correspond to the future observations of reading literacy ỹ ∈ Y, (2) the
action a ∈ A is the actual prediction of those future observations, (3) the utility
function u(a, ỹ) defines the reward attached to the prediction, and (4) p( ỹ | D, M∗ ) is
a posterior predictive distribution that encodes our belief about the future reading
literacy observations conditional on the data, D, and a belief model, M∗ (described
later).
11.2.2 Utility Functions for Evaluating Predictions

The goal of predictive modeling is to optimize the utility of taking an action a. A
number of utility functions exist, but common utility functions rest on the negative
quadratic loss function
u(a, ỹ) = −( ỹ − a)2 (11.1)
The optimal action a∗ is the one that maximizes the posterior expected utility, written
as (see Clyde & Iversen, 2013)
Z
a∗ = u(ω, a)p(ω | D)dω (11.2)
Ω
The idea here is to take an action a that maximizes the utility u when the future
observation is ỹ. Clyde and Iversen (2013) show that the optimal decision obtains
when a∗ = E( ỹ | D), which is the posterior predictive mean of ỹ given the data D.
Under the assumption that the true data-generating model exists and is among
the set of models under consideration, this can be expressed as
K
X K
X
E( ỹ | D) = E( ỹ | Mk , D)p(Mk | D) = p(Mk | D) ỹˆ Mk (11.3)
k=1 k=1
where ỹˆ Mk is the posterior predictive mean of ỹ under Mk . The expression in

Equation (11.3) represents Bayesian model averaging.
It is important to note that when considering the selection of a single model, one
might be tempted to chose the model with the highest posterior model probability
(PMP) p(Mk | D). In the case of only two models, the model with the largest PMP
will be the closest to the BMA solution. However, for more than two models,
Clyde and Iversen (2013) point out that the model closest to the BMA solution
might not be the one with the largest PMP.
11.3 Bayesian Model Averaging

In a complex real-world setting such as the one we find ourselves in when trying
to develop a predictive model of reading literacy, substantive discussions would
suggest that many different models could be entertained as reasonable models
of reading literacy. In this case, we agree with Clyde and Iversen (2013) that to
maximize our utility it would be best to average over the space of possible models
via BMA.
Bayesian model averaging has had a long history of theoretical developments
and practical applications. Early work by Leamer (1978) laid the foundation for
BMA. Fundamental theoretical work on BMA was conducted in the mid-90s by
Madigan and his colleagues (e.g., Madigan & Raftery, 1994; Raftery, Madigan, &
Hoeting, 1997; Hoeting et al., 1999). Additional theoretical work was conducted
by Clyde (1999, 2003). Draper (1995) discussed how model uncertainty can arise
even in the context of experimental designs, and Kass and Raftery (1995) provided
a review of Bayesian model averaging and the costs of ignoring model uncertainty.
Reviews of the general problem of model uncertainty can be found in Clyde and
George (2004) and more recently in Steel (2020) with a focus on economics. A
review of Bayesian model averaging with a focus on psychology can be found in
Hinne, Gronau, van den Bergh, and Wagenmakers (2020).
Practical applications and developments of Bayesian model averaging can be
found across a wide variety of domains. A perusal of the extant literature shows
applications of Bayesian model averaging to economics (e.g., Fernández, Ley, &
Steel, 2001b), political science (e.g., Montgomery & Nyhan, 2010), bioinformatics
of gene express (e.g., Yeung, Bumgarner, & Raftery, 2005), weather forecasting
(e.g., Raftery, Gneiting, Balabdaoui, & Polakowski, 2005; Sloughter, Gneiting,
& Raftery, 2013), propensity score analysis (Kaplan & Chen, 2014), structural
equation modeling (Kaplan & Lee, 2016), missing data (Kaplan & Yavuz, 2019),
and probabilistic forecasting with large-scale educational assessment data (Kaplan
& Huang, 2021).
The popularity of BMA across many different disciplines is due to the fact
that BMA is known to provide better out-of-sample predictive performance than
any other model under consideration as measured by the logarithmic scoring
rule (Raftery & Zheng, 2003). In addition, Bayesian model averaging has been
implemented in the R software programs BMA (Raftery, Hoeting, Volinsky, Painter,
& Yeung, 2020), BMS (Zeugner & Feldkircher, 2015), and BAS (Clyde, 2017).
These packages are quite general, allowing Bayesian model averaging over linear
models, generalized linear models, and survival models, with flexible handling of
parameter and model priors.
11.3.1 Statistical Specification of BMA

Following Madigan and Raftery (1994), consider a quantity of interest such as a
future observation. Again, denoting this quantity as ỹ, our goal is to obtain an
optimal prediction of ỹ, in the sense that the utility of predicting ỹ is maximized.
Next, consider a set of competing models M = {Mk }Kk=1 that are not necessarily
nested. The posterior distribution of ỹ given the data D can be written as a
mixture distribution,
K
X
p( ỹ | D) = p( ỹ | Mk )p(Mk | D) (11.4)
k=1
where p(Mk | D) is the posterior probability of model Mk written as

p(D | Mk )p(Mk )
p(Mk | D) = PK (11.5)
l=1 p(D | Ml )p(Ml )
where the first term in the numerator on the right-hand side of Equation (11.5) is
the probability of the data given model k, also referred to as the integrated likelihood
written as Z
p(y | Mk ) = p(y | θk , Mk )p(θk | Mk )dθk (11.6)
where p(θk | Mk ) is the prior distribution of the parameters θk under model Mk

(Raftery et al., 1997). The posterior model probabilities can be considered mixing
weights associated with the mixture distribution given in Equation (11.4) (Clyde
& Iversen, 2013). The second term p(Mk ) on the right-hand side of Equation (11.5)
is the prior model probability for model k, allowing each model to have a different
prior probability based on past performance of that model or a belief regarding
which of the models might be the true data-generating model. The denominator
of Equation (11.5) ensures that p(Mk | y) integrates to 1.0, as long as the true model
is in the set of models under consideration. We will defer the discussion of a true
model until later and then explicitly deal with the case where the true model is
not in the set of models under consideration.
11.3.2 Computational Considerations

The problem of reducing the overall number of models that one could incorporate
in the summation of Equation (11.4) has led to two important computational solu-
tions. One solution is based on the so-called Occam’s window criterion (Madigan &
Raftery, 1994) and the other is based on a Metropolis sampler referred to as Markov
chain Monte Carlo model composition (Madigan & York, 1995).
Occam’s Window
To motivate the idea behind Occam’s window, consider the problem of finding
the best subset of predictors in a linear regression model. Following closely the
discussion given in Raftery et al. (1997), we would initially start with a very
large number of predictors; but the goal would be to pare this down to a smaller
number of predictors that provide accurate predictions. As noted in the earlier
quote by Hoeting et al. (1999), the concern in drawing inferences from a single best
model is that the choice of a single set of predictors ignores uncertainty in model
selection. Occam’s window provides an approach to BMA that reduces the subset
of models under consideration, but instead of settling on a final ”best” model, we
instead integrate over the parameters of the smaller set with weights reflecting the
posterior uncertainty in each model.
The algorithm proceeds as follows (Raftery et al., 1997). In the initial step,
the space of possible models is reduced by implementing the so-called leaps and
bounds algorithm developed by Furnival and Wilson, Jr. (1974) in the context of
best subsets regression (see also Raftery, 1995). This initial step can substantially
reduce the number of models, after which Occam’s window can then be employed.
The general idea is that models are eliminated from Equation (11.4) if they predict
the data less well than the model that provides the best predictions based on a
caliper value C chosen in advance by the analyst. The caliper C sets the width of
Occam’s window. Formally, consider again a set of models {Mk }Kk=1 . Then, the set
A′ is defined as ( )
maxl {p(Ml | y)}
A′ = Mk : ≤C (11.7)
p(Mk | y)
In words, Equation (11.7) compares the model with the largest posterior model
probability, maxl {p(Ml | y)}, to a given model, p(Mk | y). If the ratio in Equation
(11.7) is greater than the chosen value C, then it is discarded from the set A′ of
models to be included in the model averaging. Notice that the set of models
contained in A′ is based on Bayes factor values.
The set A′ now contains models to be considered for model averaging. In the
second, optional, step, models are discarded from A′ if they receive less support
from the data than simpler submodels. Formally, models are further excluded
from Equation (11.4) if they belong to the set
( )
p(Ml | y)
B = Mk : ∃Ml ∈ A′ , Ml ⊂ Mk , >1 (11.8)
p(Mk | y)
Again, in words Equation (11.8) states that there exists a model Ml within the set
A′ and where Ml is simpler than Mk . If a complex model receives less support
from the data than a simpler submodel — again based on the Bayes factor — then
it is excluded from B. Notice that the second step corresponds to the principle of
Occam’s razor (Madigan & Raftery, 1994).
With Step 1 and Step 2, the problem of reducing the size of the model space
for BMA is simplified by replacing Equation (11.4) with
X
p( ỹ | y, A) = p( ỹ | Mk , y)p(Mk | y, A) (11.9)
Mk ∈A
In other words, models under consideration for BMA are those that are in A′ but
not in B.
Madigan and Raftery (1994) outline an approach to the choice between two
models to be considered for Bayesian model averaging. To make the approach
clear, consider the case of just two models M1 and M0 , where M0 is the simpler of the
two models. This could be the case where M0 contains fewer predictors than M1 in
a regression analysis. In terms of posterior odds, if the odds are positive, indicating
support for M1 , then we reject M0 . If the posterior odds are large and negative,
then we reject M1 in favor of M0 . Finally, if the posterior odds lie in between the
pre-set criterion, then both models are retained. For linear regression models, the
leaps and bounds algorithm combined with Occam’s window is available in the
bicreg option in the R program BMA (Raftery et al., 2020).
11.3.3 Markov Chain Monte Carlo Model Composition

Markov chain Monte Carlo model composition (MC3 ) is based on the Metropolis-
Hastings algorithm (see, e.g., Gilks, Richardson, & Spiegelhalter, 1996b), discussed
in Chapter 4, and is also designed to reduce the space of possible models that can
be explored via Bayesian model averaging. Following Hoeting et al. (1999), the
MC3 algorithm proceeds as follows. First, let M represent the space of models
of interest; in the case of linear regression this would be the space of all possible
combinations of variables. Next, the theory behind MCMC allows us to construct
a Markov chain {M(t), t = 1, 2, . . . , } which converges to the posterior distribution
of model k, that is, p(Mk | y).
The manner in which models are retained under MC3 is as follows. First,
for any given model currently explored by the Markov chain, we can define a
neighborhood for that model which includes one more variable and one less
variable than the current model. So, for example, if our model has four predictors
x1 , x2 , x3 , and x4 , and the Markov chain is currently examining the model with x2
and x3 , then the neighborhood of this model would include {x2 }, {x3 }, {x2 , x3 , x4 }, and
{x1 , x2 , x3 }. Now, a transition matrix is formed such that moving from the current
model M to a new model M′ has probability zero if M′ is not in the neighborhood
of M and has a constant probability if M′ is in the neighborhood of M. The model
M′ is then accepted for model averaging with probability
pr(M′ | y)
( )
min 1, (11.10)
pr(M | y)
otherwise, the chain stays in model M. We recognize the term inside Equation
(11.10) as the Markov acceptance ratio presented in Equation (4.4).
11.3.4 Parameter and Model Priors

One of the key steps when implementing BMA is to choose priors for both the
parameters of the model and the model space itself. Discussions of the choice of
parameter and model priors can be found in Fernández, Ley, and Steel (2001a);
Liang, Paulo, Molina, Clyde, and Berger (2008); Eicher, Papageorgiou, and Raftery
(2011) and Feldkircher and Zeugner (2009), with applications found in Fernández
et al. (2001a) and an example comparing parameter and model priors with large-
scale educational data can be found in Kaplan and Huang (2021).
Parameter Priors
A large number of choices for parameter priors are available in the R software
program BMS (Zeugner & Feldkircher, 2015) and are based on variations of Zell-
ner’s g-prior (Zellner, 1986). Specifically, Zellner introduced a natural-conjugate
normal-gamma g-prior for regression coefficients β under the normal linear re-
gression model, written as
yi = x′i β + ε (11.11)
where ε is iid N(0, σ2 ). For a given model, say Mk , Zellner’s g-prior can be written
as −1
βk | σ2 , Mk , g ∼ N 0, σ2 g x′k xk (11.12)
Feldkircher and Zeugner (2009) have argued for using the g-prior for two reasons:
its consistency in asymptotically uncovering the true model, and its role as a
penalty term for model size.
The g-prior has been the subject of some criticism. In particular, Feldkircher
and Zeugner (2009) have pointed out that the particular choice of g can have a
very large impact on posterior inferences drawn from BMA. Specifically, small
values of g can yield a posterior mass that is spread out across many models while
large values of g can yield a posterior mass that is concentrated on fewer models.
Feldkircher and Zeugner (2009) use the term supermodel effect to describe how
values of g impact the posterior statistics, including posterior model probabilities
(PMPs) and posterior inclusion probabilities (PIPs).
To account for the supermodel effect, researchers such as Fernández et al.
(2001a), Liang et al. (2008), Eicher et al. (2011), and Feldkircher and Zeugner
(2009) have proposed alternative priors based on extensions of the work of Zellner
(1986). Generally speaking, these alternatives can be divided into two categories:
fixed priors and flexible priors. Examples of fixed parameter priors include the
unit information prior, the risk inflation criterion prior, the benchmark risk inflation
criterion, and the Hannan-Quinn prior (Hannan & Quinn, 1979). Examples of
flexible parameter priors include the local empirical Bayes prior (E. George & Foster,
2000; Liang et al., 2008; Hansen & Yu, 2001), and the family of hyper-g priors
(Feldkircher & Zeugner, 2009).
Model Priors
In addition to parameter priors, it is essential to consider priors over the space of
possible models, which concerns our prior belief regarding whether the true model
lies within the space of possible models. Among those implemented in BMS, the
uniform model prior is a common default prior which specifies that if there are Q
predictors in the model, then the prior on the model space is 2−Q . The problem
with the uniform model prior is that the expected model size is Q/2, when in
fact there are many more models of intermediate size than there are models with
extreme sizes. For example, with six variables, there are more models of size 2 or
5 than there are 1 or 6. As a result, the uniform prior ends up placing more mass
on models of intermediate size. An alternative is the binomial model prior which
proposes placing a fixed inclusion probability θ on each predictor of the model.
The problem is that θ is treated as fixed, and so to remedy this problem, Ley and
Steel (2009) proposed a beta-binomial model prior which treats θ as random specified
by a beta distribution. Unlike the uniform model prior, the beta-binomial prior
will place equal mass across the models regardless of size.
11.3.5 Evaluating BMA Results: Revisiting Scoring Rules

With such a wide variety of parameter and model priors to choose from, it is
important to have a method for evaluating the impact of these choices when
applying BMA to substantive problems. Given that the utility of BMA lies in its
optimal predictive performance, a reasonable method for evaluation should be
based on measures that assess predictive performance – referred to as scoring rules
– which we introduced in Chapter 1.
A large number of scoring rules have been reviewed in the literature (see,
e.g., Winkler, 1996; Bernardo & Smith, 2000; Jose et al., 2008; Merkle & Steyvers,
2013; Gneiting & Raftery, 2007). Here, however, we highlight two related, strictly
proper scoring rules that are commonly used to evaluate predictions arising from
Bayesian model averaging: the log predictive density score, and the Kullback-
Leibler divergence score (see, e.g., Fernández et al., 2001b; Hoeting et al., 1999;
Kaplan & Yavuz, 2019; Kaplan & Huang, 2021, for examples).
The Log Predictive Density Score

The log predictive density (LPD) score (Good, 1952; Bernardo & Smith, 2000) can
be written as X
log p( ỹi , y, x̃i )

− (11.13)
i
where, for example, ỹi is the predictive density for ith person, x and y represent
the model information for the remaining individuals, and x̃i is the information on
the predictors for individual i. The model with the lowest log predictive score is
deemed best in terms of long-run predictive performance.
Kullback-Leibler Divergence Score

The Kullback-Leibler divergence measure was introduced in our discussion of
variational Bayes in Chapter 4. To briefly reiterate, we consider two distributions,
p(y) and g(y | θ), where p(y) could be the distribution of observed reading literacy
scores, and g(y | θ) could be the prediction of these reading scores based on a
model. The KLD between these two distributions can be written as
Z !
p(y)
KLD( f, g) = p(y)log dy (11.14)
g(y | θ)
where KLD( f, g) is the information lost when g(y | θ) is used to approximate

p(y). For example, the actual reading outcome scores might be compared to the
predicted outcome using Bayesian model averaging along with different choices
of model and parameter priors. The model with the lowest KLD measure is
deemed best in the sense that the information lost when approximating the actual
reading outcome distribution with the distribution predicted on the basis of the
model is lowest.
Example 11.1: Bayesian Model Averaging
We will focus again on the reading literacy results from PISA 2018. The list of
variables used in this example is given below in Table 11.1.
TABLE 11.1. PISA 2018 predictors of reading literacy

Variable name Variable label
FEMALE Sex (1=Female)
ESCS Index of economic, social, and cultural status
METASUM Meta-cognition: summarizing
PERFEED Perceived feedback
HOMEPOS Home possessions (WLE)a
ADAPTIVE Adaptive instruction (WLE)
TEACHINT Perceived teacher’s interest
ICTRES ICT resources (WLE)
JOYREAD Joy/Like reading
ATTLNACT Attitude towards school: learning activities
COMPETE Competitiveness
WORKMAST Work mastery
GFOFAIL General fear of failure
SWBP Subjective well-being: positive affect
MASTGOAL Mastery goal orientation
BELONG Subjective well-being: sense of belonging to school (WLE)
SCREADCOMP Perception of reading competence
SCREADDIFF Perception of reading difficulty (WLE)
PISADIFF Perception of difficulty of the PISA test
PV1READ First plausible value reading score (outcome variable)b
a
Weighted likelihood estimates. See OECD (2018) for more details.
b
Plausible values. See OECD (2018) for more details.
For this example, we use the software package BMS (Zeugner & Feldkircher,
2015), which implements the so-called Birth/Death (BD) algorithm as a default for
conducting MC3. See Zeugner and Feldkircher (2015) for more details on the BD
algorithm.
The analysis steps for this example are as follows:
1. We begin by implementing BMA with default unit information priors for
the model parameters and the uniform prior on the model space. We will
outline the major components of the results including the posterior model
probabilities and the posterior inclusion probabilities.
2. We next examine the results under different combinations of parameter and
model priors available in BMS and compare results using the LPS and KLD.
BMA Results
We first call BMS and use the unit information prior (g) (g=”uip”)and the uniform
model prior (mprior) (mprior=”uniform”).
PISAbmsMod1 <- bms(PISAdata, g="uip",mprior="uniform")
# Plot the model size and obtain marginal density plots
plotModelsize(PISAbmsMod1,col="black")
density(PISAbmsMod1)
# Obtain predicted values based on BMA results

and calculate the predicted density based on the mixture of
the n best models (default 500).
predicted.values.Mod1 <- predict(PISAbmsMod1)

predDensM1 <- pred.density(PISAbmsMod1,newdata=PISAdata)
density(predicted.values.Mod1)
plot(predPlot.Mod1,col="black")
The Bayesian model averaging results under unit information priors for model
parameters and the uniform prior for the model space are shown in Tables 11.2
and 11.3. We note that there are 19 predictors and thus 219 = 524, 288 models in
the full space of models to be visited. Table 11.2 presents a summary of the BD
algorithm used to implement MC3 in BMS. We find that the algorithm only visited
471 models (0.09%) out of the total model space, however, these models accounted
for 100% of the posterior model mass.2 The column labeled ”Avg # predictors”
shows that across all of the models explored by the algorithm, the average number
of predictors was 11.8 out of 19.
2
This percentage is obtained by summing over the PMPs for all models explored by the
algorithm and dividing by the total number of those models.
In the second row of Table 11.2 below we present the posterior model prob-
abilities associated with the top 5 models out of the 471 models explored by the
algorithm. It is important to note that Model 1 would also be associated with the
lowest Bayesian information criterion. Hence, on the basis of the low PMP for
Model 1 (0.35), we can see that selecting Model 1 and acting as though this is the
model we considered ahead of time considerably underestimates the uncertainty
in our model choice. Moreover, as Clyde and Iversen (2013) remind us, this model
might not be the one closest to the BMA solution.
TABLE 11.2. Summary of birth/death algorithm and top posterior model probabilities
Summary of Algorithm
Modelspace 2K Models visited % visited % Topmodels Avg. # predictors

524288 841 0.09 100 10.8
Posterior Model Probabilities
Model 1 Model 2 Model 3 Model 4 Model 5

0.35 0.20 0.06 0.04 0.03
It may be interesting to examine the impact of the model prior on the posterior
distribution of model sizes. The results are shown below in Figure 11.1.
FIGURE 11.1. Posterior model size under unit information parameters priors and uniform
model prior.
Notice in Figure 11.1 that although the code called for the uniform model prior, we
see that the prior model size is not actually uniform. This is because, as discussed
in the earlier section on model 9, the mean model size is approximately nine.
That is, there are more models containing nine predictors than, say, containing
two predictors. The uniform model prior adjusts for this by placing more mass on
models with larger numbers of predictors. Note also that the mean posterior model
size is approximately 11, thus somewhat larger than the mean prior model size,
suggesting that after encountering data the posterior places greater importance on
slightly larger models.
Table 11.3 below presents a summary of the BMA results. The column labeled
“PIP” shows the posterior inclusion probabilities for each variable, referring to
the sum of the posterior model probabilities for all models for which the variable
was included. For example, the PIP for ESCS is 1.00, meaning that 100% of
the posterior model mass rests on models that include ESCS. In contrast, only
0.09% of the model mass rests on ATTLNACT. The PIP thus provides a different
perspective on variable importance. The columns labeled “Post Mean” and “Post
SD” are the posterior estimates of the regression coefficients and their posterior
standard deviations, respectively. The column labeled “Cond. Pos. Sign” refers
to the probability that the sign of the respective regression coefficient is positive
conditional on its inclusion in the model. We find, for example, that the sign of
ESCS is positive in 100% of the models in which ESCS appears. By contrast, the
probability that the sign of the PISADIFF effect positive is zero, meaning that in
100% of the models visited by the algorithm, the sign of PISADIFF is negative.3
Finally, we present the frequentist p-values from a simple ordinary least-square
regression applied to these data.
3
The probabilities listed under the Cond. Pos. Sign column will often range from zero to
one, but for these results, it appears that all 471 models show clarity with respect to the sign
of the posterior coefficients.
TABLE 11.3. Summary of BMA with unit information parameter priors and uniform model
priors
Predictor PIP Post. Coeff. Post. SD Cond. pos. sign Freq. p-value
ESCS 1.00 18.97 1.30 1.00 0.000
METASUM 1.00 27.99 1.29 1.00 0.000
TEACHINT 1.00 12.53 1.51 1.00 0.000
JOYREAD 1.00 10.33 1.32 1.00 0.000
GFOFAIL 1.00 11.06 1.21 1.00 0.000
MASTGOAL 1.00 −13.34 1.50 0.00 0.000
SCREADCOMP 1.00 10.13 1.49 1.00 0.000
PISADIFF 1.00 −29.71 1.46 0.00 0.000
PERFEED 0.98 −5.00 1.55 0.00 0.001
SWBP 0.87 −3.95 1.92 0.00 0.015
WORKMAST 0.86 4.28 2.16 1.00 0.008
FEMALE 0.64 5.14 4.37 1.00 0.006
ADAPTIVE 0.12 0.45 1.31 1.00 0.053
SCREADDIFF 0.08 −0.19 0.78 0.00 0.007
COMPETE 0.07 0.19 0.79 1.00 0.038
BELONG 0.05 −0.15 0.72 0.00 0.020
HOMEPOS 0.02 0.02 0.29 1.00 0.046
ICTRES 0.01 −0.02 0.23 0.00 0.020
ATTLNACT 0.00 0.00 0.09 1.00 0.500
We find that the first 12 predictors (ESCS thru GENDER) have relatively high
PIPs. The majority of these predictors have PIPs of 1.0, indicating their importance,
and these are also associated with statistically significant p-values. It is also inter-
esting to note that these predictors contain a mix of demographic measures (e.g.,
ESCS, GENDER), attitudes/perceptions (e.g., TEACHINT, JOYREAD, SCREAD-
COMP), and cognitive strategies involved in reading (e.g., METASUM). Perhaps
most importantly, we find that coefficients that are statistically significant from the
ordinary least square regression have very small posterior inclusion probabilities
when accounting for model uncertainty. For example, the variable SCREADDIFF
(self-perception of reading difficulty) has an OLS estimate of −4.5 (not shown)
and is statistically significant (p = .007). However, when accounting for model
uncertainty, this coefficient is −0.19 with a posterior inclusion probability of 0.08.
This finding is what is meant by “...over confident inferences...” (Hoeting et al.,
1999).
BMA Model Comparison

In this section, we compare the results from the model in the previous section which
used UIP parameter priors and the uniform model prior to an alternative model
with different prior settings. For this comparison, we slightly alter the previous
call to BMS by requesting the risk information criterion for parameters priors, and
the beta-binomial model prior which is requested by writing mprior=”random”.
PISAbmsMod2 <- bms(PISAdata, g="uip",mprior="random",mprior.size=9.5)
The results of the BD algorithm are shown below in Table 11.4.
TABLE 11.4. Summary of birth/death algorithm and top posterior model probabilities:
Model 2
Summary of Algorithm
Modelspace 2K Models visited % visited % Topmodels Avg. # predictors

524288 985 0.19 100 11.37
Posterior Model Probabilities
Model 1 Model 2 Model 3 Model 4 Model 5

0.09 0.06 0.04 0.03 0.03
The posterior model probabilities are uniformly smaller under this set of prior
specifications. The impact of the beta-binomial model prior on the distribution of
model size is shown below in Figure 11.2.
FIGURE 11.2. Posterior model size under unit information parameters priors and beta-
binomial model prior.
Here we see that the prior distribution of model size is completely flat and we
note that posterior model size under the beta-binomial prior is slightly higher
than under the uniform model prior. Finally, the results of the BMA are displayed
below in Table 11.5.
TABLE 11.5. Summary of BMA with unit information parameter priors and random model
priors: Model 2
Predictor PIP Post. mean Post. SD Cond. pos. sign

ESCS 1.00 19.95 1.50 1.00
METASUM 1.00 29.17 1.42 1.00
TEACHINT 1.00 11.05 1.70 1.00
JOYREAD 1.00 9.83 1.46 1.00
GFOFAIL 1.00 11.47 1.35 1.00
MASTGOAL 1.00 −13.12 1.65 0.00
SCREADCOMP 1.00 9.74 1.75 1.00
PISADIFF 1.00 −29.46 1.80 0.00
WORKMAST 0.72 3.55 2.59 1.00
PERFEED 0.63 −2.62 2.31 0.00
SWBP 0.50 −2.05 2.29 0.00
FEMALE 0.38 2.83 4.04 1.00
SCREADDIFF 0.36 −1.51 2.24 0.00
BELONG 0.32 −1.25 2.03 0.00
COMPETE 0.21 0.72 1.55 1.00
ADAPTIVITY 0.10 0.31 1.10 1.00
ATTLNACT 0.06 0.06 0.42 1.00
ICTRES 0.05 −0.13 0.70 0.00
HOMEPOS 0.04 0.10 0.73 1.00
We find that the results are virtually identical to those in Table 11.3 under uni-
form model priors, with differences among variables with much smaller inclusion
probabilities. This can be seen by comparing the PIPs in Figure 11.3 below.
FIGURE 11.3. Comparison of inclusion probabilities across variables. Model 1 = uniform

model prior; Model 2 = beta-binomial model prior.
Scoring Predictive Performance

A more formal approach to comparing the results of the two models is to compare
their predictive distributions using the log predictive score and the Kullback-
Leibler divergence measure. To call this in R we use the program LaplacesDemon
(Statisticat & LLC., 2021). The commands are
KLD(predicted.values.Mod1, PISAdata[,1])
lps.bma(predDensM1, realized.y= PISAdata[,1])
KLD(predicted.values.Mod2, PISAdata[,1])
lps.bma(predDensM2, realized.y= PISAdata[,1])
The results of this comparison are shown in Kaplan (2021) in which the KLD and
LPS measures for both model are virtually identical. The KLD values for both
models were 0.015 and the LPS scores for both models were 5.84. This finding
suggests the choice of these parameter and model priors does not impact the
predictive performance of these models, and this is perhaps not surprising given
the large sample size for this example.
11.4 True Models, Belief Models, and

M-Frameworks
Earlier, our discussion made mention of belief models and true models, and a
perusal of the extant literature on model averaging reveals a distinction between
a so-called actual belief model (sometimes referred to as a Bayesian belief model), M∗ ,
and a true model, denoted as MT . Unfortunately, there does not appear to be a
consensus about the meaning of belief models or true models, and in some cases,
they are viewed as roughly the same thing. For example, Bernardo and Smith
(2000) introduced the idea of the actual belief model but seem to be describing it as
the true model, and in fact labeled the actual belief model as MT , suggesting that
M∗ is the true but unknown data-generating model. This position appears to be
held by Clyde and Iversen (2013) who adopt a similar notation and seem to use
the terms “belief model” and ”true model” interchangeably. In contrast, Vehtari
and Ojanen (2012) suggest that M∗ is different from MT and is something that we
have access to insofar as it derived from what we have learned from our encounter
with data. For our example, M∗ would be the result of what has been learned from
the construction and criticism of a substantive model of reading literacy. That
is, if (1) one has specified a reasonable probability model for the reading literacy
outcome, (2) one has access to a rich enough set of predictors of reading literacy,
(3) all of the important prior uncertainties have been captured, and (4) the model
has withstood criticisms, say in the form of posterior predictive checks (Gelman,
Meng, & Stern, 1996), then this model is M∗ . Regarding MT , Vehtari and Ojanen
(2012, p. 155) suggest that “the properties of the true model are specified by the
modeller a priori, and they are not learned from the data properties.” Vehtari and
Ojanen (2012) view the specification of MT in very general terms, such as an iid
Gaussian assumption regarding the outcome of interest.
11.4.1 Model Averaging in the M-Closed Framework

As implied earlier, the M-closed framework for BMA may be especially difficult
to warrant in the social and behavioral sciences. Nevertheless, as pointed out by
Bernardo and Smith (2000), there may be cases in which it is reasonable to act as
though there is a true model, keeping in mind that they seem to suggest that MT is
what we are referring to here as M∗ . For example, we may wish to act as though M-
closed holds when a model has demonstrated good predictive capabilities under a
wide variety of situations, but that under a new situation, new uncertainties arise.
As long as the analyst is comfortable assigning model priors, then the M-
closed framework can be adopted. Nevertheless, the truth or falsity of the M-closed
framework notwithstanding, it is important to reiterate that conventional BMA
takes place under the M-closed framework and, indeed, readily available BMA
software typically employ a non-informative prior to the space of models as a
default, with the idea that the true model lies in the model space. Indeed, as noted
by Clyde and Iversen (2013), the posterior model probabilities will converge to 1.0
for the model that is closest to the true model (under Kullback-Leibler divergence),
and will equal the true model in an infinitely large sample size. If MT < M, then
BMA is not consistent.
11.4.2 Model Averaging in the M-Complete Framework

With the M-closed assumption unlikely to hold in practice, we are faced with
the problem of how to obtain the benefits of model averaging with respect to
predictive accuracy. One approach would be to create a list of simpler “proxy”
linear models, {Mk }Kk=1 specified for clarity of communication and ease of analysis
(Bernardo & Smith, 2000). Each of these models would be evaluated in light of the
true model. This is referred to as the M-complete framework (Bernardo & Smith,
2000). Under M-complete , BMA would not, in principle, be conducted as it does
not make sense to place a discrete prior on the model space when one does not
believe that MT ∈ M. Instead, as suggested by Clyde and Iversen (2013), Yao et
al. (2018a), and Vehtari and Ojanen (2012), one simply selects the model Mk ∈ M
that maximizes expected utility with respect to predictive distributions. Such a
model might also have been estimated using one of a number of sparsity-inducing
priors (see Chapter 10). However, this suggests that a single model is being used
for predictive purposes with the result that model uncertainty is again not being
addressed.
More recently, Clarke and Clarke (2018) discussed the idea that M-complete
could constitute a range of inaccessibility to MT , and that methods such as BMA
could be justified under M-complete insofar as the model priors would encode
one’s belief as to how good an approximation a given model is to MT under an
M-closed situation. In any event, modeling under the M-complete framework does
not provide an approach to directly address the problem of model uncertainty,
when M-closed is hard to maintain.
11.4.3 Model Averaging in the M-Open Framework

If it is difficult to warrant model priors as required under M-closed, and if se-
lecting a single model under M-complete that maximizes expected utility is not
satisfactory, then we need an approach that allows for model averaging without
the need to assume that MT ∈ M. This is referred to as the M-open framework
(Bernardo & Smith, 2000). An example of an M-open problem is in specifying a set
of regression models with different choices of predictors. These different regres-
sion models would represent reasonable alternative belief models. Then, rather
than using posterior model probabilities as weights, each model would yield a
separate score without presuming the existence of a true model underlying any
of the separate models. These models would be combined using their scores as
weights, and the resulting predictive distribution would be obtained. This type of
model averaging in the M-open framework describes the methodology of Bayesian
stacking which we consider next.
11.5 Bayesian Stacking

The method of stacking was originally developed in the machine learning literature
by Wolpert (1992) and Breiman (1996) and brought into the Bayesian paradigm
by Clyde and Iversen (2013). The basic idea behind stacking is to enumerate a
set of K (k = 1, 2, . . . K) models and then create a weighted combination of their
predictions. Returning to our reading literacy example, we can specify a set of
candidate (belief) models of reading literacy as
y = fk (x) + ϵ (11.15)
where fk are different models of the reading literacy outcome, for example, some
models may include only demographic predictors, while others may include vari-
ous combinations of attitudes and behaviors related to reading literacy. Indeed, fk
might even reflect a non-linear model of reading literacy. Predictions from these
separate models are then combined (stacked) as (see Le & Clarke, 2017)
K
X
ỹ = ŵk fˆk (x) (11.16)
k=1
where fˆk estimates fk . The weights, ŵk (ŵ1 , ŵ2 , . . . ŵK ), are obtained as

n  K
2
X X 
ŵ = argmin  yi −

w k fˆk,−i (xi ) (11.17)
w
 
i=1 k=1
where fˆk,−i (xi ) is an estimate of fk based on n − 1 observations, leaving the ith

observation out.
11.5.1 Choice of Stacking Weights

It is interesting to note that LOO-CV has connections to other types of weights
that can be used for stacking. For example, as noted in Section 6.3.4, in the case of
maximum likelihood estimation LOO-CV weights are asymptotically equivalent
to Akaike information criterion (AIC) weights (Stone, 1977) that are used in
frequentist model averaging applications (Yao et al., 2018a, see also; Burnham &
Anderson, 2002; Fletcher, 2018). In addition, so-called pseudo-BMA weights were
proposed by Geisser and Eddy (1979, see also; Gelfand, 1996). This approach
replaces marginal likelihoods with LOO-CV predictive densities. The difficulty
with pseudo-BMA weights is that they do not take into account uncertainty in
future data distributions. To address this issue, Yao et al. (2018a) proposed an
approach that combines the Bayesian bootstrap (see Rubin, 1981) with the ELPD
defined earlier. They refer to this approach as pseudo-BMA+ and show that it
performs better than BMA and pseudo-BMA in M − open settings, but not as well
known as stacking using the log-score.
Example 11.2: Bayesian Stacking
For this chapter, we demonstrate Bayesian stacking using the software pro-
gram loo with the same PISA 2018 dataset used to demonstrate BMA. The analysis
steps for this demonstration are as follows:
1. Specify four models of reading literacy. From Table 11.1, Model 1 in-
cludes only demographic measures (FEMALE, ESCS, HOMEPOS, ICTRES);
Model 2 includes only attitudes and behaviors specifically directed toward
reading (JOYREAD, PISADIFF, SCREADCOMP, SCREADDIF); Model 3 in-
cludes predictors related to academic mindset as well as general well-being;
(METASUM, GFOFAIL, MASTGOAL, SWBP, WORKMAST, ADAPTIVITY,
COMPETE); and Model 4 includes attitudes toward school (PERFEED,
TEACHINT, BELONG).
2. Obtain results from log-score stacking weights, pseudo-BMA weights, and
pseudo-BMA+ weights.
3. Obtain posterior predictive distributions using the R software program rsta-
narm (Goodrich et al., 2022).
4. Obtain KLD measures comparing the predicted distribution of reading
scores to the observed distribution.
The following code is can be used to implement Bayesian stacking. To begin,
we require rstanarm, loo, and LaplacesDemon.
library(rstanarm)
library(loo)
library(LaplacesDemon)
After reading in the data, we write a list containing the models to be compared.
fits <- list()

# model 1 variables
fits[[1]] <- update(fits[[1]], formula = PV1READ ˜ Female + ESCS
+ HOMEPOS + ICTRES)
# model 2 variables
fits[[2]] <- update(fits[[1]], formula = PV1READ ˜ JOYREAD + PISADIFF
+ SCREADDIFF + SCREADCOMP)
# model 3 variables
fits[[3]] <- update(fits[[1]], formula = PV1READ ˜ METASUM + GFOFAIL
+ MASTGOAL + SWBP + WORKMAST + ADAPTIVITY + COMPETE)
# model 4 variables
fits[[4]] <- update(fits[[1]], formula = PV1READ ˜ PERFEED + TEACHINT
+ BELONG)
Next we compute the ELPD and LOO-IC for each model.
loo_list <- lapply(fits, loo, cores = 4)
Next we compute the stacking, pseudobma, and pseudobma+ weights.
wtsStacking <- loo_model_weights(loo_list, method = "stacking")

print(wtsStacking)
wtsPseudoBMA <- loo_model_weights(loo_list, method = "pseudobma",

BB=FALSE)
print(wtsPseudoBMA)
wtsPseudoBMABB <- loo_model_weights(loo_list, method = "pseudobma")

print(wtsPseudoBMABB)
With the weights in hand, we next produce a weighted combination of the posterior
predictions under each type of weight. The command posterior predict comes
from rstanarm.
n_draws <- nrow(as.matrix(fits[[1]]))

ypredStacking <- matrix(NA, nrow = n_draws, ncol = nobs(fits[[1]]))
for (d in 1:n_draws) {
k <- sample(1:length(wtsStacking), size = 1, prob = wtsStacking)
ypredStacking[d, ] <- posterior_predict(fits[[k]], draws = 1)
}

ypredPseudoBMA<- matrix(NA, nrow = n_draws, ncol = nobs(fits[[1]]))
k <- sample(1:length(wtsPseudoBMA), size = 1, prob = wtsPseudoBMA)
ypredPseudoBMA[d, ] <- posterior_predict(fits[[k]], draws = 1)
}

ypredPseudoBMABB<- matrix(NA, nrow = n_draws, ncol = nobs(fits[[1]]))

k <- sample(1:length(wtsPseudoBMABB), size = 1,
prob = wtsPseudoBMABB)
ypredPseudoBMABB[d, ] <- posterior_predict(fits[[k]], draws = 1)
}
We now have the predicted values which we can compare to the actual reading
scores using the KLD scoring rule in the following code.
ypredStacking <- colMeans(ypredStacking)

kld1 <- KLD(ypredStacking,PV1READ)$sum.KLD.py.px
KLD(ypredStacking,PV1READ)
ypredPseudoBMA <- colMeans(ypredPseudoBMA)

kld2 <- KLD(ypredPseudoBMA,PV1READ)$sum.KLD.py.px
KLD(ypredPseudoBMA,PV1READ)
ypredPseudoBMABB <- colMeans(ypredPseudoBMABB)

kld3 <- KLD(ypredPseudoBMABB,PV1READ)$sum.KLD.py.px
KLD(ypredPseudoBMABB,PV1READ)
Table 11.6 below presents the results for Bayesian stacking with different
choices of weights.
TABLE 11.6. Log-score stacking, pseudo-BMA, and pseudo-BMA+ weights along with
LOO-IC and Kullback-Leibler divergence
Stacking Pseudo-BMA Pseudo-BMA+ LOO-IC

Model 1 0.001 0.000 0.000 48277.66
Model 2 0.594 1.000 0.995 47644.65
Model 3 0.405 0.000 0.005 47860.30
Model 4 0.000 0.000 0.000 48757.57
KLD 0.016 0.018 0.018
We find that Model 2, which includes predictors related attitudes and behaviors
directed toward reading, has the highest weight regardless of how the weights
were calculated. We find that pseudo-BMA and pseudo-BMA+ place almost all of
the weight on Model 2 whereas the stacking weights based on the log predictive
score are somewhat more spread out, with Model 3 having the next highest weight.
We also find that the Model 2 has the lowest LOO-IC value.
The bottom row of Table 11.6 presents the KLD measures obtained from com-
paring the distribution of predicted reading scores to the observed reading scores
for each method of obtaining weights. Keep in mind that the predicted distribution
under stacking is based on mixing the predicted distributions from the different
models with mixing proportions equal to the weights. Here we find that the lowest
KLD value is obtained under the log-score stacking weights. Overall, we find that
stacking using LOO weights provides overall the best predictive performance. It
may be interesting to note that the KLD values for the BMA results are uniformly
lower compared to the KLD values in Table 11.6 although it needs to be reiterated
that BMA assumes an M-closed framework.
11.6 Summary
Although the orientation of this chapter was focused on Bayesian methods for
quantifying model uncertainty, it should be pointed out that issues of model un-
certainty and model averaging have been addressed within the frequentist domain.
The topic of frequentist model averaging (FMA) has been covered extensively in
Hjort and Claeskens (2003), Claeskens and Hjort (2008), and Fletcher (2018). Our
focus on Bayesian model averaging is based on some important advantages over
FMA. As noted by Steel (2020), (1) BMA is optimal (under M-closed) in terms of
prediction as measured by the log predictive density score; (2) BMA is easier to
implement in situations where the model space is large due to very fast algorithms
such as MC3 ; (3) BMA naturally leads to substantively valuable interpretations of
posterior model probabilities and posterior inclusion probabilities; and (4) in the
majority of content domains wherein model averaging is required, BMA is more
frequently used than FMA.
12
Closing Thoughts
Throughout this book, the applications have primarily focused on regression-

based models applied to large-scale educational assessments. For each example, I
laid out a set of steps that could be described as a workflow, but I didn’t name it as
such. In recent years, a number of writers have provided examples of workflows
or checklists for Bayesian analysis. Among these are Bayesian Workflow (2020),
Schad et al. (2019) and Depaoli and van de Schoot (2017). In the following section,
I provide my version of a Bayesian workflow applicable to regression-based mod-
eling commonly encountered in education and the social sciences. I will also relate
each step of the workflow to specific chapters in the book. Following this work-
flow, I close the book with a summary of the advantages of adopting a Bayesian
framework to social science research.
12.1 A Bayesian Workflow for the Social Sciences

In this section we describe a possible workflow for a Bayesian analysis of large-
scale educational data utilizing non-informative or weakly informative priors. For
the purposes of this chapter, I proposed a workflow roughly consistent with what
appeared in the examples throughout this book, but I believe this workflow would
be more or less applicable to other research and data analysis settings. The steps
of our workflow are as follows:
1. Specify the outcome and set of predictors of interest, taking special care to
note the assumptions regarding the distribution of the outcome, for example,
is the outcome assumed to be normally distributed, or does the outcome
perhaps follow some type of non-normal distribution such as the logistic or
Poisson distribution. Specifying simple Bayesian models for the moments
of the distribution (e.g., mean and variance) and examining the sensitivity
of different prior choices can be quite useful and provide a sense of the
probability model that generated the outcome. An interesting example of
specifying slightly different probability models for the same outcome was
give in Chapter 5 in our consideration of the Poisson and negative binomial
regression model.
217
2. Specify the functional form of the relationship between the outcome and the
predictors. For the social sciences, this will most likely be a type of linear or
generalized linear model, but more complex models are, of course, possible.
Chapter 8, for example, discusses Bayesian methods for continuous and
categorical latent variables. As an aside, it is important to note that there
may be more than one model that could have plausibly generated the data.
Keeping the problem of model uncertainty in the back of one’s mind is quite
important, depending on the goals of the analysis.
3. Take note of the complexities of the data structure, for example, are the data
generated from a clustered sampling design? Are there sampling weights?
Accounting for the complexities of the data structure can be handled by
careful specification of a Bayesian hierarchical model, and this was discussed
in Chapter 7 with examples involving multilevel modeling. This book did
not cover the use of sampling weights, but these can be easily incorporated
in Stan-based programs such as rstanarm (Goodrich et al., 2022) and brms
(Bürkner, 2021).
4. Decide on the prior distributions for all parameters in the model. As dis-
cussed in Chapter 2, these priors will be either non-informative, weakly
informative, informative, or a mix of all three. In addition, the goals of an
analysis might be to induce sparsity and therefore the choices of shrink-
age priors discussed in Chapter 10 are available. Because our examples
were based on large-sample data, we found little impact of prior choice
on posterior results, but of course this will not always be the case. Thus,
an important activity at this step is to generate data according to the prior
distributions and gauge the sensibility and sensitivity of the priors — so
called prior predictive checking, discussed in Chapter 6. Finally, in the spirit
of research transparency, the origin of all priors must be communicated to
the research community.
5. After running the analysis, it is essential to check the convergence criteria
of the algorithm. The basics of Bayesian computation, along with conver-
gence criteria, were discussed in Chapter 4. Note that results cannot be
communicated unless there is overwhelming evidence from a variety of di-
agnostics that the algorithm converged. There are instances where there
may be contradictory evidence of convergence. For example, trace plots
may appear fine, but Rhat values may be somewhat problematic. All at-
tempts should be made to improve these diagnostics before communicating
the results. In most cases, if the effective sample size and Rhat values are
reasonable, then one can proceed with communicating the results. This
is because these diagnostics together capture autocorrelation, mixing, and
trend in the iterations.
6. Given evidence of convergence, and with the results in hand, posterior
predictive checking is a necessary step in the Bayesian workflow. Posterior
predictive checks can be set up to gauge overall model fit, but depending on
the goals of the model, specific posterior predictive checks can be provided
regarding the fit of specific aspects of the posterior predictive distribution,
such as checks on the fit of the model to extreme values.
Closing Thoughts 219
7. Following posterior predictive checks, a full description of the posterior

distributions of the model parameters should be provided, including the
mean, standard deviation, and posterior intervals of interest. Additional
posterior intervals of substantive interest should be provided, such as the
probability that the effect is greater than zero, or the probability that the effect
lies between two values of substantive importance. Examples of exploring
the entire posterior distribution of an outcome were given in Chapter 4
and Chapter 5. Again, this aspect of the analysis will be decided based on
substantive considerations.
8. Sensitivity analyses should be conducted, examining the impact of the choice
of priors on the substantive results. Other choices of priors can include sim-
ply comparing the findings to the case where all priors are non-informative,
or to the case where very small changes to the mean and variance of the
prior distributions are made. Sensitivity analyses were provided in Chapter
5. Note that with large sample sizes such as those encountered in our ex-
amples, it is likely that results will be robust to reasonable alternative prior
distributions.
9. At this point, the analyst may wish to engage in model selection. As a general
point, model selection should be avoided, or should proceed with great
caution. As discussed in Chapter 6, one may wish to use the Bayes factor
for model comparison, understanding that it only provides an assessment
of the relative evidence in the data for the models being compared. The
Bayes factor should, in general, not be used for model selection because it
is silent as to the intended purposes of the model. From a pragmatic point
of view, it is preferable to further explore the model with detailed posterior
predictive checks. The literature also suggests that the BIC should, perhaps,
be avoided altogether for the purposes of model selection.
10. Finally, it may be important to examine model uncertainty, and this issue
was addressed in Chapter 11. Addressing model uncertainty is particularly
crucial if the goal of an analysis is to develop a model with optimal predic-
tive performance. One might also wish to investigate the extent of model
uncertainty if the analyst is specifying a number of different models. That
is, beyond model selection using the DIC, WAIC, or LOO-IC discussed in
Chapter 6, the search for a model is undertaken under pervasive uncertainty
and it might be useful to gauge the extent of model uncertainty through a
method such as BMA. However, again, if BMA is being used specifically
to develop an optimally predictive model, then it is important to recall that
conventional BMA takes place under the M-closed framework wherein a
true model is assumed to be in the space of models searched by the algo-
rithm. The approach of Bayesian stacking, also discussed in Chapter 11,
represents a viable alternative when M-closed is not assumed to hold, but
rather a list of reasonable alternative models are to be combined to obtain
an optimal mix of predictive distributions.
12.2 Summarizing the Bayesian Advantage

The 10-point workflow offered in the previous section is not designed to be a
checklist per se, but more of a framework for practical Bayesian modeling in the
social sciences. Nevertheless, it is still reasonable to ask what are the advantages to
adopting a Bayesian perspective statistical modeling in the social sciences. A nice
summary of the distinctions between Bayesian and frequentist statistical inference
can be found in Wagenmakers, Lee, Lodewyckx, and Iverson (2008). Here, I select
from their paper some of the more fundamental differences that are in line with
the organization of the book.
12.2.1 Coherence
The rules of probability, as described in Chapter 1, and manifested in Bayes’
theorem, cohere in such a way as to be internally consistent, providing only one
method for obtaining an answer. The rules of probability along with Bayes’
theorem are coherent because they align with axioms of rational decision making.
Practically speaking, coherent probability theory allows one to avoid Dutch book
bets.
12.2.2 Conditioning on Observed Data

An important distinction between the Bayesian and frequentist schools of statistics
concerns the role that the observed data play in parameter estimation. Specifically,
frequentist inference is based on a set of pre-experimental/observational assump-
tions about the behavior of an estimator under (infinite) repeated sampling of
the same experiment or observation under exactly the same conditions. These
assumptions are embedded in conventional practice around concerns regarding
sample sizes required for the results of statistical methods to be trustworthy.
Rules of thumb have been provided for a variety of different procedures com-
monly used in the social sciences. These rules of thumb have been developed
under frequentist procedures examining the behavior of estimators, and partic-
ularly their standard errors, under different sample size conditions and under a
null hypothesis of a true model for an infinite sample size. Thus, in the frequentist
practice of statistics, we warrant our conclusions about the validity of our findings
from pre-experimental/observational assumptions about the asymptotic behavior
of the estimator.
The Validity of Inferences Drawn from Bayesian analyses do not rely on
claims regarding the asymptotic behavior of an estimator before any data are col-
lected. Rather, Bayesian inference is based entirely on the posterior distribution
of the parameters conditioning on the data in hand via Bayes’ theorem, namely,
p(θ | y) ∝ p(y | θ)p(θ). Notice that there is no appeal to pre-data asymptotics to
support inferences about θ. This issue of conditioning on observed data is par-
ticularly important when considering null hypothesis significance testing, as we
discussed in Chapter 6. Moreover, as pointed out by Kadane (2011), considering
that inferences are being drawn based on data that were not observed, this is a
violation of the likelihood principle. Again, by contrast, Bayesian theory concerns
Closing Thoughts 221
itself with inferences based on the data in hand, not on inferences based on data
that have never been observed. Finally, as discussed at length in Chapter 6, the
goal of hypothesis testing in the Bayesian framework is not to make statements
in support or refutation of a null hypothesis, but rather to fully summarize the
distribution of the parameters of interest and to examine the predictive quality
of a proposed model. Posterior predictive checking provides a way of probing
whether a model can predict data that actually have occurred, and falls squarely
into neo-Popperian theory as noted by Gelman and Shalizi (2013) and discussed
in the Preface.
12.2.3 Quantifying Evidence

Returning to the use of the frequentist p-value, recall that the frequentist frame-
work uses the p-value as a means of quantifying evidence against a null hypothesis.
Well-known guidelines have been established over the decades for establishing
evidence against the null hypothesis, with the ubiquitous p < 0.05 having become
the accepted standard in the social sciences. As pointed out by Wagenmakers
et al. (2008), the problem with this approach is that if the p-value is a measure
of evidence, then it should be applied equally across all sample sizes. This is
referred to as the p-value postulate. That is, a study that achieved a p-value of,
say 0.05, based on a sample size of 10, constitutes just as much evidence against
the null hypothesis as a study based on a sample size of 100. However, there
is disagreement among applied researchers in the social sciences regarding this
claim.1 Indeed, there are those who believe that a significant p-value (say 0.05)
based on 10 observations provides stronger evidence against the null hypothesis,
than when based on 100 observations. Still others believe just the opposite (see
Royall, 1986, for a more detailed discussion), though again, we must recognize
concerns raised about Bayes factors as discussed in Chapter 6.
The inconsistency regarding the p-value postulate stems from how “evidence”
is defined. Specifically, a Bayesian perspective rejects the p-value postulate be-
cause the orientation of Bayesian hypothesis testing is to assign probabilities to
proposed hypotheses. Taking two hypotheses, we assign prior probabilities to
these hypotheses and accept the hypothesis with the higher posterior probability.
An equivalent summary of this approach is based on Bayes factors discussed in
Chapter 6 (see also Jeffreys, 1961; Kass & Raftery, 1995).
12.2.4 Validity
It has been argued by Gigerenzer et al. (2004) that Bayesian statistics provides in-
ferences that the analyst actually cares about. Namely, researchers wish, and often
report, results with respect to their hypotheses of interest. In other words, the ana-
lyst wishes to make statements about the probability of their particular hypothesis
of interest. However, the Neyman-Pearson framework with its requirement of
setting the probability of a Type I error ahead of collecting data, and the Fisherian
1
However, as pointed out by Wagenmakers et al. (2008), Fisher himself held that the p-value
postulate was correct.
framework of interpreting the p-value as the strength of evidence against the null
hypothesis, both preclude this wish. The analyst’s wish can come true, however,
using a Bayesian framework because it provides probability assessments of the
scientific hypothesis actually under consideration. Indeed, access to the posterior
distribution of the parameter(s) of interest provide a much richer and more nu-
anced description of the research question of interest than the relatively artificial
dichotomization of the research result into “significant” or “non-significant.”
12.2.5 Flexibility in Handling Complex Data Structures

Bayesian hierarchical models offer a flexible approach to the analysis of complex
data structures, such as clustered samples. Moreover, though not discussed in
Chapter 10, Bayesian approaches to sparsity allow for research studies in which
the number of observations (n) is less than the number of variables (p). Finally,
Bayesian stacking methods allow for the predictive distributions of very different
and complex models to be combined so as to optimize prediction (see Chapter 11).
12.2.6 Formally Quantifying Uncertainty

Perhaps the singular advantage of the Bayesian school of statistics is that it pro-
vides a way to quantify our uncertainty about model parameters and models
themselves. The frequentist school treats each study as if it is the first of its kind,
and that no prior information is available on the topic at hand. Moreover, the
frequentist approach to modeling treats models as fixed and ignores the uncer-
tainty inherent in the selection of a model. However, even a casual consideration
of standard frequentist practice reveals that the assumption of fixed models is
manifestly untrue. Perhaps the most obvious example within frequentist practice
is the choice of variables to be included in a model. This choice is almost certainly
made on the basis of prior information; but given that there are likely alternative
interpretations of that prior information, the uncertainty in the choice is not made
explicit. The Bayesian school, in stark contrast, incorporates prior knowledge into
an analysis both at the level of the parameters through their prior distributions
(see Chapter 2) as well as at the level of the models through model averaging or
stacking (see Chapter 11). Unlike the frequentist framework, the transparency in
specifying parameters and models is open to scrutiny by the scientific community
and provides an immediate assessment of the analyst’s view of the degree of uncer-
tainty entering into his/her parameters and models. Bayesian theory also implies
that not accounting for parameter and/or model uncertainty can manifest itself in
poorer predictive performance of models than if this uncertainty is accounted for
(see Chapter 11). For this reason, and the others discussed in this chapter, a strong
case can be made for a general shift toward Bayesian inference as the standard for
methodological practice in the social sciences.
List of Abbreviations and Acronyms
ACF Autocorrelation function. 54

ANOVA Analysis of variance. 128
BB Bayesian bootstrap. 175

BBPMM Bayesian bootstrap predictive mean matching. 175
BD Birth/Death. 203
BF Bayes factor. 113
BIC Bayesian information criterion. 104
BMA Bayesian model averaging. 194
CFA Confirmatory factor anaysis. 143
DA Data augmentation. 171

df Degrees-of-freedom. 44
DGP Data-generating process. 118
DIC Deviance information criterion. 104
ELBO Evidence lower bound. 67

elpd Expected log predictive density. 119
EM Expectation-maximization. 158
FMA Frequentist model averaging. 216
HMC Hamiltonian Monte Carlo. 51

HPD Highest posterior density. 58
ICC Intra-class correlation. 128

IEA International Association for the Evaluation of Educational Achievement. xi
IG Inverse-gamma. 35
223
iid Independently and identically distributed. 16

IW Inverse-Wishart. 44
KLD Kullback-Leibler divergence. 67
LASSO Least absolute shrinkage and selection operator. 183

LCA Latent class analysis. 143
LKJ Lewandowski, Kurowicka, & Joe. 45
LOO-CV Leave-one-out cross-validation. 120
LOO-IC Leave-one-out cross-validation information criterion. 104
M-H Metropolis-Hastings. 49
MAR Missing at random. 166
MC3 Markov chain Monte Carlo model composition. 199
MCAR Missing completely at random. 166
MCMC Markov chain Monte Carlo. 21
n eff Effective sample size. 54

NHST Null hypothesis significance testing. 101
NMAR Not missing at random. 166
NUTS No-U-Turn Sampler. 52
OECD Organization for Economic Cooperation and Development. ix
PIRLS Progress in International Reading and Literacy Survey. xi

PISA Program for International Student Assessment. x
PMP Posterior model probability. 196
PPp Posterior predictive p-value. 107
PPC Posterior predictive checking. 107
PSIS Pareto-smoothed importance sampling. 69
rhat Potential scale reduction factor. 55
SEM Structural equation modeling. 143
VB Variational Bayes. 66
WAIC Widely applicable information criterion. 104

References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood

principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium
on information theory. Budapest: Akademiai Kiado.
Allen, D. M. (1974). The relationship between variable selection and data agu-
mentation and a method for prediction. Technometrics, 16, 125–127.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances.
by the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to
John Canton, A. M., F. R. S. Philosophical Transactions of the Royal Society of
London, 53, 370-418.
Bayesian workflow. (2020). arXiv.
Berger, J. (2006). The case for objective Bayesian analysis. Bayesian Analysis, 3,
385–402.
Berger, J. (2013). Statistical decision theory and Bayesian analysis. Springer.
Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference
(with discussion). Journal of the Royal Statistical Society, Series B, 41, 113-147.
Bernardo, J. M. (1996). The concept of exchangeability and its applications. Far
East Journal of Mathematical Sciences, 4, 111-121.
Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. Wiley.
Betancourt, M. (2018a). Bayes sparse regression. (https://betanalpha.github.io/
assets/case studies/bayes sparse regression.html)
Betancourt, M. (2018b). A conceptual introduction to Hamiltonian Monte Carlo.
https://arxiv.org/pdf/1701.02434.pdf
Betancourt, M. (2018c). Identity crisis. (https://betanalpha.github.io/assets/
case studies/identifiability.html#1 Identifiability)
Betancourt, M. (2019). https://betanalpha.github.io/assets/case studies/
probabilistic computation.html#2 concentration of measure and
the typical set.
Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the
American Statistical Association, 57(298), 269–306.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational infer-
ence: A review for statisticians. Journal of the American Statistical Asso-
ciation, 112, 859–877. https://www.tandfonline.com/doi/abs/10.1080/
01621459.2017.1285773?journalCode=uasa20
Bollen, K. A., & Curran, P. J. (2006). Latent curve models : A structural equation
perspective. Wiley.
Box, G., & Tiao, G. (1973). Bayesian inference in statistical analysis. Addison-Wesley.
225
226 References
Box, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and

robustness. Journal of the Royal Statistical Society. Series A (General), 143,
383–430.
Box, G. E. P. (1983). An apology for ecumenism in statistics. In G. E. P. Box,
T. Leonard, & C.-F. Wu (Eds.), Scientific inference, data analysis, and robustness
(pp. 51–84). Academic Press.
Breiman, L. (1996). Stacked regressions. Machine Learning, 24, 49-64.
Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence
of iterative simulations. Journal of Computational and Graphical Statistics, 7,
434–455.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference:
A practical information-theoretic approach (2nd ed.). Springer.
Burstein, L. (1980). The analysis of multilevel data in educational research and
evaluation. Review of Research in Education, 8, 158–233.
Bürkner, P.-C. (2021). Bayesian item response modeling in R with brms and Stan.
Journal of Statistical Software, 100, 1–54.
Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for
sparse signals. Biometrika, 97, 465-480.
Claeskens, G., & Hjort, N. L. (2008). Model selection and model averaging. Cambridge
University Press.
Clarke, B. S., & Clarke, J. L. (2018). Predictive statistics: Analysis and inference beyond
models. Cambridge University Press.
Clogg, C. C. (1995). Latent class models. In G. Arminger, C. C. Clogg, & M. E. Sobel
(Eds.), Handbook of statistical modeling in the social and behavioral sciences (pp.
81–110). Jossey-Bass.
Clyde, M. A. (1999). Bayesian model averaging and model search strategies (with
discussion). In J. M. Bernardo, A. P. Dawid, J. O. Berger, & A. F. M. Smith
(Eds.), Bayesian statistics, 6 (pp. 157–185). Oxford University Press.
Clyde, M. A. (2003). Model averaging. In Subjective and objective Bayesian statistics
(pp. 320–335). Wiley-Interscience.
Clyde, M. A. (2017). BAS: Bayesian adaptive sampling for bayesian model aver-
aging [Computer software manual]. (R package version 1.4.7)
Clyde, M. A., & George, E. I. (2004). Model uncertainty. Statistical Science, 19,
81–94.
Clyde, M. A., & Iversen, E. S. (2013). Bayesian model averaging in the M-open
framework. In Bayesian theory and applications (pp. 483–498). Oxford Uni-
versity Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Dang, K.-D., & Maestrini, L. (2022). Fitting structural equation models via vari-
ational approximations. Structural Equation Modeling: A Multidisciplinary
Journal, 29, 1-15.
Dawid, A. P. (1982). The well-calibrated Bayesian. Journal of the American Statistical
Association, 77, 605–610.
Dawid, A. P. (1984). Statistical theory: The prequential approach. Journal of the
Royal Statistical Society, Series A, 147, 278–202.
de Finetti, B. (1962). Does it make sense to speak of good probability appraisers.
In I. J. Good (Ed.), The scientist speculates – A anthology of partly-baked ideas
References 227
(pp. 357–364). Heinemann.

de Finetti, B. (1974). Theory of probability, vols. 1 and 2. Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the em algorithm (with discussion). Journal of the Royal
Statistical Society, Series B, 39, 1-38.
Depaoli, S. (2021). Bayesian structural equation modeling. Guilford Press.
Depaoli, S., Kaplan, D., & Winter, S. (2023). Foundations and extensions of
Bayesian structural equation modeling. In R. Hoyle (Ed.), Handbook of struc-
tural equation modeling (2nd ed., pp. 701–721). Guilford Press.
Depaoli, S., & van de Schoot, R. (2017). Improving transparency and replication
in Bayesian statistics: The WAMBS-Checklist. Psychological Methods, 22,
240–261.
DeWitt, M. (2018). Research and methods resources. https://medewitt.github.io/
resources/
Draper, D. (1995). Assessment and propagation of model uncertainty (with dis-
cussion). Journal of the Royal Statistical Society (Series B), 57, 55–98.
Draper, D., Hodges, J. S., Leamer, E. E., Morris, C. N., , & Rubin, D. B. (1987). A
research agenda for assessment and propagation of model uncertainty (Tech. Rep.).
Rand Corporation.
Edwards, A. W. F. (1992). Likelihood. Johns Hopkins University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jacknife. Annals of
Statistics, 7, 1–26.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Chapman & Hall.
Eicher, T. S., Papageorgiou, C., & Raftery, A. E. (2011). Default priors and predic-
tive performance in Bayesian model averaging, with application to growth
determinants. Journal of Applied Econometrics, 26(1), 30–55.
Enders, C. K. (2022). Applied missing data analysis (2nd ed.). Guilford Press.
Evans, M., Hastings, N. A. J., & Peacock, J. B. (2000). Statistical distributions (3rd
ed.). Wiley.
Everitt, B. S., & Hothorn, T. (2012). MVA: An introduction to applied multivari-
ate analysis with r [Computer software manual]. http://CRAN.R-project
.org/package=MVA (R package version 1.0-3)
Feldkircher, M., & Zeugner, S. (2009). Benchmark priors revisited: On adaptive
shrinkage and the supermodel effect in Bayesian model averaging (No. 9-202).
International Monetary Fund.
Fernández, C., Ley, E., & Steel, M. F. J. (2001a). Benchmark priors for Bayesian
model averaging. Journal of Econometrics, 100, 381–427.
Fernández, C., Ley, E., & Steel, M. F. J. (2001b). Model uncertainty in cross-country
growth regressions. Journal of Applied Econometrics, 16, 563–576.
Fishburn, P. C. (1986). The axioms of subjective probability. Statistical Science, 1,
335-345.
Fisher, R. A. (1941/1925). Statistical methods for research workers (84th ed.). Oliver &
Boyd.
Fletcher, D. (2018). Model averaging. Springer.
Fox, J. (2008). Applied regression analysis and generalized linear models (2nd ed.).
Sage.
228 References
Furnival, G. M., & Wilson, Jr., R. W. (1974). Regressions by leaps and bounds.
Technometrics, 16, 499–511.
Geisser, S., & Eddy, W. F. (1979). A predictive approach to model selection. Journal
of the American Statistical Association, 74, 153–160.
Gelfand, A. (1996). Model determination using sampling-based methods. In
W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain monte
carlo in practice (pp. 145–161). Chapman & Hall.
Gelman, A. (1996). Inference and monitoring convergence. In W. R. Gilks,
S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in
practice (pp. 131–143). Chapman & Hall.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical
models. Bayesian Analysis, 1, 515–533.
Gelman, A. (2013). Understanding posterior p-values. Electronic Journal of Statis-
tics.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D., Vehatari, A., & Rubin, D. B.
(2014). Bayesian data analysis (3rd ed.). Chapman & Hall.
Gelman, A., & Hill, J. (2003). Data analysis using regression and multilevel/hierarchical
models. Cambridge University Press.
Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information
criteria for Bayesian models. Statistics and Computing, 24, 997–1016.
Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment
of model fitness via realized discrepancies: With commentary. Statistical
Science, 6, 733–807.
Gelman, A., & Rubin, D. B. (1992a). Inference from iterative simulation using
multiple sequences. Statistical Science, 7, 457–511.
Gelman, A., & Rubin, D. B. (1992b). A single series from the Gibbs sampler
provides a false sense of security. In J. M. Bernardo, J. O. Berger, A. P. Dawid,
& A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 625–631). Oxford University
Press.
Gelman, A., & Rubin, D. B. (1995). Avoiding model selection in Bayesian social
research. Sociological Methodology, 25, 165–173.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian
statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.
Gelman, A., Simpson, D., & Betancourt, M. (2017). The prior can often only be
understood in the context of the likelihood. Entropy, 19.
Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., . . .
Modrák, M. (2020). Bayesian workflow. arXiv.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and
the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6, 721–741.
George, E., & Foster, D. (2000). Calibration and empirical Bayes variable selection.
Biometrika, 87, 731–747.
George, E. I., & McCulloch, R. E. (1993). Variable selection via gibbs sampling.
Journal of the American Statistical Association, 88, 881–889.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you
always wanted to know about significance testing but were afraid to ask.
References 229
In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social
sciences (pp. 391–408). Sage.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996a). Introducing Markov
chain Monte Carlo. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.),
Markov chain Monte Carlo in practice (pp. 1–19). Chapman & Hall.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996b). Markov chain
Monte Carlo in practice. Chapman & Hall.
Gneiting, T., & Raftery, A. (2007). Strictly proper scoring rules, prediction, and
estimation. Journal of the American Statistical Association, 102, 359–378.
Goldstein, H. (2011). Multilevel statistical models (4th ed.). Wiley.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society. Series B
(Methodological), 14, 107–114.
Goodrich, B., Gabry, J., Ali, I., & Brilleman, S. (2022). rstanarm: Bayesian applied
regression modeling via Stan. https://mc-stan.org/rstanarm/
Haig, B. D. (2018). The philosophy of quantitative methods: Understanding statistics.
Oxford University Press.
Hanea, A., Nane, G., Bedford, T., & French, S. (2021). Expert judgement in risk and
decision analysis. Springer Nature.
Hannan, E. J., & Quinn, B. G. (1979). The determination of the order of an
autoregression. Journal of the Royal Statistical Society. Series B (Methodological),
41(2), 190–195.
Hansen, M. H., & Yu, B. (2001). Model selection and the principle of minimum
description length. Journal of the American Statistical Association, 96, 746–774.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What if there were no significance
tests? Erlbaum.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning.
Springer.
Heckman, J. J., & Kautz, T. (2012). Hard evidence on soft skills. Labour Economics,
19, 451–464.
Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E.-J. (2020). A
conceptual introduction to Bayesian model averaging. Advances in Methods
and Practices in Psychological Science, 3, 200-215.
Hjort, N. L., & Claeskens, G. (2003). Frequentist model average estimators. Journal
of the American Statistical Association, 98, 879–899.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1), 55–67.
Hoerl, R. W. (1985). Ridge analysis 25 years later. The American Statistician, 39(3),
186–192.
Hoeting, J. A., Madigan, D., Raftery, A., & Volinsky, C. T. (1999). Bayesian model
averaging: A tutorial. Statistical Science, 14, 382–417.
Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: Adaptively
setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning
Research, 15, 1593-1623. http://jmlr.org/papers/v15/hoffman14a.html
Honaker, J., & King, G. (2010). What to do about missing values in time-series
cross-section data. American Journal of Political Science, 54, 561–581.
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing
data. Journal of Statistical Software, 45(7), 1–47. http://www.jstatsoft.org/
230 References
v45/i07/
Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach. Open
Court.
Hsiang, T. C. (1975). A Bayesian view on ridge regression. Journal of the Royal
Statistical Society. Series D (The Statistician), 24, 267–268.
Jackman, S. (2009). Bayesian analysis for the social sciences. Wiley.
Jackman, S. (2012). pscl: Classes and methods for R developed in the political
science computational laboratory [Computer software manual]. http://
github.com/atahk/pscl
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford University Press.
Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1999). An introduction to
variational methods for graphical models. Machine Learning, 37, 183-233.
doi: 10.1023/A:1007665907178
Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis.
Psychometrika, 32, 443-482.
Jose, V. R. R., Nau, R. F., & Winkler, R. L. (2008). Scoring rules, generalized entropy,
and utility maximization. Operations Research, 56, 1146–1157.
Kadane, J. B. (2011). Principles of uncertainty. Chapman & Hall/CRC Press.
Kaplan, D. (1995). The impact of BIB spiraling-induced missing data patterns on
goodness-of-fit tests in factor analysis. Journal of Educational and Behavioral
Statistics, 20, 69-82.
Kaplan, D. (2000). Structural equation modeling: Foundations and extensions. Sage.
Kaplan, D. (2004). The SAGE handbook of quantitative methodology for the social
sciences. Sage.
Kaplan, D. (2008). An overview of markov chain methods for the study of stage-
sequential developmental processes. Developmental Psychology, 44, 457–467.
Kaplan, D. (2009). Structural equation modeling: Foundations and extensions. (2nd
ed.). Sage.
Kaplan, D. (2021). On the quantification of model uncertainty: A Bayesian Per-
spective. Psychometrika, 86, 215–238.
Kaplan, D., & Chen, J. (2014). Bayesian model averaging for propensity score
analysis. Multivariate Behavioral Research, 49, 505–517.
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In
R. Hoyle (Ed.), Handbook of structural equation modeling (pp. 650–673). Guil-
ford Press.
Kaplan, D., & Huang, M. (2021). Bayesian probabilistic forecasting with large-scale
educational trend data: a case study using NAEP. Large-scale Assessments in
Education, 9.
Kaplan, D., & Kuger, S. (2016). The methodology of PISA: Past, present, and
future. In S. Kuger, E. Klieme, N. Jude, & D. Kaplan (Eds.), Assessing contexts
of learning world-wide – Extended context assessment frameworks. Springer.
Kaplan, D., & Lee, C. (2016). Bayesian model averaging over directed acyclic
graphs with implications for the predictive performance of structural equa-
tion models. Structural Equation Modeling, 23, 343–353.
Kaplan, D., & Su, D. (2016). On matrix sampling and imputation of context
questionnaires with implications for the generation of plausible values in
References 231
large-scale assessments. Journal of Educational and Behavioral Statistics, 41(1),

57–80.
Kaplan, D., & Yavuz, S. (2019). An approach to addressing multiple imputation
model uncertainty using Bayesian model averaging. Multivariate Behavioral
Research, 23, 343–353.
Kass, R. E., & Raftery, A. (1995). Bayes factors. Journal of the American Statistical
Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal
rules. Journal of the American Statistical Association, 91, 1343–1370.
Kolmogorov, A. N. (1956). Foundations of the theory of probability (2nd ed.). Chelsea.
Kuger, S., Klieme, E., Jude, N., & Kaplan, D. (2016). Assessing contexts of learning:
An international perspective. Springer.
Lancaster, T. (2004). An introduction to modern Bayesian econometrics. Blackwell.
Laplace, P. S. (1774/1951). Essai philosophique sur les probabilities. Dover.
Lawley, D. N., & Maxwell, A. (1971). Factor analysis as a statistical method. Butter-
worth.
Le, T., & Clarke, B. (2017). A Bayes interpretation of stacking for M-complete and
M-open settings. Bayesian Analysis, 12, 807–829.
Leamer, E. E. (1978). Specification searches: Ad hoc inference with nonexperimental
data. Wiley.
Lewandowski, D., Kurowicka, D., & Joe, H. (2009). Generating random correlation
matrices based on vines and extended onion method. Journal of Multivariate
Analysis, 100, 1989-2001.
Ley, E., & Steel, M. F. J. (2009). On the effect of prior assumptions in bayesian
model averaging with applications to growth regression. Journal of Applied
Econometrics, 24, 651-674.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. (2008). Mixtures of
g-priors for Bayesian variable selection. Journal of the American Statistical
Ligges, U., & Mächler, M. (2003). Scatterplot3d - an r package for visualizing
multivariate data. Journal of Statistical Software, 8(11), 1–20.
Lindley, D. V. (1991). Making decisions (2nd ed.). Wiley.
Lindley, D. V. (2007). Understanding uncertainty. Wiley.
Linzer, D. A., & Lewis, J. B. (2011). poLCA: An R package for polytomous
variable latent class analysis. Journal of Statistical Software, 42, 1–29. https://
www.jstatsoft.org/v42/i10/
Little, R. J. A., & Rubin, D. B. (2020). Statistical analysis with missing data (3rd. ed.).
Wiley.
Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evo-
lution, critique and future directions (with discussion). Statistics in Medicine,
28.
Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social
scientists. Springer.
Madigan, D., & Raftery, A. (1994). Model selection and accounting for model un-
certainly in graphical models using Occam’s window. Journal of the American
Statistical Association, 89, 1535–1546.
232 References
Madigan, D., & York, J. (1995). Bayesian graphical models for discrete data.
International Statistical Review, 63, 215–232.
Makowski, D., Ben-Shachar, M. S., & Lüdecke, D. (2019). bayestestr: Describing
effects and their uncertainty, existence and significance within the Bayesian
framework. Journal of Open Source Software, 4, 1541.
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov chain
monte carlo in R. Journal of Statistical Software, 42, 22.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman
& Hall/CRC.
Meinfelder, F. (2011). BaBooN: Bayesian bootstrap predictive mean matching –
multiple and single imputation for discrete data [Computer software man-
ual]. https://cran.r-project.org/src/contrib/Archive/BaBooN/
Mengersen, K. L., Robery, C. P., & Guihenneuc-Jouyax, C. (1999). MCMC conver-
gence diagnostics: A review. Bayesian Statistics, 6, 415–440.
Merkle, E. C., Fitzsimmons, E., Uanhoro, J., & Goodrich, B. (2021). Efficient
Bayesian structural equation modeling in Stan. Journal of Statistical Software,
100, 1-22.
Merkle, E. C., & Steyvers, M. (2013). Choosing a strictly proper scoring rule.
Decision Analysis, 10, 292–304.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E.
(1953). Equations of state calculations by fast computing machines. Journal
of Chemical Physics, 21, 1087-1091.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from
complex samples. Psychometrika, 56, 177–196.
Mitchell, T. J., & Beauchamp, J. J. (1988). Bayesian variable selection in linear
regression. Journal of the American Statistical Association, 83(404), 1023–1032.
Montgomery, J. M., & Nyhan, B. (2010). Bayesian model averaging: Theoretical
developments and practical applications. Political Analysis, 18, 245–270.
Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of bayes
factors and the quantification of statistical evidence. Journal of Mathematical
Psychology, 72, 6-18.
Mullis, I. V. S., & Martin, M. O. (2015). PIRLS 2016 assessment framework (2nd ed.).
Boston College: TIMSS and PIRLS International Study Center. http://
timssandpirls.bc.edu/pirls2016/framework.html
Muthén, L. K., & Muthén, B. (1998–2017). Mplus user’s guide (Eighth ed.). Muthén
& Muthén.
NCES. (2001). Early childhood longitudinal study: Kindergarten class of 1998-99: Base
year public-use data files user’s manual (Tech. Rep. No. NCES 2001-029). U.S.
Government Printing Office.
NCES. (2018). Early Childhood Longitudinal Program (ECLS) - Overview. https://
nces.ed.gov/ecls/
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test
criteria for purposes of statistical inference. Biometrika, 29A, Part I, 175–240.
OECD. (2002). PISA 2000 technical report. Organization for Economic Cooperation
and Development.
OECD. (2010). PISA 2009 Results (Vol. I-VI). OECD.
OECD. (2017). PISA 2015 Technical Report. OECD.
References 233
OECD. (2018). Equity in education: Breaking down barriers to social mobil-

ity (Tech. Rep.). https://www.oecd.org/education/equity-ineducation
-9789264073234-en.htm.
OECD. (2018). PISA 2018 Technical Report. https://www.oecd.org/pisa/data/
pisa2018technicalreport/
OECD. (2019). PISA 2018 Results: (Volumes I-IV): What students know and can do.
https://doi.org/10.1787/5f07c754-en
O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson,
D. J., . . . Rakow, T. (2006). Uncertain judgements: Eliciting experts’ probabilities.
Wiley.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical
Piironen, J., & Vehtari, A. (2017). Comparison of Bayesian prediction methods for
model selection. Statistics and Computing, 27, 711-735.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. Springer.
Plummer, M. (2022). rjags: Bayesian graphical models using mcmc [Computer soft-
ware manual]. https://CRAN.R-project.org/package=rjags (R package
version 4-13)
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence
diagnosis and output analysis for MCMC. R News, 6, 7–11. http://CRAN.R
-project.org/doc/Rnews/
Polson, N. G., & Scott, J. G. (2011). Shrink globally, act locally: Sparse bayesian
regularization and prediction. In J. M. Bernardo et al. (Eds.), Bayesian statistics
9. Oxford University Press.
Press, S. J. (2003). Subjective and objective Bayesian statistics: Principles, models, and
applications (2nd ed.). Wiley.
Raftery, A. (1995). Bayesian model selection in social research (with discussion). In
P. V. Marsden (Ed.), Sociological methodology (Vol. 25, pp. 111–196). Blackwell.
Raftery, A., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). Using Bayesian
model averaging to calibrate forecast ensembles. Monthly Weather Review,
133, 1155-1174.
Raftery, A., Hoeting, J., Volinsky, C., Painter, I., & Yeung, K. Y. (2020). BMA:
Bayesian model averaging [Computer software manual]. https://CRAN.R
-project.org/package=BMA (R package version 3.18.12)
Raftery, A., Madigan, D., & Hoeting, J. A. (1997). Bayesian model averaging for
linear regression models. Journal of the American Statistical Association, 92,
179–191.
Raftery, A., & Zheng, Y. (2003). Discussion: Performance of Bayesian model
averaging. Journal of the American Statistical Association, 98, 931-938.
Ramsey, F. P. (1926). Truth and probability. In The foundations of mathematics and
other logical essays. Humanities Press.
Rässler, S. (2002). Statistical matching: A frequentist theory, practical applications, and
alternative Bayesian approaches. Springer.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and
data analysis methods (2nd ed.). Sage.
Robert, C., & Casella, G. (2011). A short history of Markov chain Monte Carlo:
Subjective recollections from incomplete data. Statistical Science, 26, 102–115.
234 References
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests.
The American Statistician, 40(4), 313–315.
Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. Chapman & Hall.
Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130–134.
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted
weights and multiple imputation. Journal of Business and Economic Statistics,
4, 87–95.
Rubin, D. B. (1987). Multiple imputation in nonresponse surveys. Wiley.
Ryan, R. M., & Deci, E. L. (2009). Promoting self-determined school engagement:
Motivation, learning, and well-being. In K. R. Wenzel & A. Wigfield (Eds.),
Handbook of motivation at school (p. 171-195). Routledge/Taylor & Francis
Group.
Savage, L. J. (1954). The foundations of statistics. Wiley.
Schad, D. J., Betancourt, M., & Vasishth, S. (2019). Toward a principled Bayesian
workflow in cognitive science. arXiv. https://arxiv.org/abs/1904.12765
Schafer, J. L. (1997). Analysis of incomplete multivariate data. Chapman & Hall/CRC.
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6,
461–464.
Shafer, J. L. (2012). norm: Analysis of multivariate normal datasets with miss-
ing values [Computer software manual]. http://CRAN.R-project.org/
package=norm (R package version 1.0-9.4, Ported to R by Alvaro A. Novo.)
Silvey, S. D. (1975). Statistical inference. CRC Press.
Sloughter, J. M., Gneiting, T., & Raftery, A. (2013). Probabilistic wind vector fore-
casting using ensembles and Bayesian model averaging. Monthly Weather
Review, 141, 2107–2119.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian
measures of model complexity and fit (with discussion). Journal of the Royal
Statistical Society, Series B (Statistical Methodology), 64, 583–639.
Stan Development Team. (2020). RStan: the R interface to Stan. http://mc-stan
.org/ (R package version 2.21.1)
Stan Development Team. (2021a). Stan modeling language users guide and
reference manual,version 2.26 [Computer software manual]. https://
mc-stan.org (ISBN 3-900051-07-0)
Stan Development Team. (2021b). Stan reference manual,version 2.30 [Computer
software manual]. https://mc-stan.org/docs/reference-manual/index
.html (ISBN 3-900051-07-0)
Statisticat, & LLC. (2021). Laplacesdemon: Complete environment for bayesian
inference [Computer software manual]. Bayesian-Inference.com. https://
cran.r-project.org/web/packages/LaplacesDemon/index.html
Steel, M. F. J. (2020). Model averaging and its use in economics. Journal of Economic
Literature, 58, 644–719.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions.
Journal of the Royal Statistical Society. Series B (Methodological), 36, 111–147.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation
and akaike’s criterion. Journal of the Royal Statistical Society. Series B (Method-
ological), 39, 44–47.
References 235
Su, Y.-S., Gelman, A., Hill, J., & Yajima, M. (2011). Multiple imputation with
diagnostics (mi) in R: Opening windows into the black box. Journal of
Statistical Software, 45(2), 1–31. http://www.jstatsoft.org/v45/i02/
Suppes, P. (1986). Comment on Fishburn, 1986. Statistical Science, 1, 347–350.
Tanner, M. H., & Wong, W. A. (1987). The calculation of posterior distributions
by data augmentation (with discussion). Journal of the American Statistical
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society. Series B (Methodological), 58, 267–288.
Tomasetti, N., Forbes, C. S., & Panagiotelis, A. (2022). Updating variational bayes:
Fast sequential posterior inference. Statistics and Computing, 32.
Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., & Najarian, M. (2009). Early child-
hood longitudinal study, kindergarten class of 1998–99 (ECLS-K), combined user’s
manual for the ECLS-K eighth-grade and K–8 full sample data files and electronic
codebooks (NCES 2009–004). National Center for Education Statistics.
Tran, M.-N., Nguyen, T.-N., & Dao, V.-H. (2021). A practical tutorial on variational
bayes. arXiv. https://arxiv.org/abs/2103.01327
Ulitzsch, E., & Nestler, S. (2022). Evaluating Stan’s variational Bayes algorithm
for estimating multidimensional IRT models. Psych, 4, 73–88. https://
www.mdpi.com/2624-8611/4/1/7
van Buuren, S. (2012). Flexible imputation of missing data. Chapman & Hall.
van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M.,
. . . Yau, C. (2021). Bayesian statistical modelling. Nature Reviews Methods
Primers, 1, 1-26.
van Erp, S. (2020). A tutorial on Bayesian penalized regression with shrinkage
priors for small sample sizes. In R. van de Schoot & M. Miočević (Eds.),
Small sample size solutions (p. 71-84). Taylor & Francis.
van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian
penalized regression. Journal of Mathematical Psychology, 89, 31-50.
Vehtari, A., Gabry, J., Yao, Y., & Gelman, A. (2019). loo: Efficient leave-one-out cross-
validation and WAIC for Bayesian models. https://CRAN.R-project.org/
package=loo (R package version 2.1.0)
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation
using leave-one-out cross-validation and WAIC. Statistics and Computing,
27, 1413–1432.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021).
Rank-normalization, folding, and localization: An improved b R for assessing
convergence of MCMC. Bayesian Analysis..
Vehtari, A., & Ojanen, J. (2012). A survey of Bayesian predictive methods for
model assessment, selection and comparison. Statistics Surveys, 6, 142-228.
DOI:10.1214/12-SS102
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. (2021). Pareto smoothed
importance sampling. https://arxiv.org/abs/1507.02646
von Davier, M. (2013). Imputing proficiency data under planned missingness in
population models. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.),
Handbook of international large-scale assessment: Background, technical issues,
and methods of data analysis. Chapman & Hall/CRC Press.
236 References
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p

values. Psychonomic Bulletin & Review, 14, 779-804. doi: 10.3758/BF03194105
Wagenmakers, E.-J., Lee, M., Lodewyckx, T., & Iverson, G. (2008). Bayesian versus
frequentist inference. In H. Hoijtink, I. Klugkist, & P. A. Boelen (Eds.),
Bayesian evaluation of informative hypotheses (pp. 181–207). Springer.
Watanabe, S. (2010). Asymptotic equivalence of bayes cross validation and widely
applicable information criterion in singular learning theory. Journal of Ma-
chine Learning Research, 11, 3571-3594.
Weakliem, D. (1999). A critique of the Bayesian information criterion for model
selection. Sociological Methods & Research, 27, 359-397.
Whitehurst, G. J., & Chingos, M. M. (2011). Class size: What research says and
what it means for state policy (Report). Brown Center on Educational Policy
at Brookings. https://www.brookings.edu/research/class-size-what
-research-says-and-what-it-means-for-state-policy/
Wigfield, A., Tonks, S., & Klauda, S. L. (2009). Expectancy-value theory. In
K. R. Wenzel & A. Wigfield (Eds.), Handbook of motivation at school (p. 55-75).
Routledge/Taylor & Francis Group.
Winkler, R. L. (1996). Scoring rules and the evaluation of probabilities. Test, 5,
1–60.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241 - 259.
Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018a). Using stacking to average
Bayesian predictive distributions (with discussion). Bayesian Analysis, 13,
917–1007.
Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018b). Yes, but did it work?:
Evaluating variational inference. https://arxiv.org/abs/1802.02538
Yeung, K. Y., Bumgarner, R. E., & Raftery, A. (2005). Bayesian model averaging:
development of an improved multi-class, gene selection, and classification
tool for microarray data. Bioinformatics, 21, 2394–2402.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression
analysis with g prior distributions. In P. Goel & A. Zellner (Eds.), Bayesian
inference and decision techniques: Essays in honor of Bruno de Finetti. Studies in
Bayesian econometrics (p. 233-243). Elsevier.
Zeugner, S., & Feldkircher, M. (2015). Bayesian model averaging employing fixed
and flexible priors: The BMS package for R. Journal of Statistical Software,
68(4), 1–37.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
67, 301 - 320.
Author Index
Akaike, H., 117 Carvalho, C. M., 180, 191

Allen, D. M., 120 Casella, G., 47, 180
Anderson, D. R., 212 Chen, J., 196
Chingos, M. M., 21
B Claeskens, G., 216
Clarke, B. S., 211, 212
Balabdaoui, F., 196 Clarke, J. L., 211
Bates, D. M., 125 Clogg, C. C., 42
Bayes, T., 9 Clyde, M. A., 195, 196, 197, 200, 204, 210, 211,
Beauchamp, J. J., 190 212
Bedford, T., 21 Cohen, J., 101
Ben-Shachar, M. S., 114 Cowles, K., 65
Berger, J., 20, 194, 195, 200 Curran, P. J., 126
Bernardo, J. M., 16, 20, 194, 195, 201, 210, 211
Best, N. G., 65, 117
D
Betancourt, M., 51, 52, 56, 104, 185, 186, 188
Birnbaum, A., 103 Dang, K. D., 70
Blackwell, M., 174 Dao, V. H., 67
Blei, D. M., 67 Dawid, A. P., 8, 11, 194
Bollen, K. A., 126 de Finetti, B., 6, 8, 17
Box, G., 58, 59 Deci, E. L., 144
Box, G. E. P., 104 Dempster, A. P., 173, 174
Breiman, L., 212 Depaoli, S., 104, 126, 143, 144, 160, 217
Brooks, S. P., 55 DeWitt, M., 145
Bryk, A. S., 125, 126, 137, 138, 141 Draper, D., 193, 196
Bumgarner, R. E., 196
Bürkner, P. C., 121, 218 E
Burnham, K. P., 212
Burstein, L., 125 Eddy, W. F., 212
Edwards, A. W. F., 23
Efron, B., 173, 175
C
Eicher, T. S., 200
Carlin, B. P., 117 Enders, C. K., 165, 169, 172, 177
Carlin, J. B., 21, 55, 105, 108 Evans, M., 32
237
238 Author Index
Everitt, B. S., 32 Hoerl, R. W., 181

Hoeting, J. A., 193, 196, 197, 198, 199, 201,
F 206
Hoffman, M. D., 51, 52, 53
Feldkircher, M., 197, 200, 203 Honaker, J., 174
Fernández, C., 196, 200, 201 Hothorn, T., 32
Fishburn, P. C., 6 Howson, C., 6
Fisher, R. A., 6 Hsiang, T. C., 180, 181
Fletcher, D., 212, 216 Huang, M., 196, 200, 201
Forbes, C. S., 70 Hwang, J., 118, 119
Foster, D., 200
Fox, J., 73, 82
I
French, S., 21
Friedman, J., 180 Iversen, E. S., 195, 196, 197, 204, 210, 211, 212
Furnival, G. M., 198 Iverson, G., 220
G J
Gabry, J., 55, 68, 119 Jaakkola, T., 67
Geisser, S., 212 Jackman, S., 17, 32, 127, 128
Gelfand, A., 212 Jeffreys, H., 19, 103, 221
Gelman, A., 18, 21, 36, 51, 52, 53, 55, 68, 104, Jordan, M., 67
105, 107, 108, 116, 117, 118, 119, 120, 123, Jöreskog, K. G., 143
125, 128, 173, 210, 221 Jose, V. R. R., 8, 201
Geman, D., 47
Geman, S., 47 K
George, E. I., 190, 196, 200
Ghahramani, Z., 67 Kadane, J. B., 103, 220
Gigerenzer, G., 101, 221 Kaplan, D., 60, 104, 117, 126, 143, 144, 151,
Gilks, W. R., 48, 50, 199 160, 166, 176, 177, 193, 196, 200, 201, 209
Gneiting, T., 8, 196, 201 Kass, R. E., 20, 112, 113, 196, 221
Goldstein, H., 125 Kennard, R. W., 180, 181
Good, I. J., 194, 201 King, G., 174
Goodrich, B., 121, 213, 218 Klauda, S. L., 144
Gronau, Q. F., 196 Kolmogorov, A. N., 3
Guihenneuc-Jouyax, C., 53 Krauss, S., 101
Kucukelbir, A., 67
Kuger, S., 60
H
Hanea, A., 21 L
Hannan, E. J., 200
Hansen, M. H., 200 Laird, N. M., 173
Harlow, L. L., 101 Lancaster, T., 16
Hastie, T., 180, 182 Laplace, P. S., 19, 34
Hastings, N. A. J., 32 Lawley, D. N., 144
Hill, J., 125, 173 Le, T., 212
Hinne, M., 196 Leamer, E. E., 193, 196
Hjort, N. L., 216 Lee, C., 196
Hoerl, A. E., 180, 181 Lee, M., 220
Author Index 239
Lewandowski, D., 45 O
Lewis, J. B., 158
Oberski, D. L., 184
Ley, E., 196, 200, 201
O’Hagan, A., 21
Liang, F., 200
Ojanen, J., 194, 195, 210, 211
Ligges, U., 32
Lindley, D. V., 6, 22, 194, 195
Linzer, D. A., 158 P
Little, R. J. A., 165, 169, 171, 174, 177 Painter, I., 197
LLC, 209 Panagiotelis, A., 70
Lodewyckx, T., 220 Papageorgiou, C., 200
Lüdecke, D., 114 Park, J. H., 32
Lunn, D., 143 Park, T., 180
Lynch, S. M., 28, 29 Paulo, R., 200
Peacock, J. B., 32
M Pearson, E. S., 6, 101
Piironen, J., 180, 186, 187
Mächler, M., 32
Pinheiro, J. C., 125
Madigan, D., 193, 196, 197, 198, 199
Plummer, M., 65, 118
Maestrini, L., 70
Polakowski, M., 196
Makowski, D., 114
Polson, N. G., 180, 191
Martin, A. D., 32
Press, S. J., 5, 6, 7
Maxwell, A., 144
McAuliffe, J. D., 67
McCullagh, P., 86 Q
McCulloch, R. E., 190 Quinn, B. G., 200
Meinfelder, F., 175 Quinn, K. M., 32
Meng, X. L., 210
Mengersen, K. L., 53 R
Merkle, E. C., 143, 160, 201
Metropolis, N., 47, 49 Raftery, A., 8, 102, 112, 113, 115, 116, 117, 193,
Mislevy, R. J., 75 196, 197, 198, 199, 200, 201, 221
Mitchell, T. J., 190 Ramsey, F. P., 6
Molina, G., 200 Rässler, S., 166, 177
Montgomery, J. M., 196 Raudenbush, S. W., 125, 126, 137, 138, 141
Morey, R. D., 116 Richardson, S., 48, 199
Mulaik, S. A., 101 Robert, C., 47
Mulder, J., 184 Robery, C. P., 53
Muthén, B., 143 Romeijn, J. W., 116
Muthén, L. K., 143 Rosenbluth, A. W., 47
Rosenbluth, M. N., 47
N Rouder, J. N., 116
Royall, R. M., 23
Nane, G., 21 Rubin, D. B., 55, 116, 117, 123, 165, 169, 170,
Nau, R. F., 8 171, 173, 174, 175, 177, 212
Nelder, J. A., 86 Ryan, R. M., 144
Nestler, S., 67, 68, 70
Neyman, J., 6, 101 S
Nguyen, T. N., 67
Nyhan, B., 196 Saul, L., 67
240 Author Index
Savage, L. J., 6 van den Bergh, D., 196

Schad, D. J., 104, 217 van der Linde, A., 117
Schafer, J. L., 165, 174, 177 van Erp, S., 179, 184
Schwarz, G. E., 115 Vasishth, S., 104
Scott, J. G., 180, 191 Vehtari, A., 55, 68, 69, 118, 119, 120, 121, 123,
Shafer, J. L., 172 180, 186, 187, 194, 195, 210, 211
Shalizi, C. R., 18, 107, 117, 123, 221 Vines, K., 65
Silvey, S. D., 47 Vitouch, O., 101
Simpson, D., 55, 68, 69, 104, 120 Volinsky, C. T., 193, 197
Sloughter, J. M., 196 von Davier, M., 60
Smith, A. F. M., 16, 194, 195, 201, 210, 211
Spiegelhalter, D. J., 48, 117, 199 W
Steel, M. F. J., 196, 200, 201, 216
Steiger, J. H., 101 Wagenmakers, E. J., 102, 196, 220, 221
Stern, H., 210 Wasserman, L., 20
Steyvers, M., 201 Watanabe, S., 119, 123
Stone, M., 120, 212 Weakliem, D., 117
Su, D., 166 Whitehurst, G. J., 21
Su, Y. S., 173 Wigfield, A., 144
Suppes, P, 17 Wilson, Jr., R. W., 198
Winkler, R. L., 8, 201
Winter, S., 160
T
Wolpert, D. H., 212
Tanner, M. H., 171 Wong, W. A., 171
Teller, A. H., 47
Teller, E., 47 Y
Tiao, G., 58, 59
Tibshirani, R., 173, 180, 183 Yajima, M., 173
Tomasetti, N., 70 Yao, Y., 55, 68, 69, 120, 157, 211, 212
Tonks, S., 144 Yavuz, S., 176, 177, 196, 201
Tran, M. N., 67 Yeung, K. Y., 196, 197
York, J., 198
Yu, B., 200
U
Ulitzsch, E., 67, 68, 70 Z
Urbach, P., 6
Zellner, A., 200
Zeugner, S., 197, 200, 203
V
Zheng, Y., 197
van Buuren, S., 171, 172, 177 Zou, H., 182
van de Schoot, R., 104, 180, 217
Subject Index
Note. Page numbers followed by f or t indicate a figure or a table.
Actual belief model, 210–211 Bayesian central limit theorem, 27–29

Ad hoc deletion methods, 166–167, 177. See Bayesian decision theory, 194–195
also Missing data Bayesian generalized linear models. See
Akaike information criterion (AIC), 117–118 Generalized linear models
Analysis of variance (ANOVA) Bayesian hierarchical model. See
intercepts and slopes as outcomes Hierarchical models
model, 137, 138, 139–141 Bayesian information criterion (BIC), 104,
random effects analysis of variance and, 114–115, 115t, 116–117, 123
128, 129–132, 134t Bayesian linear regression model, 73–85,
Assessing convergence. See Convergence 78f, 79f, 80f, 81t, 83f, 84f, 85t, 99–100, 194
diagnostics; Diagnostics Bayesian logistic regression. See Logistic
Assessment, model. See Model assessment regression
Autocorrelation function (ACF) plot Bayesian model averaging (BMA)
confirmatory factor analysis and, 149f computational considerations, 197–199
informative priors and, 84f evaluating results from, 201–209, 202t,
logistic regression, 90, 90f, 91t 204f, 206t, 207f, 208t, 209f
non-informative priors, 79–80, 80f Markov chain Monte Carlo (MCMC)
overview, 54 model composition, 199
random effects analysis of variance and, overview, 194, 196–209, 202t, 204f, 206t,
133f 207f, 208t, 209f
Axioms of probability, 3–5 parameter and model priors, 200–201
predictive modeling and, 194–195
B statistical specification of, 197
See also Model uncertainty
Base-rate fallacy, 10 Bayesian multilevel modeling. See
Bayes factors (BF), 112–115, 114t, 115t, 116– Multilevel modeling
117, 123 Bayesian multinomial regression. See
Bayes’ theorem overview, 9–11, 13–15, 29, Multinomial regression
220–222 Bayesian negative binomial regression. See
Bayesian belief models, 210–211 Negative binomial regression
Bayesian bootstrap, 171, 175–176, 176t, 212 Bayesian Poisson regression. See Poisson
Bayesian bootstrap predictive mean regression
matching (BBPMM), 175–176, 176t
241
242 Subject Index
Bayesian predictive modeling. See Conditioning on observed data, 220–221

Predictive modeling Confidence intervals, 57
Bayesian shrinkage. See Shrinkage factor Confirmatory factor analysis (CFA), 143–
Bayesian stacking, 194, 211, 212–216, 215t, 150, 148f, 149f, 150t, 160–161
219, 222 Conjugacy, 21–22
Bayesian statistical inference, 13–14 Conjugate prior distribution
Bayesian structural equation modeling, 143 informative priors and, 21–22
Bayesian workflow non-informative priors, 81–83
overview, 104, 217–219 overview, 46
posterior predictive checking, 107–112, posterior distribution and, 25–26
109f, 110f, 111f, 112f Continuous variables, 14, 218
post-estimation steps, 104 Convergence diagnostics
prior predictive checking, 104–107 autocorrelation plots, 54
prior probability distribution and, 107f confirmatory factor analysis and, 148–
Bayesianly proper, 170–177, 172t, 173t, 175t, 150, 148f, 149f, 150t
176t effective sample size, 54
Benchmark risk inflation criterion, 200 informative priors, 83, 83f, 84f
Bernoulli trial, 24 logistic regression, 89–91, 89f, 90f, 91t
Beta distribution, 25–26 non-informative priors, 78–80, 78f, 79f,
Beta prior, 40–41, 41f 80f
Beta-binomial model, 201 overview, 53–56, 70, 218
Binomial distribution, 40–42, 41f posterior density plots, 53
Binomial distribution with beta prior, potential scale reduction factor, 55
25–26 random effects analysis of variance and,
Binomial likelihood, 41, 41f 132–134, 132f, 133f, 134f
Binomial model prior, 201 trace (history) plots, 53
Binomial probability model, 24 See also Autocorrelation function (ACF)
Binomial regression, 98–99, 99t plot; Density plots; Diagnostics;
Bootstrapping methods, 171, 175–176, 176t, Histogram plots; Posterior density
212 plots; Trace (history) plots
Borrowing strength, 127 Cromwell’s rule, 22–23. See also Prior
Brier score, 8–9. See also Scoring rules distribution
C D
Calibration, 7–9, 11 Data augmentation (DA) algorithm, 171–
Casewise deletion, 167, 177 172, 172t, 177
Categorical latent variables, 218 Data distributions
Central limit theorem, 27–29 binomial distribution, 40–42, 41f
Chained equation algorithm, 171, 172–173, Inverse-Wishart (IW) distribution,
173t, 177 44–45, 46
Clustered sampling designs, 125, 218 LKJ distribution for correlation matrices,
Coherence, 3, 6–7, 7t, 220 45, 46, 46f
Complete pooling, 15–16 multinomial distribution, 42–44, 44f
Conditional exchangeability, 18, 126, 127. overview, 31–32, 46, 218
See also Exchangeability poission distribution, 38–40, 39f
Conditional probability, 4–5. See also See also Gaussian distribution; Prior
Probability distribution
Subject Index 243
Data generating process (DGP), 13, 118 Expected utility of a model, 195
Data precision, 27 Exploratory factor analysis (EFA), 143–144
Decision error types, 101–102, 221–222. Ex-post probability evaluations, 8. See also
See also Hypothesis testing; Null Scoring rules
hypothesis
Decision theory, 194–195 F
Density plots
confirmatory factor analysis (CFA) and, Finite exchangeability, 16–17. See also
149f Exchangeability
informative priors and, 84f Finnish horseshoe prior. See Regularized
logistic regression, 90, 90f horseshoe prior
posterior predictive checking, 109–110, Fisher information matrix, 20–21, 42
109f Fisherian hypothesis testing, 101, 102, 221–
random effects analysis of variance and, 222. See also Hypothesis testing
133f Fixed effects, 125
See also Convergence diagnostics; Fixed priors, 200
Posterior density plots Flexible priors, 200
Deviance information criterion (DIC), 104, Frequentist model averaging (FMA), 216
117–118, 123, 219 Frequentist parameter estimation, 47–48
Diagnostics, 68–70. See also Convergence Frequentist probability, 5–6, 11
diagnostics Frequentist statistical inference, 13–14, 103,
Different bounds, 34–35, 35f 220, 221. See also Statistical inference
Diffuse priors. See Non-informative priors;
Prior distribution G
Dirichlet distribution, 43, 44f Gamma function, 26
Discrete variables, 14 Gamma prior, 38–39, 39f
Distributions. See Data distributions; Gaussian distribution
Gaussian distribution; Prior Bayesian central limit theorem and,
distribution 28–29
Divergent transitions error message, 55–56, Gaussian prior, 32–33, 33f
190 generalized linear models and, 85–86
half-Cauchy prior, 36–37, 37f
E informative priors and, 22
Effective sample size, 54, 64. See also inverse-gamma prior, 35–36, 36f
Sample size Jeffreys’ prior for, 37–38
EM bootstrap, 173–174, 175t. See also overview, 32–38, 33f, 35f, 36f, 37f
Expectation-maximization (EM) posterior distribution and, 26
Epistemic belief, 3 uniform distribution as non-informative
Epistemic probability, 3, 6–9, 7t prior, 33–35, 35f
Evidence lower bound (ELBO), 67–68 Gaussian linear regression model, 73. See
Ex-ante probability evaluations, 8. See also also Bayesian linear regression model
Scoring rules Gaussian prior, 32–33, 33f
Exchangeability, 15, 16–18, 126–127, 128, 141. Generalized linear models, 85–86, 86t,
See also Conditional exchangeability 99–100, 218
Expectation-maximization (EM), 158–161, Generalized Pareto distribution, 69
160t, 173–174, 175t Gibbs sampler
Expected a posteriori (EAP) estimate, 56–57 Hamiltonian Monte Carlo (HMC) and,
51–52
244 Subject Index
Gibbs sampler (continued) Informative priors

missing data and, 171, 172 convergence diagnostics, 83, 83f, 84f
overview, 50–51, 70 linear regression model and, 74–85, 78f,
Global-local shrinkage priors, 184–185. See 79f, 80f, 81t, 83f, 84f, 85t
also Shrinkage factor overview, 21–22, 218
See also Prior distribution
H Integrated likelihood, 197
Intercept coefficients, 137–141, 140t
Half-Cauchy prior, 36–37, 37f Intercepts as outcomes model, 135–141, 140t
Hamiltonian Monte Carlo (HMC), 51–53, Interval summaries of the posterior
55–56, 66–70 distribution, 57–59, 59f. See also
Hannan-Quinn prior, 200 Posterior distribution
Hierarchical models, 15–16, 125, 141, 218, Intra-class correlation (ICC), 128
222. See also Multilevel modeling Inverse probability theorem. See Bayes’
Highest posterior density (HPD), 57, theorem overview
58–59, 59f, 65, 65t. See also Posterior Inverse-gamma distribution, 82
distribution Inverse-gamma prior, 35–36, 36f
Histogram plots, 110–112, 110f, 111f, 112f, Inverse-Wishart (IW) distribution, 44–45,
134f. See also Convergence diagnostics 46
History plots. See Trace (history) plots
Horseshoe prior, 180, 180f, 184–186, 189–191,
J
189t. See also Regularized horseshoe
prior Jeffreys’ prior
Hot deck imputation, 169–170 for the binomial distribution, 41–42
Hyper-g priors, 200 for Gaussian distribution, 37–38
Hyperparameters for the multinomial distribution, 43–44
informative priors and, 22 overview, 19–20, 46
non-informative priors, 81–82 for Poisson distribution, 39–40
overview, 15 See also Prior distribution
pooling and, 15–16
posterior distribution and, 25–26 K
Hyperprior distribution, 15
Hypothesis testing, 101–103, 112, 221– k-fold cross-validation (k-fold CV), 120. See
222 also Leave-one-out cross-validation
information criterion (LOO-IC)
Kolmogorov axioms of probability, 3–4
I
Kullback-Leibler divergence (KLD)
Identity link, 86. See also Link function Bayesian information criterion and,
iid assumption, 16–17. See also 116–117
Exchangeability Bayesian model averaging, 202–203, 202t,
Importance ratios, 69 209
Importance weighting, 14–15, 68–69, 218 deviance information criterion and, 118
Imputation model uncertainty, 176–177. variational Bayes and, 67
See also Multiple imputation; Single
imputation methods L
Infinite exchangeability, 16. See also
Exchangeability Label-switching problem, 70, 154–158, 155f,
Informative conjugate priors, 81–83. See also 157f, 158t
Conjugate prior distribution Laplace prior, 180f
Subject Index 245
Lasso prior, 180, 182–184, 189–191, 189t Markov chain Monte Carlo (MCMC) model
Latent class analysis (LCA) composition, 198, 199
comparison of VB to the EM algorithm, Markov chain Monte Carlo (MCMC)
158–160, 160t sampling
label-switching problem, 154–158, 155f, Bayesian central limit theorem and, 29
157f, 158t convergence diagnostics, 53–56
overview, 70, 143, 150–161, 154t, 155f, 157f, Gibbs sampler, 50–51
158t, 160t Hamiltonian Monte Carlo (HMC), 51–53
prediction of class membership, 153–154, informative priors and, 21–22
154t overview, 47–49, 70
Latent variable growth curve modeling, 126 posterior distribution and, 25
Law of likelihood, 23–25. See also random walk Metropolis-Hastings
Likelihood algorithm, 49–50
Leapfrog steps L, 52–53 variational Bayes (VB) and, 66–70
Leaps and bounds algorithm, 198 See also Posterior distribution
Least absolute shrinkage and selection Maximum a posteriori (MAP) estimate, 57,
operator (LASSO). See Lasso prior 174
Leave-one-out cross-validation information Maximum deviation, 170
criterion (LOO-IC), 104, 119–123, 137, Maximum likelihood, 47–48, 173–174, 175t.
140–141, 215, 219 See also Likelihood
Leave-one-out cross-validation (LOO-CV), Mean imputation, 168
119–120, 123, 212 Mean known, variance unknown, 35–37,
Likelihood 36f, 37f
likelihood principle, 24, 103, 123 Mean unknown, variance known, 32–33,
marginal likelihood, 113 33f
model assessment and, 104, 123 Metropolis-Hastings algorithm (M-H
overview, 14–15, 23–25, 29 algorithm)
See also Maximum likelihood Hamiltonian Monte Carlo (HMC) and,
Linear model, 86t, 218 51–52
Linear regression, 74–85, 78f, 79f, 80f, 81t, overview, 70
83f, 84f, 85t, 99–100, 194 random walk Metropolis-Hastings
Link function, 86, 86t algorithm, 49–50
Listwise deletion, 167, 177 M-frameworks, 210–211, 219
LKJ distribution for correlation matrices, Missing at random (MAR) data, 166. See
45, 46, 46f also Missing data
Local empirical Bayes prior, 200 Missing completely at random (MCAR)
Log predictive density score (LPD), 201 data, 166, 167. See also Missing data
Logarithmic scoring rule, 8–9. See also Missing data
Scoring rules ad hoc deletion methods for, 166–167
Logistic regression, 86t, 87–91, 89f, 90f, 91t, multiple imputation and, 170–177, 172t,
99–100 173t, 175t, 176t
Longitudinal data, 126 overview, 165–166, 177
Long-run frequency, 5–6, 11. See also single imputation methods, 167–170
Frequentist probability Model assessment
Bayesian information criterion and,
M 114–115, 116–117
deviance information criterion, 117–118
Mahalanobis, 170 leave-one-out cross-validation
Marginal likelihood, 113. See also information criterion, 119–123
Likelihood
246 Subject Index
Model assessment (continued) Multinomial regression, 86t, 91–94, 93t

overview, 103–112, 107f, 109f, 110f, 111f, Multiple imputation, 170–177, 172t, 173t,
112f, 123, 218 175t, 176t
posterior predictive checking, 107–112,
109f, 110f, 111f, 112f N
predictive modeling, 194–196
prior predictive checking and, 104–107, Negative binomial regression, 98–100, 99t,
107f 122–123, 217
widely applicable information criterion, Newton-Raphson algorithms, 158–160, 160t
118–119, 121–123 Neyman–Pearson approach, 5–6, 101–102,
Model averaging, 210–211. See also Bayesian 221–222. See also Hypothesis testing
model averaging (BMA) No pooling, 15–16
Model comparison Non-informative priors
Bayes factors, 112–117, 114t, 115t linear regression model and, 74–85, 78f,
Bayesian model averaging, 206–208, 207f, 79f, 80f, 81t, 83f, 84f, 85t
208t, 209f overview, 18–19, 218
deviance information criterion, 117–118 uniform distribution as, 33–35, 35f
leave-one-out cross-validation weakly informative priors and, 21
information criterion, 119–123 See also Prior distribution
overview, 104, 112–123, 114t, 115t Normalizing factor, 9–10, 24–25
widely applicable information criterion, Not missing at random (NMAR) data, 166.
118–119, 121–123 See also Missing data
Model fit, 104, 218 No-U-Turn Sampler (NUTS), 52–53, 55–56,
Model priors, 201 66–70
Model selection, 116, 219 Null hypothesis, 101–103, 221–222. See also
Model uncertainty Hypothesis testing
Bayesian stacking, 212–216, 215t Null hypothesis significance testing
imputation model uncertainty, 176–177 (NHST), 101–103. See also Hypothesis
overview, 193–194, 216, 219, 222 testing
true models, belief models and
M-frameworks, 210–211 O
See also Bayesian model averaging Observed data, 220–221
(BMA); Uncertainty Occam’s window, 197–199
Monte Carlo integration, 48. See also Optimization, 67
Markov chain Monte Carlo (MCMC) Ordinary least squares, 47–48
sampling
Monty Hall problem, 10–11
P
Multilevel modeling
exchangeability and, 126–127 Pairwise deletion, 167, 177
imputation model uncertainty and, Parameter priors, 200
176–177 Pareto-k values, 157
intercepts and slopes as outcomes Pareto-smoothed importance sampling
model, 135–141, 140t (PSIS), 68–70
overview, 125–126, 141, 218 Partial pooling, 15–16. See also Pooling
random effects analysis of variance, Personal probability, 6. See also Epistemic
127–134, 132f, 133f, 134f probability
Multinomial beta function, 43 Point estimates of the posterior
Multinomial distribution, 42–44, 44f distribution, 56–57. See also Posterior
distribution
Subject Index 247
Poisson distribution, 38–40, 39f Cromwell’s rule, 22–23

Poisson model, 172 informative priors, 21–22
Poisson regression, 86t, 94–98, 95f, 98t, Jeffreys’ prior, 19–20
99–100, 122–123, 217 model assessment and, 103–104
Pooling, 15–16 non-informative priors, 18–19
Posterior density plots, 53, 63, 63f. See also overview, 14–15, 18–23, 29, 46, 218, 219,
Density plots 222
Posterior distribution random effects analysis of variance and,
Bayesian central limit theorem and, 29 128
Bayesian information criterion and, weakly informative priors, 20–21
116–117 See also Data distributions
interval summaries of, 57–59, 59f Prior precision, 27
non-informative priors, 80–81, 81f Prior predictive checking, 104–107, 107f, 218
overview, 14–15, 25–26, 29, 46, 56–59, 59f, Prior probability distribution, 14–15,
70, 219 103–104
point estimates of, 56–57 Probability, 3, 4–5, 9–11, 220
variational Bayes and, 66–70 Probability axioms, 3–5
See also Markov chain Monte Carlo p-value, 102, 107–108, 110, 221
(MCMC) sampling p-value postulate, 221
Posterior predictive checking (PPC),
107–112, 109f, 110f, 111f, 112f, 116, 123, Q
134, 134f, 218–219. See also Model
assessment; Model comparison; Model Quadratic form, 45
selection Quadratic scoring rule (Brier score), 8–9.
Posterior predictive distribution, 116 See also Scoring rules
Posterior predictive mean matching, 175–
176, 176t R
Posterior predictive p-value (PPp), 107–108, Random effects, 125, 128
110. See also p-value Random effects analysis of variance, 127–
Posterior probability intervals (PPIs), 57–58, 134, 132f, 133f, 134f
65. See also Posterior distribution Random walk Metropolis-Hastings
Post-estimation steps. See Model algorithm (M-H algorithm), 49–50
assessment; Model comparison; Model Reference priors, 21
selection Regression, linear. See Bayesian linear
Potential scale reduction factor, 55 regression model; Linear regression
Power, 102 Regression, multinomial, 86t, 91–94, 93t. See
Precision, 27–29 also Multinomial regression
Predictive distribution, 118 Regression, Poisson. See Poisson regression
Predictive mean matching, 170, 171 Regression coefficients (slopes), 137–141,
Predictive modeling, 194–196 140t
Predictive performance, 209 Regression imputation, 168
Principle of Insufficient Reason, 20, 34 Regularized horseshoe prior, 180, 180f, 186–
Prior distribution 191, 189t. See also Horseshoe prior
Bayesian model averaging (BMA) and, Rényi axioms of probability, 4–5
200–201 Ridge prior, 180, 180f, 181–182, 189–191, 189t
confirmatory factor analysis (CFA) and, Risk inflation criterion prior, 200
144–150, 148f, 149f, 150t
248 Subject Index
S logistic regression, 89, 89f

overview, 53
Sample size, 54, 102, 219, 220
random effects analysis of variance and,
Sample space, 4
132f
Sampling weights, 218
Transition kernel, 48
Schwarz criterion. See Bayesian
Transition matrix, 48
information criterion (BIC)
True models, 210–211
Scoring rules, 7–9, 201–209, 202t, 204f, 206t,
Type I and Type II errors, 101–102, 221–222.
207f, 208t, 209f
See also Hypothesis testing; Null
Sensitivity analyses, 21, 85, 219
hypothesis
Shrinkage factor, 28–29
Significance testing, 102–103
Single imputation methods, 167–170. See U
also Missing data Uncertainty, 6. See also Epistemic
Slopes as outcomes model, 137–141, 140t probability; Model uncertainty
Sparsity Uniform distribution as non-informative
comparison of regularization methods, prior, 33–35, 35f. See also Non-
189–191, 189t informative priors
horseshoe prior, 184–186 Uniform probability distribution, 19–21
lasso prior, 182–184 Unit information priors (UIP), 117, 200
overview, 179, 180f, 191 Unnormalized posterior distribution, 14
regularized horseshoe prior, 186–189 Utility functions, 195–196
ridge prior, 181–182
Sparsity-inducing priors, 179 V
Spherical scoring rule, 8–9. See also Scoring
rules Vague priors. See Non-informative priors;
Spike-and-slab prior, 190–191 Prior distribution
Split R^, 55, 64 Validity, 221–222
Stacking. See Bayesian stacking Variable selection
Stan software program, 47, 60–66, 63f, 64f, comparison of regularization methods,
65t, 70 189–191, 189t
Statistical inference, 13–14 horseshoe prior, 184–189
Step size ∈, 52–53 lasso prior, 182–184
Stochastic regression imputation, 169 overview, 179–180, 180f, 191
Structural equation modeling (SEM), 143 ridge prior, 181–182
Subjective belief, 6. See also Epistemic Variational Bayes (VB), 66–70
probability Variational inference, 67
Subjective probability, 6. See also Epistemic
probability W
Supermodel effect, 200
Weakly informative priors, 20–21, 218. See
also Prior distribution
T Weighted values, 14–15, 68–69, 218
Target distribution, 48 Well-calibrated probability assessment,
Thinning, 54 7–8. See also Calibration
Trace (history) plots Widely applicable information criterion
confirmatory factor analysis and, 148f (WAIC), 104, 118–119, 121–123, 219
informative priors and, 83f Workflow, Bayesian. See Bayesian workflow
label-switching problem, 155–157, 155f,
157f
About the Author
David Kaplan, PhD, is the Patricia Busk Professor of Quantitative Methods

in the Department of Educational Psychology at the University of Wisconsin–
Madison. He holds affiliate appointments in the University of Wisconsin’s
Department of Population Health Sciences, the Center for Demography and
Ecology, and the Nelson Institute for Environmental Studies. Dr. Kaplan’s
program of research focuses on the development of Bayesian statistical
methods for education research. His work on these topics is directed toward
applications to large-scale cross-sectional and longitudinal survey designs.
He has been actively involved in the OECD Program for International Student
Assessment (PISA), where he served on its Technical Advisory Group from
2005 to 2009 and its Questionnaire Expert Group from 2004 to present, and
also served as chair of the Questionnaire Expert Group for PISA 2015. He also
is a member of the Design and Analysis Committee and the Questionnaire
Standing Committee for the National Assessment of Educational Progress
(NAEP).
At the national level, Dr. Kaplan is an elected member of the National
Academy of Education and served as the chair of its Research Advisory
Committee. He is a Fellow of the American Psychological Association and was
a recipient of the Samuel J. Messick Distinguished Scientific Contributions
Award from APA (Division 5). He was also past president of the Society for
Multivariate Experimental Psychology, currently president of the Psychometric
Society, and a Jeanne Griffith Fellow at the National Center for Education
Statistics. Internationally, Dr. Kaplan was a recipient of the Alexander von
Humboldt Research Award; a visiting fellow at the Luxembourg Institute for
Social and Economic Research; and is currently a fellow of the Leibniz Institute
for Educational Trajectories in Bamberg, Germany. He was appointed the
249
250 About the Author
Johann von Spix International Visiting Professor at the Universität Bamberg,

the Max Kade Visiting Professor at the Universität Heidelberg, and is currently
International Guest Professor at the Universität Heidelberg. At the University
of Wisconsin–Madison, he received the Hilldale Award for the Social Sciences.
His home page can be found at https://edpsych.education.wisc.edu/fac-staff/kaplan-
david/.

David Kaplan - Bayesian Statistics For The Social Sciences-2024

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

David Kaplan - Bayesian Statistics For The Social Sciences-2024

Uploaded by

Copyright:

Available Formats

Bayesian Statistics for the Social Sciences

Methodology in the Social Sciences

THEORY CONSTRUCTION AND MODEL-BUILDING SKILLS:

LONGITUDINAL STRUCTURAL EQUATION MODELING WITH Mplus:

COMPOSITE-BASED STRUCTURAL EQUATION MODELING:

BAYESIAN STRUCTURAL EQUATION MODELING

INTRODUCTION TO MEDIATION, MODERATION, AND CONDITIONAL

THE THEORY AND PRACTICE OF ITEM RESPONSE THEORY, SECOND EDITION

APPLIED MISSING DATA ANALYSIS, SECOND EDITION

PRINCIPLES AND PRACTICE OF STRUCTURAL EQUATION MODELING,

MACHINE LEARNING FOR SOCIAL AND BEHAVIORAL RESEARCH

BAYESIAN STATISTICS FOR THE SOCIAL SCIENCES, SECOND EDITION

LONGITUDINAL STRUCTURAL EQUATION MODELING, SECOND EDITION

Series Editor’s Note by Todd D. Little

THE GUILFORD PRESS

All rights reserved

No part of this book may be reproduced, translated, stored in a retrieval

Printed in the United States of America

This book is printed on acid-free paper.

Last digit is print number: 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data

• Chapter 1 now contains new material on coherence, Dutch book bets,

• Chapter 2 now includes an extended discussion of prior distribu-

• Chapter 3 now includes a description of Jeffreys’ prior associated

• Chapter 4 adds new material on the Metropolis-Hastings algorithm.

• Chapter 5 (Chapter 6 in the first edition) retains material on Bayes-

• Chapter 6 (Chapter 5 in the first edition) covers model evaluation

• Chapter 7 (Chapter 8 in the first edition) retains the material on

• Chapter 8 (Chapter 9 in the first edition) has been revamped in light

reader to Depaoli’s text for a more detailed treatment of Bayesian SEM

• Chapter 9 (Chapter 7 in the first edition) addresses missing data with

• Chapter 10 provides new material on Bayesian variable selection and

• Chapter 11 is now dedicated to model averaging reflecting both his-

• Chapter 12 closes the book by providing a Bayesian workflow for

International Student Assessment (PISA; OECD, 2019), the Early Childhood

Program for International Student Assessment (PISA)

Early Childhood Longitudinal Study:

Progress in International Reading Literacy Study (PIRLS)

operational advantage of less disruption to the school’s day-to-day business

dinary professionalism, but also from a deep-seated affection for her as my

2 • STATISTICAL ELEMENTS OF BAYES’ THEOREM 13

2.7 The Bayesian Central Limit Theorem

3 • COMMON PROBABILITY DISTRIBUTIONS AND THEIR PRIORS 31

4 • OBTAINING AND SUMMARIZING 47

PART II. BAYESIAN MODEL BUILDING

6 • MODEL EVALUATION AND COMPARISON 101

7 • BAYESIAN MULTILEVEL MODELING 125

8 • BAYESIAN LATENT VARIABLE MODELING 143

PART III. ADVANCED TOPICS AND METHODS

10 • BAYESIAN VARIABLE SELECTION AND SPARSITY 179

11 • MODEL UNCERTAINTY 193

11.5 Bayesian Stacking / 212

12 • CLOSING THOUGHTS 217

LIST OF ABBREVIATIONS AND ACRONYMS 223

AUTHOR INDEX 237

SUBJECT INDEX 241

ABOUT THE AUTHOR 249

The companion website

In this chapter, we consider fundamental issues in probability that underlie

1.1 Relevant Probability Axioms

1.1.1 The Kolmogorov Axioms of Probability

1.1.2 The Rényi Axioms of Probability

which states that knowing S tells us nothing about C.