
Applied Missing Data Analysis

Methodology in the Social Sciences


David A. Kenny, Founding Editor
Todd D. Little, Series Editor
www.guilford.com/MSS

This series provides applied researchers and students with analysis and research design books that
emphasize the use of methods to answer research questions. Rather than emphasizing statistical
theory, each volume in the series illustrates when a technique should (and should not) be used and
how the output from available software programs should (and should not) be interpreted. Common
pitfalls as well as areas of further development are clearly articulated.

RECENT VOLUMES

PSYCHOMETRIC METHODS: THEORY INTO PRACTICE
Larry R. Price

MEASUREMENT THEORY AND APPLICATIONS FOR THE SOCIAL SCIENCES
Deborah L. Bandalos

CONDUCTING PERSONAL NETWORK RESEARCH: A PRACTICAL GUIDE
Christopher McCarty, Miranda J. Lubbers, Raffaele Vacca, and José Luis Molina

QUASI-EXPERIMENTATION: A GUIDE TO DESIGN AND ANALYSIS
Charles S. Reichardt

THEORY CONSTRUCTION AND MODEL-BUILDING SKILLS:
A PRACTICAL GUIDE FOR SOCIAL SCIENTISTS, SECOND EDITION
James Jaccard and Jacob Jacoby

LONGITUDINAL STRUCTURAL EQUATION MODELING WITH Mplus:
A LATENT STATE–TRAIT PERSPECTIVE
Christian Geiser

COMPOSITE-BASED STRUCTURAL EQUATION MODELING:
ANALYZING LATENT AND EMERGENT VARIABLES
Jörg Henseler

BAYESIAN STRUCTURAL EQUATION MODELING
Sarah Depaoli

INTRODUCTION TO MEDIATION, MODERATION, AND CONDITIONAL
PROCESS ANALYSIS: A REGRESSION-BASED APPROACH, THIRD EDITION
Andrew F. Hayes

THE THEORY AND PRACTICE OF ITEM RESPONSE THEORY, SECOND EDITION
R. J. de Ayala

APPLIED MISSING DATA ANALYSIS, SECOND EDITION
Craig K. Enders
Applied Missing Data Analysis
SECOND EDITION

Craig K. Enders

Series Editor’s Note by Todd D. Little

THE GUILFORD PRESS


New York  London
Copyright © 2022 The Guilford Press
A Division of Guilford Publications, Inc.
370 Seventh Avenue, Suite 1200, New York, NY 10001
www.guilford.com

All rights reserved

No part of this book may be reproduced, translated, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying,
microfilming, recording, or otherwise, without written permission from the publisher.

Printed in the United States of America

This book is printed on acid-free paper.

Last digit is print number: 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data


Names: Enders, Craig K., author.
Title: Applied missing data analysis / Craig K. Enders.
Description: Second Edition. | New York : The Guilford Press, [2022] |
Series: Methodology in the social sciences | Revised edition of the
author’s Applied missing data analysis, c2010. | Includes
bibliographical references and index.
Identifiers: LCCN 2022009851 | ISBN 9781462549863 (hardcover ; alk. paper)
Subjects: LCSH: Social sciences—Statistical methods. | Missing
observations (Statistics) | Social sciences—Research—Methodology.
Classification: LCC HA29 .E497 2022 | DDC 300.1/5195—dc23/eng/20220225
LC record available at https://lccn.loc.gov/2022009851
Series Editor’s Note

People who know me, know how much I love missing data. When Craig Enders agreed
to do the first edition of this book, I was elated. Now that the second edition is here
with so much new trailblazing material, I’m simply tickled pink. This second edition
is like a biblical tome for researchers in any discipline where missing data arise. Craig
Enders is a rock star on the scientific stage, and I’m right there in the proverbial mosh
pit, grooving to every word and idea he presents in this edition. He’s encapsulated new
ideas (e.g., factored regression specifications, multilevel missing data methods, sensitiv-
ity analyses) and expanded on the “analytic pillars” (as he calls them) such as factored
regression specifications for maximum likelihood approaches and three whole chapters
dedicated to Bayesian estimation. This new edition is hands down the most comprehen-
sive, practical, and accessible book devoted to missing data.
As I wrote in the Series Editor’s Note in the first edition, missing data can be a real
bane to researchers across all social science disciplines. For most of our scientific his-
tory, we have approached missing data much like a doctor from the ancient world might
have used bloodletting to cure disease or amputation to stem infection (e.g., removing
the infected parts of one’s data by using listwise or pairwise deletion). My metaphor
should make you feel a bit squeamish, just as you should feel if you see published papers
that dealt with missing data using the antediluvian and ill-advised approaches of old.
When Craig ushered us into the age of modern missing data treatments in the first edi-
tion, I’d hoped we’d see most researchers embrace the modern treatments for missing
data. At the time, Craig captured what we knew then and presented it to us in a refresh-
ing pedagogical manner.
The field of missing data has advanced probably more than any other quantitative
topic area. In the second edition, Craig again captures what we know now and brings
it to us in the most accessible way. As before, he demystifies the arcane discussions of
missing data mechanisms and their labels (e.g., MNAR) and the esoteric acronyms of the
various techniques used to address them (e.g., FIML, MCMC, and the like).

Craig’s approachable treatise provides a comprehensive treatment of the causes of
missing data and how best to address them. He clarifies the principles by which various
mechanisms of missing data can be recovered, and provides expert guidance on which
method to implement, how to execute it, and what to report about the modern approach
you’ve chosen. Craig’s treatment deftly balances practical guidance with expert insights.
It’s rare to find a book on quantitative methods that you can read for its stated purpose
(to educate us on modern missing data procedures) and find that it treats you to a level
of insight on topics that are unmatched in the literature. Craig’s presentations of maxi-
mum likelihood, multiple imputation, and Bayesian estimation procedures, for example,
are the clearest, most understandable, and instructive discussions I’ve read—your inner
geek will be delighted, really.
Craig successfully translates the state-of-the-art technical missing data literature
into an accessible reference that you can readily rely on and use. Among the treasures
of this work are the myriad ways he shows you exactly what the technical literature
obtusely presents. Because he provides such careful guidance on the foundations and
the step-by-step processes involved, you will quickly master the concepts and issues
of this essential component of nearly all research endeavors. Another treasure is the
broad collection of real-world data examples, including a whole chapter of illustrative
examples that deal with a broad array of issues that he pragmatically and clearly guides
us through. Moreover, the accompanying website (www.appliedmissingdata.com) is one
of the richest treasures he’s produced. Here, you will find, for example, up-to-date syn-
tax files for the examples presented as well as practical details of the different software
programs for handling missing data.
As I said in the first edition, what you will learn from Craig is that missing data
imputation is not cheating. In fact, you’ll learn why the egregious scientific error would
be the business-as-usual approaches that still permeate our journals. You’ll learn that
because modern missing data procedures are so effective, they afford the use of inten-
tionally missing data designs, which often can provide more valid and generalizable
results than traditional data collection protocols. You’ll learn to rethink how you collect
data to maximize your ability to recover any missing data mechanisms. You’ll learn that
many quandaries of design and analysis become resolvable when recast as a missing
data problem. You’ll learn that Craig Enders is a gifted quantitative specialist who can
share his fountain of knowledge with diverse readers from beginners to seasoned vet-
erans. Bottom line, you’ll learn, after you read this book, to go forth and impute with
impunity!

Todd D. Little
Still virtually circumnavigating the world from home
Lubbock, Texas
Preface

Thank you for investigating the second edition of Applied Missing Data Analysis. This is
a brand-new book, written from the ground up with only a small handful of paragraphs
carrying over from the original. In fact, a lot has changed in the missing data world,
and the tools we have at our disposal today represent important leaps forward from
the now-classic methods described in the first edition. A major overhaul was needed
to put these recent innovations front and center. Consistent with the first edition, my
overarching goal for this second edition was to translate the missing data literature into
a comprehensive, accessible resource that serves both substantive researchers who use
sophisticated quantitative methods in their work and quantitative specialists. I
hope you enjoy the new edition and find it useful in your work.
Rewinding back to 2007, when I began writing the first edition of Applied Missing
Data Analysis, missing data-handling methods were more primitive than they are today.
For researchers in the social and behavioral sciences, commercial structural equation
modeling programs offered full information maximum likelihood estimation, but this
option was limited to multivariate normal data (Arbuckle, 1996). The predominant mul-
tiple imputation approach at the time—joint model imputation (Schafer, 1997)—was
similarly based on the multivariate normal distribution, and fully conditional speci-
fication imputation for mixed variable types was new to the scene (van Buuren, 2007;
van Buuren & Groothuis-Oudshoorn, 2011). The multivariate normal distribution was
(and still is) a flexible missing data-handling tool, but applying normal curve-based
approaches to real data often required compromises. A few such examples include
imputing categorical variables as though they were normal, then rounding imputes to
achieve discrete values; dummy coding level-2 units to preserve clustering effects with
multilevel missing data; and treating incomplete interaction terms as independent, nor-
mally distributed variables. The list goes on.
Cut to today, and missing data analyses rarely require meaningful compromises.
Perhaps the biggest innovation since the first edition of this book has been the develop-
ment of factored regression specifications that decompose a multivariate distribution
into a sequence of simpler univariate regression functions (Ibrahim, Chen, & Lipsitz,
2002; Lipsitz & Ibrahim, 1996). The beauty of this strategy is that it readily accommo-
dates a broad constellation of features that are incompatible with multivariate distribu-
tions such as the normal curve. This includes analyses with mixtures of discrete and
numeric variables, models with interactive and nonlinear effects, and multilevel mod-
els with random coefficients, to name a few. Factored regression specifications feature
prominently throughout the book.
The structure of the second edition mimics that of the first, albeit with different
beta weights attached to each topic. Maximum likelihood, Bayesian estimation, and
multiple imputation are again the main analytic pillars. The maximum likelihood chap-
ters cover the classic estimators described in the first edition, as well as newer methods
based on factored regressions. Whereas the first edition primarily described the Bayes-
ian framework as an estimation method co-opted for multivariate multiple imputation,
this edition takes the much broader view that Bayesian analyses are an alternative to
maximum likelihood estimation—and one that is arguably more capable and mature at
this point in history. Accordingly, this edition features three full chapters on Bayesian
estimation, including a new chapter devoted to incomplete categorical variables. The
multiple imputation chapter includes expanded coverage of two classic approaches—
joint model imputation and fully conditional specification—as well as sections devoted
to newer model-based (or substantive model-compatible) imputation strategies deriving
from factored regressions. The emergence of multilevel missing data-handling meth-
ods is another important recent development since the first edition, and there is now a
chapter devoted to this topic. Next, analyses for missing not at random processes are
now simpler than ever to implement, and the new chapter on these methods empha-
sizes sensitivity analyses that investigate the influence of untestable assumptions about
missingness. The penultimate chapter consists of a series of real data analysis examples
that illustrate a broad range of specialized topics and practical issues, and the book con-
cludes with a wrap-up chapter that provides reporting guidelines and a brief tour of the
current software landscape.
In closing, I have a long list of people to thank. First, I would like to thank the own-
ers and baristas of Espresso Cielo and Gnarwhal Coffee in Santa Monica, Groundwork
Coffee in Venice, and Rose Café in Venice for allowing me to spend countless hours
in their shops working on this book. Next, I want to thank C. Deborah Laughton for
her vast expertise and valuable advice, and for giving me a second chance to explain
missing data. Turning to the academic world, both Tim Hayes and Brian Keller read
draft chapters and provided invaluable feedback, for which I am very thankful. I would
also like to thank the initially anonymous Guilford reviewers who provided helpful
suggestions from their use of the first edition and on changes to the second: Stephen
Kilgus, Educational Psychology, University of Wisconsin; Keenan A. Pituch, Nursing
and Health Innovations, Arizona State University; and Russell G. Almond, Educational
Psychology, Florida State University. Next, I am eternally grateful to my quantitative
psychology colleagues at Arizona State. Leona Aiken, Dave MacKinnon, Roger Mill-
sap, and Steve West were incomparable mentors, and I’m honored to have worked with
such an accomplished and supportive group of people. Their influence and lessons flow
throughout this book. Turning to my current colleagues, collaborations with Han Du
have been immensely impactful, and I am fortunate to have a friend and colleague who
graciously shares her expertise on all things Bayesian. Additionally, working on missing
data problems and developing the Blimp software application with Brian Keller has been
the highlight of my academic career to date, and our collaborations had an immeasur-
able influence shaping this second edition. On a more personal note, I have been incred-
ibly fortunate to have a fulfilling academic career that I love, and I largely owe that good
fortune to two people. The first is my academic advisor, Debbi Bandalos. Debbi has had
an enormous impact on my academic career, and I continue to be the beneficiary of her
friendship, support, and guidance, all of which I value tremendously. The second is my
mother, Billie Enders. Simply put, none of this would have been possible without her
guidance and support.
Contents

1 • Introduction to Missing Data 1


1.1 Chapter Overview / 1
1.2 Missing Data Patterns / 2
1.3 Missing Data Mechanisms / 3
1.4 Diagnosing Missing Data Mechanisms / 14
1.5 Auxiliary Variables / 17
1.6 Analysis Example: Preparing for Missing Data Handling / 20
1.7 Older Missing Data Methods / 23
1.8 Comparing Missing Data Methods via Simulation / 31
1.9 Planned Missing Data / 37
1.10 Power Analyses for Planned Missingness Designs / 43
1.11 Summary and Recommended Readings / 45

2 • Maximum Likelihood Estimation 47


2.1 Chapter Overview / 47
2.2 Probability Distributions versus Likelihood Functions / 47
2.3 The Univariate Normal Distribution / 50
2.4 Estimating Unknown Parameters / 55
2.5 Getting an Analytic Solution / 58
2.6 Estimating Standard Errors / 60
2.7 Information Matrix and Parameter Covariance Matrix / 64
2.8 Alternative Approaches to Estimating Standard Errors / 67
2.9 Iterative Optimization Algorithms / 70
2.10 Linear Regression / 75
2.11 Significance Tests / 79
2.12 Multivariate Normal Data / 84
2.13 Categorical Outcomes: Logistic and Probit Regression / 90
2.14 Summary and Recommended Readings / 96


3 • Maximum Likelihood Estimation with Missing Data 98


3.1 Chapter Overview / 98
3.2 The Multivariate Normal Distribution Revisited / 99
3.3 How Do Incomplete Data Records Help? / 103
3.4 Standard Errors with Incomplete Data / 107
3.5 The Expectation Maximization Algorithm / 112
3.6 Linear Regression / 115
3.7 Significance Testing / 124
3.8 Interaction Effects / 125
3.9 Curvilinear Effects / 130
3.10 Auxiliary Variables / 132
3.11 Categorical Outcomes / 143
3.12 Summary and Recommended Readings / 145

4 • Bayesian Estimation 147


4.1 Chapter Overview / 147
4.2 What Makes Bayesian Statistics Different? / 148
4.3 Conceptual Overview of Bayesian Estimation / 149
4.4 Bayes’ Theorem / 154
4.5 The Univariate Normal Distribution / 155
4.6 MCMC Estimation with the Gibbs Sampler / 159
4.7 Estimating the Mean and Variance with MCMC / 160
4.8 Linear Regression / 166
4.9 Assessing Convergence of the Gibbs Sampler / 171
4.10 Multivariate Normal Data / 180
4.11 Summary and Recommended Readings / 185

5 • Bayesian Estimation with Missing Data 188


5.1 Chapter Overview / 188
5.2 Imputing an Incomplete Outcome Variable / 189
5.3 Linear Regression / 192
5.4 Interaction Effects / 199
5.5 Inspecting Imputations / 204
5.6 The Metropolis–Hastings Algorithm / 206
5.7 Curvilinear Effects / 211
5.8 Auxiliary Variables / 214
5.9 Multivariate Normal Data / 217
5.10 Summary and Recommended Readings / 221

6 • Bayesian Estimation for Categorical Variables 222


6.1 Chapter Overview / 222
6.2 Latent Response Formulation for Categorical Variables / 223
6.3 Regression with a Binary Outcome / 226
6.4 Regression with an Ordinal Outcome / 232
6.5 Binary and Ordinal Predictor Variables / 239
6.6 Latent Response Formulation for Nominal Variables / 244
6.7 Regression with a Nominal Outcome / 248
6.8 Nominal Predictor Variables / 252
6.9 Logistic Regression / 256
6.10 Summary and Recommended Readings / 260

7 • Multiple Imputation 261


7.1 Chapter Overview / 261
7.2 Agnostic versus Model‑Based Multiple Imputation / 262
7.3 Joint Model Imputation / 263
7.4 Fully Conditional Specification / 272
7.5 Analyzing Multiply Imputed Data Sets / 279
7.6 Pooling Parameter Estimates / 282
7.7 Pooling Standard Errors / 282
7.8 Test Statistic and Confidence Intervals / 285
7.9 When Might Multiple Imputation Give Different Answers? / 287
7.10 Interaction and Curvilinear Effects Revisited / 288
7.11 Model‑Based Imputation / 290
7.12 Multivariate Significance Tests / 293
7.13 Summary and Recommended Readings / 299

8 • Multilevel Missing Data 301


8.1 Chapter Overview / 301
8.2 Random Intercept Regression Models / 302
8.3 Random Coefficient Models / 313
8.4 Multilevel Interaction Effects / 320
8.5 Three‑Level Models / 324
8.6 Multiple Imputation / 331
8.7 Joint Model Imputation / 332
8.8 Fully Conditional Specification Imputation / 338
8.9 Maximum Likelihood Estimation / 343
8.10 Summary and Recommended Readings / 346

9 • Missing Not at Random Processes 348


9.1 Chapter Overview / 348
9.2 Missing Not at Random Processes Revisited / 349
9.3 Major Modeling Frameworks / 349
9.4 Selection Models for Multiple Regression / 352
9.5 Model Comparisons and Individual Influence Diagnostics / 358
9.6 Selection Model Analysis Examples / 361
9.7 Pattern Mixture Models for Multiple Regression / 367
9.8 Pattern Mixture Model Analysis Examples / 374
9.9 Longitudinal Data Analyses / 379
9.10 Diggle–Kenward Selection Model / 382
9.11 Shared Parameter (Random Coefficient) Selection Model / 384
9.12 Random Coefficient Pattern Mixture Models / 385
9.13 Longitudinal Data Analysis Examples / 388
9.14 Summary and Recommended Readings / 399

10 • Special Topics and Applications 401


10.1 Chapter Overview / 401
10.2 Descriptive Summaries, Correlations, and Subgroups / 401
10.3 Non‑Normal Predictor Variables / 407
10.4 Non‑Normal Outcome Variables / 417
10.5 Mediation and Indirect Effects / 422
10.6 Structural Equation Models / 428
10.7 Scale Scores and Missing Questionnaire Items / 439
10.8 Interactions with Scales / 449
10.9 Longitudinal Data Analyses / 457
10.10 Regression with a Count Outcome / 462
10.11 Power Analyses for Growth Models with Missing Data / 465
10.12 Summary and Recommended Readings / 469

11 • Wrap‑Up 470
11.1 Chapter Overview / 470
11.2 Choosing a Missing Data‑Handling Procedure / 470
11.3 Software Landscape / 473
11.4 Reporting Results from a Missing Data Analysis / 474
11.5 Final Thoughts and Recommended Readings / 483

Appendix: Data Set Descriptions 485

References 493

Author Index 519

Subject Index 529

About the Author 546

The companion website (www.appliedmissingdata.com)


includes datasets and analysis examples from the book,
up-to-date software information, and other resources.
1

Introduction to Missing Data

1.1 CHAPTER OVERVIEW

It goes without saying that missing data are a pervasive interdisciplinary problem. Not
surprisingly, how we deal with the issue can have a major impact on the validity of sta-
tistical inferences and the substantive conclusions from a data analysis. In a highly cited
paper nearly 20 years ago, Schafer and Graham (2002) described maximum likelihood
estimation and Bayesian multiple imputation as “state-of-the-art” missing data-handling
procedures. A lot has changed since then, and these approaches are now considerably
more mature and far more capable than they were at the time. The Bayesian paradigm has
simultaneously gained in popularity and is now an important alternative to maximum
likelihood and multiple imputation rather than an estimation method co-opted for the
latter. This trio of contemporary analytic approaches forms the core of the book, which
I’ve rewritten from the ground up to showcase new developments and applications.
Modern missing data-handling procedures have a lot to offer, but we need to under-
stand when and why they work. The first half of this chapter sets the stage with a sum-
mary of Rubin and colleagues’ theoretical framework for missing data problems (Little &
Rubin, 1987, 2020; Mealli & Rubin, 2016; Rubin, 1976). This nearly universal classifica-
tion system comprises three missing data mechanisms or processes that describe differ-
ent ways in which the probability of missing values relates to the data. From a practical
perspective, Rubin’s mechanisms function as data analysis assumptions that dictate the
validity of our statistical inferences. As you will see, these assumptions involve mostly
untestable propositions, although we can take steps to make certain conditions more
plausible. This includes leveraging additional variables that carry information about the
missing values but are not part of the main analysis plan.
The middle section of the chapter describes a small selection of older missing data-
handling methods. Methodologists have been studying missing data problems for the
better part of a century, and the statistical literature is replete with potential solutions,
most of which are historical footnotes. Researchers are now broadly aware that bet-
ter options are available, so I limit this section to a small collection of strategies you
may still encounter in published research articles or statistical software packages. I use
computer simulation studies to highlight the shortcomings of these methods relative to
modern approaches such as maximum likelihood estimation.
The chapter concludes with sections on planned missing data designs that intro-
duce intentional missing values as a device for reducing respondent burden or lowering
research costs. Purposefully creating missing data might seem like a bad idea, but this
strategy is perfectly appropriate and cannot introduce bias. Although analyzing fewer
data points necessarily reduces power, the reduction can be surprisingly small, espe-
cially for longitudinal variants of these designs. I describe strategies for creating good
designs, and I illustrate how to use computer simulations to vet their power.

1.2 MISSING DATA PATTERNS

A missing data pattern refers to the configuration of observed and missing values in
a data set. This term should not be confused with a missing data mechanism, which
describes possible relationships between the data and one’s propensity for missing val-
ues. Roughly speaking, patterns describe where the holes are in the data, whereas mech-
anisms describe why the values are missing. Figure 1.1 shows six prototypical missing
data patterns, with shaded areas representing the location of the missing values. The
univariate pattern in panel a has missing values isolated on a single variable. This pat-
tern could occur, for example, in an experimental setting where outcome scores are
missing for a subset of participants. A univariate pattern is one of the earliest missing
data problems to receive attention in the statistics literature, and a number of classic
resources are devoted to this topic (e.g., Little & Rubin, 2020, Ch. 2). Panel b shows a
monotone missing data pattern from a longitudinal study where individuals with miss-
ing data at a particular measurement occasion are always missing subsequent measure-
ments. Monotone patterns received attention in the early literature, because this con-
figuration of missing values can be treated without complicated iterative estimation
algorithms (Jinadasa & Tracy, 1992; Schafer, 1997, pp. 218–238).
The general pattern in panel c has missing values scattered throughout the entire
data matrix. Importantly, the three contemporary methods that form the core of this
book—maximum likelihood, Bayesian estimation, and multiple imputation—work well
with this configuration, so there is generally no reason to choose an analytic method
based on the missing data pattern alone. Panel d illustrates a planned missing data
pattern where three of the variables are intentionally missing for a large proportion
of respondents (Graham, Hofer, & MacKinnon, 1996; Graham, Taylor, Olchowski, &
Cumsille, 2006). As described later in the chapter, planned missingness designs can
reduce respondent burden and research costs, often with minimal impact on statistical
power. Panel e shows a pattern where a latent variable (denoted Y4*) is missing for the
entire sample. This pattern will surface in Chapter 6 with categorical variable models
that view discrete responses as arising from an underlying latent variable distribution
(Albert & Chib, 1993; Johnson & Albert, 1999).
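Patterns like these can be tabulated directly from a data set before any modeling begins. The short sketch below (written in Python with pandas, which is my own choice of tools; the book itself does not prescribe software here) codes each row's configuration of observed and missing values and counts how often each pattern occurs:

```python
import numpy as np
import pandas as pd

# A small hypothetical data set; np.nan marks a missing value.
df = pd.DataFrame({
    "Y1": [13, 19, np.nan, 17, 22],
    "Y2": [30, 38, 18, np.nan, 26],
    "Y3": [np.nan, 28, 8, 28, 12],
})

# 1 = missing, 0 = observed, matching the indicator coding introduced
# later in the chapter; each row's string of 0s and 1s is its pattern.
indicators = df.isna().astype(int)
patterns = indicators.astype(str).agg("".join, axis=1)
print(patterns.value_counts())
```

Rows sharing the string "000" are complete cases; a pattern that appears for only a tiny slice of the sample is worth a closer look before estimation.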
One final configuration warrants attention, because it can introduce estimation
problems for modern missing data-handling procedures. For lack of a better term, I refer

[Figure omitted: six panels, each depicting bars for variables Y1–Y4 (panel e shows a latent variable Y4* in place of Y4): (a) Univariate Pattern, (b) Monotone Pattern, (c) General Pattern, (d) Planned Missingness, (e) Latent Variable Pattern, (f) Underidentified Pattern.]

FIGURE 1.1. Six missing data patterns. The gray shaded areas of each bar represent missing observations.

to the configuration in panel f as an underidentified missing pattern, because the data
provide insufficient support for estimation. The figure depicts a situation where the pro-
portion of cases with data on both Y3 and Y4 is so small that it would be difficult or impos-
sible to estimate the bivariate association between these variables. This pattern often
occurs with pairs of categorical variables, where unbalanced group sizes and missing
data combine to produce very low or even zero cell counts in a cross-­tabulation table. It
is important to screen for this configuration prior to conducting a missing data analysis.

1.3 MISSING DATA MECHANISMS

Rubin and colleagues (Little & Rubin, 1987; Rubin, 1976) introduced a classification
system for missing data problems that is virtually universal in the literature. This work
outlines three missing data mechanisms or processes that describe different ways in
which the probability of missing values relates to the data: missing completely at ran-
dom (MCAR), missing at random (MAR), and missing not at random (MNAR). From a
practical perspective, these processes are vitally important, because they function as
statistical assumptions for a missing data analysis. However, the terms can be confusing
(e.g., missing at random refers to a systematic process), and published research articles
sometimes conflate their meaning. In the years since Rubin’s seminal work, methodolo-
gists have clarified certain aspects of his original definitions (Mealli & Rubin, 2016;
Raykov, 2011; Seaman, Galati, Jackson, & Carlin, 2013) and have added special sub-
types of processes (Diggle & Kenward, 1994; Little, 1995). As an aside, I mostly avoid
acronyms throughout the book, but I generally refer to missing data mechanisms by
their abbreviations.

Partitioning the Data


Rubin’s missing data theory envisions a hypothetically complete data set partitioned into
observed and missing components. To illustrate, Table 1.1 shows a data excerpt from a
sample of 500 observations and three variables. The complete data in the leftmost set of
columns is partly imaginary, because some of its values are missing. The would-be scores
are shown in bold typeface. The table’s middle two sets of columns separate the observed
and missing parts of the data. Symbolically, this partition is Y(com) = (Y(obs), Y(mis)), where
Y(com) denotes the hypothetically complete data, Y(obs) represents the observed scores,
and Y(mis) contains the would-be values of the missing data. Although Y(com) and Y(mis)
are fairly standard in the literature, other sources use Y(0) and Y(1) (Little & Rubin, 2020;
Mealli & Rubin, 2016).
The missing data mechanisms described below are essentially models that explain
whether a participant has missing values and how those tendencies relate to the real-
ized data in Y(obs) or Y(mis). The target of these missingness models is a set of missing
data indicators that functions as random variables. We may or may not need to specify

TABLE 1.1. Would-Be Complete Data Partitioned into Observed and Missing Parts

            Complete         Observed          Missing        Indicators
  ID     Y1   Y2   Y3     Y1   Y2   Y3     Y1   Y2   Y3     M1   M2   M3
   1     13   30   15     13   30    —      —    —   15      0    0    1
   2     19   38   28     19   38   28      —    —    —      0    0    0
   3     20   18    8     20   18    8      —    —    —      0    0    0
   4     17   39   28      —   39    —     17    —   28      1    0    1
   5     22   26   12     22   26   12      —    —    —      0    0    0
 ...    ...  ...  ...    ...  ...  ...    ...  ...  ...    ...  ...  ...
 496     14   36   22      —   36   22     14    —    —      1    0    0
 497     28   12    7     28    —    7      —   12    —      0    1    0
 498     22   30   10     22   30   10      —    —    —      0    0    0
 499     24   38   13     24   38   13      —    —    —      0    0    0
 500     29    8    8      —    —    8     29    8    —      1    1    0

distributions for these variables, but they are nevertheless integral to the theory. The
rightmost set of columns in Table 1.1 shows the matrix of binary missing data indicators
M that code whether scores are observed or missing; Mv = 0 if a participant’s score on
variable Yv is observed, and Mv = 1 if Yv is missing.
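This partition is easy to demonstrate in code. The sketch below (Python/pandas, my own choice of tools rather than the book's) reproduces the first few cases of Table 1.1, where only the first case is missing Y3:

```python
import numpy as np
import pandas as pd

# Hypothetically complete scores for the first three cases in Table 1.1.
Y_com = pd.DataFrame({"Y1": [13, 19, 20], "Y2": [30, 38, 18], "Y3": [15, 28, 8]})

# Indicator matrix M: 0 = observed, 1 = missing (case 1 is missing Y3).
M = pd.DataFrame(0, index=Y_com.index, columns=Y_com.columns)
M.loc[0, "Y3"] = 1

# Partition the complete data: Y(com) = (Y(obs), Y(mis)).
Y_obs = Y_com.where(M == 0)   # NaN marks the unseen scores
Y_mis = Y_com.where(M == 1)   # NaN marks the observed scores
print(Y_obs)
print(Y_mis)
```

In a real analysis only Y_obs and M are available; Y_mis exists only hypothetically, which is precisely the point of the partition.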
Missing data mechanisms describe different ways in which the pattern of 0’s and
1’s may relate to the realized data in Y(obs) or Y(mis). Rubin’s framework describes three
possibilities: The MCAR mechanism stipulates that the propensity for missing values is
unrelated to the data; an MAR process posits that missingness is related to the observed
parts of the data only; and an MNAR mechanism allows missingness to depend on the
unseen scores. To make each mechanism more concrete, I used computer simulation to
create bivariate data sets that conform exactly to each process. I modeled the artificial
samples after the perceived control over pain and depression variables from the chronic
pain data set on the companion website. This data set includes psychological correlates
of pain severity (e.g., depression, pain interference with daily life, perceived control
over pain) from a sample of N = 275 individuals suffering from chronic pain. Figure
1.2 shows the scatterplot of the hypothetical complete data (i.e., Y(com)) for an artificial
sample of the same size. The contour rings convey the perspective of a drone hovering
over the peak of the bivariate normal population distribution. I subsequently deleted
50% of the depression scores following each mechanism.

FIGURE 1.2. Complete-data scatterplot showing the would-be values of two variables (depression plotted against perceived control over pain) from a sample of 275 participants.

Missing Completely at Random


A missing completely at random mechanism states that the probability of missing val-
ues is unrelated to both the observed and missing parts of the realized data. This process
is what researchers think of as purely haphazard missingness. The formal definitions of
Rubin’s mechanisms involve the conditional distribution of the indicator variables in M
given the realized data in Y(obs) and Y(mis). The distribution for an MCAR process is

Pr(M = 1 | Y(obs), Y(mis), φ) = Pr(M = 1 | φ)    (1.1)

where φ is a set of missingness model parameters that link the data to the indicators (e.g.,
φ could contain logistic or probit regression coefficients). The left side of the expression,
which contains the full complement of possible associations between the indicators and
the data, says that the probability of a missing score depends on both the observed and
missing parts of the data, as well as some parameters that dictate missingness. The
MCAR process on the right side of the expression simplifies by eliminating all depen-
dence on the realized data. In other words, the equation says that all participants have
the same chance of missing values, and the parameters in φ define the overall probabili-
ties of missing data.
A directed acyclic graph is a useful graphical tool for representing the missing data
mechanism in Equation 1.1 (Mohan, Pearl, & Tian, 2013; Thoemmes & Mohan, 2015).
Figure 1.3a depicts an MCAR process involving a complete variable, X, an incomplete
variable, Y, and a binary missing data indicator, MY. The white circle labeled Y repre-
sents the hypothetically complete variable (i.e., the combination of Y(mis) and Y(obs)), and
the circle labeled Y * represents realized values of Y (i.e., Y * = Y when the missing data

[Figure 1.3: three directed acyclic graphs. (a) MCAR: X → Y; Y → Y* ← MY; no arrows into MY. (b) MAR: the same, plus X → MY. (c) MNAR: the same, plus X → MY and Y → MY.]

FIGURE 1.3. Directed acyclic graphs that depict missing data processes involving one com-
plete variable, X, one incomplete variable, Y, and a binary missing data indicator, MY. The white
circle labeled Y represents the hypothetically complete variable, and the circle labeled Y * denotes
the realized values of Y.

indicator MY = 0 and is missing whenever MY = 1). Two features of the graph convey
an MCAR mechanism. First, the absence of arrows pointing to MY indicates that all
sources of missingness are contained in the indicator and no other variables predict
nonresponse. Second, directed acyclic graph rules tell us that the unseen values of Y are
unrelated to MY, because the MY → Y * ← Y path connecting the two variables is blocked
by a third variable with two incoming arrows (Y * is a so-called “collider variable”).
Rubin’s missing data mechanisms can further be viewed as distributional assump-
tions for the missing values. The definition in Equation 1.1 implies that the missing and
observed scores share the same overall (marginal) distributions. To illustrate this point,
I randomly removed 50% of the artificial depression scores from the complete data set
in Figure 1.2 (i.e., missingness was determined by an electronic coin toss). Figure 1.4
shows the scatterplot of the resulting data, with gray circles representing complete cases
and black crosshairs denoting partial data records with perceived control over pain
scores but no depression values. Figure 1.4 shows that missing scores are unsystematically dispersed throughout the entire distribution, such that the circles and crosshairs completely overlap, with no differences in their center, spread, or association. The graph
highlights that the observed data are a simple random sample of the hypothetically
complete data set.
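A small simulation makes this point concrete. The sketch below is my illustration, not the book's code; the population mean vector and covariance matrix are invented stand-ins for the perceived control and depression variables. It imposes 50% MCAR deletion and confirms that the observed scores keep the complete-data mean and spread:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented bivariate normal population standing in for perceived control (X)
# and depression (Y); these parameter values are illustrative assumptions.
mean = [20.0, 14.0]
cov = [[27.0, -14.0], [-14.0, 36.0]]
X, Y = rng.multivariate_normal(mean, cov, size=100_000).T

# MCAR: an electronic coin toss decides missingness, ignoring every score.
missing = rng.random(Y.size) < 0.50
Y_obs = Y[~missing]

# The observed cases are a simple random sample of the would-be complete
# data, so their mean and spread match the full sample's.
print(Y.mean() - Y_obs.mean(), Y.std() - Y_obs.std())  # both near zero
```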

FIGURE 1.4. Scatterplot showing an MCAR process where 50% of the scores are missing hap-
hazardly in a way that does not depend on the data. Circles denote complete observations, and
crosshairs denote pairs with missing depression scores.

Missing at Random
A missing at random mechanism states that the probability of missing values is related
to the observed but not the missing parts of the realized data. The formal definition of
this process is as follows:

Pr(M = 1 | Y(obs), Y(mis), φ) = Pr(M = 1 | Y(obs), φ)    (1.2)

The right side of the equation says that the would-be scores in Y(mis) carry no additional
information about missingness above and beyond that in the observed data. The term
missing at random is often misunderstood, because it seems to imply a haphazard pro-
cess instead of a systematic one. Rather, the phrase means that missingness is purely
random after conditioning on or controlling for the observed data. Said differently, two
participants with identical observed score profiles would share the same chance of miss-
ing values, whereas two participants with different observed score profiles would have
different missingness rates. To clarify this idea, Graham (2009) refers to this mecha-
nism as conditionally missing at random (CMAR), and I often do so as well.
The directed acyclic graph in Figure 1.3b depicts an MAR process that features a
directed arrow from X to MY. The graph shows that the unseen values in Y are poten-
tially related to missingness via the MY ← X → Y path (in the parlance of this graphical
framework, Y and MY are said to be d-connected). Graphing rules further tell us that
conditioning on X eliminates the association between Y and MY (i.e., satisfies a condi-
tionally MAR process) by closing the MY ← X → Y path. Procedurally, conditioning on X
means that the missing data-­handling procedure leverages all available data, including
the partial records for observations with missing Y values. The three analytic pillars of
this book—­maximum likelihood, Bayesian estimation, and multiple imputation—­do
just that.
To further illustrate an MAR mechanism, I deleted 50% of the artificial depression
scores in Figure 1.2 following a process where the chance of a missing value increased
as perceived control over pain decreased (e.g., participants with little control over their
pain were more likely to experience pain-­related disruptions that could lead them to
drop out of the study). The selection process was relatively strong, with the predicted
probability of missing data increasing from about 16% at one standard deviation above
the perceived control mean to 84% at one standard deviation below the mean. Figure
1.5 shows the scatterplot of the data, with gray circles again representing complete cases
and black crosshairs denoting partial data records with perceived control scores but no
depression values. The figure clearly depicts a systematic process where missing scores
are primarily located on the left side of the contour plot. Unlike Figure 1.4, the gray
circles (cases with complete data on both variables) are no longer representative of the
hypothetically complete data, because there are too many scores at the high end of the
perceived control distribution and too few at the low end.
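The selection process just described can be sketched in code (my illustration; the distribution parameters are invented). Missingness follows a logistic model of the observed predictor, with a slope chosen so the deletion probability is roughly .16 one standard deviation above the predictor's mean and .84 one standard deviation below it:

```python
import numpy as np
from scipy.special import expit  # inverse-logit function

rng = np.random.default_rng(2)

# Invented bivariate normal sample standing in for perceived control (X)
# and depression (Y); all parameter values are illustrative assumptions.
mean, cov = [20.0, 14.0], [[27.0, -14.0], [-14.0, 36.0]]
X, Y = rng.multivariate_normal(mean, cov, size=100_000).T

# MAR selection on the observed predictor: logit(.84) is about 1.66, so a
# slope of -1.66 on standardized X puts the deletion probability near .16
# one SD above the X mean and near .84 one SD below it.
z_x = (X - X.mean()) / X.std()
missing = rng.random(Y.size) < expit(-1.66 * z_x)

# Complete cases over-represent high-control participants, so their means
# no longer match the would-be complete data.
print(X[~missing].mean() - X.mean())  # positive shift
print(Y[~missing].mean() - Y.mean())  # negative shift (X and Y correlate negatively)
```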
An MAR mechanism can also be viewed as a distributional assumption for the
missing values. The definition in Equation 1.2 implies that the observed and unseen
values of a variable share the same distribution after controlling for the observed values
of other variables (i.e., the two sets of scores follow the same conditional distribution).


FIGURE 1.5. Scatterplot showing an MAR process where 50% of the depression scores are
missing for participants with low perceived control over pain values. Circles denote complete
observations, and crosshairs denote pairs with missing depression scores.

Applied to the bivariate normal data in Figure 1.5, this assumption stipulates that the
observed and missing depression scores are normally distributed around points on the
regression line and share the same constant variation (i.e., the depression distribution
is the same for any two individuals with the same perceived control over pain score,
regardless of whether they have missing data). Visually, this feature is evident by the fact
that the circles and crosshairs lock together like puzzle pieces around the regression line
from the hypothetically complete data.
Viewing the MAR process as a distributional assumption provides intuition about
the inner workings of contemporary analytic procedures. Although they do so in dif-
ferent ways, maximum likelihood, Bayesian estimation, and multiple imputation all
attempt to infer the location of the missing values based on the corresponding observed
data. Consider the task of imputing a missing depression score. Given a suitable esti-
mate of the regression line, the MAR process implies that imputations can be sampled
from normal distributions centered along the regression line. To illustrate, Figure 1.6
shows the distribution of plausible imputations at three values of perceived control over
pain. Candidate imputations fall exactly on the vertical hashmarks, but I added horizon-
tal jitter to emphasize that more scores are located at higher contours near the regression
line. Randomly selecting one of the circles from each distribution generates an imputed
depression score (technically, imputations are not restricted to the circles displayed in
the graph and could be selected from anywhere in the normal distribution, but you get
the idea). In fact, Bayesian estimation and multiple imputation both invoke an iterative
version of this exact procedure.
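A minimal sketch of this sampling step, assuming hypothetical regression estimates (the intercept, slope, and residual SD below are invented placeholders for values a real analysis would estimate from the observed data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented stand-ins for estimates of the regression of depression (Y) on
# perceived control (X).
b0, b1, resid_sd = 24.0, -0.50, 5.0

# One plausible imputation per missing case: a normal draw centered on the
# regression line, as the conditionally MAR assumption implies.
x_missing = np.array([10.0, 20.0, 30.0])   # X values for incomplete cases
y_hat = b0 + b1 * x_missing                # centers on the regression line
imputations = rng.normal(loc=y_hat, scale=resid_sd)

# Repeated draws at X = 20 recover the assumed conditional distribution:
# mean b0 + b1 * 20 = 14 and standard deviation resid_sd = 5.
many_draws = rng.normal(loc=y_hat[1], scale=resid_sd, size=50_000)
print(many_draws.mean(), many_draws.std())
```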
Finally, an MAR process is very general and readily extends to multivariate data,
although it is more awkward to think about in this context. Returning to the data in
Table 1.1, the mechanism must be viewed on a pattern-by-pattern basis. Considering the
first row of data (and all other rows where only Y3 is missing), an MAR process requires
that Y3’s missingness is fully explained by Y1 and Y2. Moving to the fourth row of data,
the mechanism requires that the likelihood of a pattern where Y1 and Y3 are both miss-
ing depends only on Y2. Notice that this condition contradicts the statement for the first
row, which allows missing values on Y3 to depend on Y1. As a final example, the mecha-
nism requires the chance of missing both Y1 and Y2 (the pattern in the bottom row of the
table) to depend only on Y3. Again, parts of this proposition are at odds with conditions
that govern other patterns. Despite its somewhat clunky construction with multivariate
data, Little and Rubin (2020, p. 23) argue that an MAR process is a better approximation
to reality than the simpler MCAR mechanism.

FIGURE 1.6. Distributions of plausible depression imputations at three values of perceived control over pain. Candidate imputations fall exactly on vertical hashmarks, but I added horizontal jitter to emphasize that more scores are located near the regression line.

Missing Not at Random


A missing not at random mechanism (also referred to as a not missing at random pro-
cess) states that the probability of missing values is related to the observed and missing
parts of the data. The formal definition of this mechanism is as follows.

Pr(M = 1 | Y(obs), Y(mis), φ)    (1.3)

Unlike the previous expressions, the conditional distribution of the missing data indi-
cators doesn’t simplify and features two distinct determinants of missingness. Under
such a process, two participants with identical observed score profiles no longer have
the same chance of a missing value, as the would-be scores themselves carry additional
information above and beyond the observed data. Gomer and Yuan (2021) refer to Equa-
tion 1.3 as diffuse MNAR, because missingness depends on both components of the
hypothetically complete data, and they define a focused MNAR process as one that
depends only on the unseen values in Y(mis).

Pr(M = 1 | Y(mis), φ)    (1.4)

Although there is no way to differentiate MNAR subtypes from the observed data, the
authors argue that the distinction is important, because diffuse and focused processes
can differentially impact one’s analysis results. I return to this issue in Chapter 9.
The directed acyclic graph in Figure 1.3c depicts a (diffuse) MNAR process involv-
ing the same variables as before. The graph suggests that the unseen values in Y are
potentially related to missingness via the MY ← X → Y path and the Y → MY path. As
before, conditioning on X closes the MY ← X → Y path, thereby eliminating part of the
association between Y and its missingness indicator. However, the would-be values of
Y still influence missingness via their direct pathway to MY. Graphing rules tell us that
a pair of connected variables adjacent in a chain cannot be separated, so conditioning
on the observed data does not eliminate the dependence between Y and its missing data
indicator.
To further illustrate an MNAR mechanism, I deleted 50% of the artificial depres-
sion scores in Figure 1.2 following a process where participants with higher levels of
depression were more likely to have missing values (e.g., those with acute symptoms
would leave the study to seek treatment elsewhere). The selection process was relatively
strong, with the predicted probability of missing data increasing from about 16% at
one standard deviation below the depression mean to 84% at one standard deviation
above the mean. Figure 1.7 shows the scatterplot of the data, with gray circles again
representing complete cases and black crosshairs denoting partial data records with
perceived control scores but no depression values. The figure illustrates a systematic
process where missing scores are primarily located in the top half of the contour plot
above the regression line. The gray circles (cases with complete data on both variables)
are clearly unrepresentative of the hypothetically complete data.
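For comparison with the MAR sketch, the snippet below (again my illustration with invented population parameters) lets the would-be depression scores themselves drive missingness and shows the resulting complete-case bias, which conditioning on X cannot remove:

```python
import numpy as np
from scipy.special import expit  # inverse-logit function

rng = np.random.default_rng(4)

# Invented bivariate normal sample standing in for perceived control (X)
# and depression (Y); all parameter values are illustrative assumptions.
mean, cov = [20.0, 14.0], [[27.0, -14.0], [-14.0, 36.0]]
X, Y = rng.multivariate_normal(mean, cov, size=100_000).T

# MNAR selection on the would-be outcome itself: deletion probability near
# .16 one SD below the depression mean and near .84 one SD above it.
z_y = (Y - Y.mean()) / Y.std()
missing = rng.random(Y.size) < expit(1.66 * z_y)

# Complete cases now understate depression; no amount of conditioning on X
# recovers the full-sample mean because Y itself drives the missingness.
print(Y[~missing].mean() - Y.mean())  # clearly negative
```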
Unlike the conditionally MAR mechanism, which stipulates that the observed
and missing scores share the same distribution after controlling for other variables,


FIGURE 1.7. Scatterplot showing an MNAR process where 50% of the depression scores are
missing for participants with high depression. Circles denote complete observations, and cross-
hairs denote pairs with missing depression scores.

an MNAR process implies that the two sets of scores have different distributions. This
situation is clear in Figure 1.7, where the vast majority of the missing scores are above
the regression line, and the complete data are mostly below the line. This feature makes
imputation considerably more difficult, because there are no data with which to esti-
mate the unique parameters of the missing data distribution. For example, leveraging
the perceived control over pain scores alone would create imputations that fall on either
side of the regression line, and there is no way to formulate an appropriate adjustment
without knowing the unseen depression values. As you will see, analytic procedures for
MNAR processes (e.g., selection models or pattern mixture models) can only counteract
this indeterminacy by invoking relatively strong assumptions about the unseen data.

Mechanisms and Inference


A subtle nuance about Rubin’s mechanisms is that they describe missingness in a specific
data set; that is, the indicators in M are fixed at their realized values, and the definitions
make no reference to missingness patterns or observed data that could arise from differ-
ent samples. Rubin’s (1976) seminal work clarifies that an MAR mechanism is necessary
for obtaining valid maximum likelihood estimates (the same is true for Bayesian estimation and multiple imputation), but this conclusion does not hold for standard errors and
significance tests that rely on the frequentist framework and repeated sampling argu-
ments (Kenward & Molenberghs, 1998; Savalei, 2010).
Getting accurate measures of uncertainty under a particular process requires
a stricter assumption that the same missingness process always generates data sets.
Returning to Equation 1.2, valid inferences require the MAR definition to hold for any
Y(obs) that you could have worked with, not just the Y(obs) in a particular sample of data.
Statisticians refer to this condition as missing always at random (Bojinov, Pillai, &
Rubin, 2020; Mealli & Rubin, 2016) or everywhere missing at random (Seaman et al.,
2013), and Mealli and Rubin (2016) define parallel conditions for MCAR and MNAR pro-
cesses known as missing always completely at random and missing not always at random,
respectively. Because missingness mechanisms are so prevalent throughout the book, I
refer to them by their simpler monikers, with the understanding that measures of uncer-
tainty and significance tests require slightly different definitions.

Ignorable and Nonignorable Missingness


The terms ignorable and nonignorable missingness are often used synonymously to
refer to conditionally MAR and MNAR processes, respectively. In fact, these terms have
a somewhat broader definition, although the distinction is relatively unimportant in
practice. Rubin’s classification scheme features two models: the focal analysis model you
would have estimated had the data been complete, and a model that describes the miss-
ingness mechanism. These models have parameters θ and φ, respectively. The param-
eters in φ, whatever they happen to be, are essentially a nuisance, because they are
unrelated to the substantive research goals. A key question is, in what situations can
we simply estimate θ from the observed data without worrying about or estimating the
missingness model and the parameters in φ? This is the essence of ignorability.
The missingness model is said to be ignorable if (1) the missing values follow a
conditionally MAR process, and (2) the nuisance parameters in φ carry no information
about the focal parameters in θ (i.e., φ and θ are distinct). Bayesian analyses further
require that the two models have independent prior distributions. As mentioned pre-
viously, the missing data indicators in M function as random variables that follow a
distribution. The left side of the equation below is a shorthand way of writing the joint
(multivariate) distribution of the observed data and the missing data indicators.

f(Y(obs), M | θ, φ) = f(M | Y(obs), φ) × f(Y(obs) | θ)    (1.5)

I use generic function notation f(∙) throughout the book to represent distributions in the
abstract without specifying their type or form (e.g., “f of something” could be a normal
curve, a binomial distribution). If the parameters in θ and φ are independent, applying
rules of probability gives the factorization on the right side of the equation. The missing-
ness model is ignorable in this case, because f(M|Y(obs), φ) functions as a constant, and
estimating the focal model parameters from the observed data gives the same results
with or without this term. In contrast, the missingness model is said to be nonignorable
if the missing values follow an MNAR process or the nuisance parameters in φ carry
information about the focal parameters in θ. In this situation, we can only get valid esti-
mates of θ by pairing the focal analysis model with an additional model for missingness
(see Chapter 9).
Ignorability is ultimately something we just take on faith, because there is no
way to evaluate either of its propositions. Referring to distinctness, Schafer (1997, p. 11) says, “In many situations this is intuitively reasonable, as knowing θ [the focal
model’s parameters] will provide little information about ξ [the missingness model’s
parameters] and vice-versa.” The MAR part of the assumption can be a bit trickier. Van
Buuren (2012, p. 33) warns that “the label ‘ignorable’ does not mean that we can be
entirely careless about the missing data,” and he goes on to emphasize that satisfying
this assumption requires the missing data-­handling procedure to condition on all the
important determinants of missingness. The next three sections address this point in
more detail.

1.4 DIAGNOSING MISSING DATA MECHANISMS

Unfortunately, the observed data do not contain the necessary information to evaluate
a conditionally MAR or MNAR mechanism, because both make propositions about the
unseen scores—­the former says the would-be values are unrelated to missingness after
conditioning on the observed data, and the latter says they are related. Although meth-
odologists have proposed various diagnostic procedures for evaluating these conditions
(Bojinov et al., 2020; Potthoff, Tudor, Pieper, & Hasselblad, 2006; Yuan, 2009a), the
validity of contemporary missing data-­handling procedures ultimately relies on untest-
able assumptions and our own expert knowledge about the data and possible reasons for
missingness. This leaves an unsystematic MCAR process as the only mechanism with
testable propositions.
In truth, evaluating whether missingness is consistent with an unsystematic pro-
cess isn’t necessarily useful, because contemporary methods do not require this strict
assumption, and finding that haphazard missingness is (or is not) plausible does not
change the recommendation to use these approaches. To this point, Raykov (2011,
p. 428) suggests that “the desirability of the MCAR condition has been frequently over-
rated in empirical social and behavioral research,” and I couldn’t agree more. Never-
theless, the logic of evaluating an MCAR process warrants brief discussion, because
applications of MCAR tests abound in published research articles, and it is important to
understand what these tests do and do not tell us about the missing data.
As explained previously, an MCAR process implies that missing and observed scores
share the same overall (marginal) distributions; that is, even without conditioning on
the observed data, the observed and would-be scores have identical means, variances,
and associations with other variables. Kim and Bentler (2002) refer to this condition as
homogeneity of means and covariances. Methodologists have proposed numerous proce-
dures for evaluating the MCAR mechanism (Chen & Little, 1999; Jamshidian & Jalal,
2010; Kim & Bentler, 2002; Little, 1988b; Muthén, Kaplan, & Hollis, 1987; Park & Lee,
1997; Raykov & Marcoulides, 2014), most of which involve comparing features of the
observed data across different missing data patterns. I focus on two simple approaches
that consider group mean differences, as these methods enjoy widespread use and are
readily available in statistical software.

Univariate Pattern Mean Differences


Perhaps the simplest way to check for an unsystematic process is to form groups of cases
with observed or missing scores on a variable Yv and examine mean differences on other
variables (Dixon, 1988; Raykov & Marcoulides, 2014). For lack of a better term, I refer to
this as the pattern mean difference approach. Returning to the hypothetical data in Table
1.1, this method compares whether the M1 groups differ on Y2 or Y3, the M2 groups differ
with respect to Y1 or Y3, and the M3 groups differ on Y1 or Y2.
Returning to the artificial data in Figure 1.4, the pattern mean difference approach
creates a missing data indicator that codes whether depression scores are missing or
observed and compares the perceived control over pain group means. The n(obs) = 134
observations with depression scores had a mean of M(obs) = 20.08, and the n(mis) = 141
cases with missing data had an average of M(mis) = 20.52. This difference equates to less
than one-tenth of a standard deviation unit, which is well below Cohen’s (1988) small
effect size benchmark of |d| > 0.20. Because I created missing values by randomly delet-
ing half the scores, it isn’t surprising that the mean difference is nonsignificant, t(273)
= .71, p = .48. Raykov (2011) explains that the absence of group differences is necessary
but insufficient for demonstrating a purely random process. As such, a safe interpreta-
tion is that the data do not contain evidence that refutes the MCAR mechanism.
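The pattern mean difference check is easy to script. The sketch below is illustrative; it generates its own MCAR data (with invented parameters and the same N = 275 as the chronic pain example) rather than using the book's data set, then builds the missing data indicator, runs the two-group t test, and computes Cohen's d from the pooled standard deviation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Invented MCAR data: X complete, half of the Y values deleted at random.
mean, cov = [20.0, 14.0], [[27.0, -14.0], [-14.0, 36.0]]
X, Y = rng.multivariate_normal(mean, cov, size=275).T
m_y = rng.random(275) < 0.50          # missing data indicator for Y

# Pattern mean difference check: compare X means across indicator groups.
x_obs, x_mis = X[~m_y], X[m_y]
t_stat, p_value = stats.ttest_ind(x_obs, x_mis)

# Cohen's d from the pooled standard deviation.
n1, n2 = x_obs.size, x_mis.size
pooled_var = ((n1 - 1) * x_obs.var(ddof=1) + (n2 - 1) * x_mis.var(ddof=1)) / (n1 + n2 - 2)
d = (x_obs.mean() - x_mis.mean()) / np.sqrt(pooled_var)
print(t_stat, p_value, d)
```

Under a true MCAR process, a significant result here would be a false positive, which is why the absence of a difference is necessary but not sufficient evidence.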
It is instructive to apply the pattern mean difference approach to data that are not
MCAR. Returning to the artificial data in Figure 1.5, participants with low perceived
control over pain were more likely to have missing depression scores. In this case,
observations with and without depression scores have a perceived control mean of M(obs)
= 23.27 and M(mis) = 17.19, respectively. This difference is equivalent to 1.43 standard
deviation units (well above Cohen’s large effect size benchmark of |d| > 0.80) and is
statistically significant, t(273) = –11.82, p < .001. Importantly, the significant differ-
ence implies there is evidence against a purely unsystematic process, but it says nothing
about whether a conditionally MAR process is plausible. As a final example, reconsider
the MNAR mechanism from Figure 1.7, where participants with elevated depressive
symptoms were more likely to have missing depression scores. Despite a very different
underlying process, the pattern mean difference is significant and in the same direction,
M(obs) = 21.47 versus M(mis) = 19.13, t(273) = –3.80, p < .001. This example highlights that
the observed data cannot differentiate MAR and MNAR processes. A significant group
mean difference implies there is evidence against an MCAR process and nothing more.
Significance tests of pattern mean differences come with a few important caveats.
First, a large data set with many variables can yield a staggering number of tests, and
correlations among variables allow a univariate difference to masquerade as several sig-
nificant comparisons. Raykov, Lichtenberg, and Paulson (2012) outline a multiple com-
parison procedure for this situation, and the multivariate tests of group differences are
another option for mitigating false flags (Kim & Bentler, 2002; Little, 1988b). Second,
significance tests often suffer from very low power, making them dubious tools for argu-
ing in favor of an unsystematic missingness process. In particular, the power of such
tests will be at a maximum when a variable has 50% missing data, because its missing-
ness indicator has equal group sizes. Conversely, lower (or higher) missing data rates
cause unbalanced group sizes and lower power. To illustrate, consider the conditionally
MAR process depicted in Figure 1.5. Achieving power equal to .80 with a 50% missing
data rate requires a standardized pattern mean difference of |d| > 0.34 or larger (a small
effect size). Had I instead deleted 10% of the data (i.e., group sizes of n(obs) = 247 and n(mis)
= 28), the effect size requirement to achieve the same power increases to |d| > 0.56 or
larger (a medium effect size). Finally, a pattern mean difference does not automatically
imply that the variable in question is a source of nonresponse bias, as the variable’s cor-
relation with the focal analysis variables also plays an important role (Collins, Schafer,
& Kam, 2001). I return to this point in Section 1.5.
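The power figures quoted above can be checked with a standard normal approximation to the two-sample mean comparison, power ≈ Φ(|d|·√(n1n2/(n1 + n2)) − z.975). This is a sketch of my own, not the calculation the book reports, but it lands on the same values:

```python
import numpy as np
from scipy import stats

def power_two_group(d, n1, n2, alpha=0.05):
    """Normal-approximation power for a two-sample mean comparison."""
    ncp = abs(d) * np.sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_crit - ncp)

# 50% missing data (balanced indicator groups, N = 275): |d| = 0.34
# gives power near .80.
balanced = power_two_group(0.34, 137, 138)
# 10% missing data (n = 247 vs. 28): |d| must grow to about 0.56.
unbalanced = power_two_group(0.56, 247, 28)
print(balanced, unbalanced)  # both close to .80
```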

Little’s MCAR Test


Little (1988b) proposed a multivariate extension of the pattern mean difference approach
that simultaneously evaluates mean differences across a set of variables. The test defines
G groups of cases that share the same missing data pattern, and it computes the arithme-
tic means of each pattern’s observed data. These pattern-­specific means are then com-
pared to maximum likelihood estimates of the grand means. Chapter 3 gives a detailed
description of maximum likelihood missing data handling, but for now it is sufficient
to know that the estimator leverages the entire sample’s observed data without discard-
ing any information. Finally, a test statistic uses the maximum likelihood estimate of
the variance–­covariance matrix to standardize differences between the pattern-­specific
means and the grand means. The sum of these standardized differences should be rela-
tively small and close to 0 if scores are MCAR.
Little’s test statistic is as follows:
TL = Σ(g=1 to G) ng (Ȳg − μ̂g)′ Ŝg⁻¹ (Ȳg − μ̂g)    (1.6)

where G is the number of missing data patterns, ng is the number of cases in missing data pattern g, Ȳg contains the arithmetic means for that group, and μ̂g and Ŝg contain the rows and columns of μ̂ and Ŝ (the maximum likelihood estimates) that correspond to the observed variables in Ȳg. The parentheses contain deviations between pattern g's
arithmetic averages and the corresponding grand means, and these are squared (and
summed) via matrix multiplication. Multiplying by the inverse of the covariance matrix
(the matrix analogue of division) standardizes the discrepancies, such that the numeri-
cal value of TL is a weighted sum of G squared z-scores. If values are missing completely
at random, TL is approximately distributed as a chi-square statistic with Σvg − V degrees
of freedom, where vg is the number of observed scores in pattern g, and V is the total
number of variables. Consistent with the mean difference approach, a significant test
statistic suggests that missingness is not purely random.
To illustrate Little’s test, reconsider the conditionally MAR process depicted in
Figure 1.5. In practice, the primary motivation for using Little’s test is to evaluate a
larger number of variables in Yg, but a bivariate application is useful for illustrating the
mechanics of the equation. To begin, the maximum likelihood estimates of the grand
means and variance–­covariance matrix are as follows:
 20.31  ˆ  27.27 −13.80 
= μˆ =  S   (1.7)
 14.29   −13.80 36.15 
These means are the benchmark against which to compare pattern-­specific means. There
are just two missing data patterns in this example: n(obs) = 141 observations have scores
on both variables (i.e., v1 = 2), and n(mis) = 134 cases have missing depression scores (i.e.,
v2 = 1). The pattern-­specific arithmetic means for the two groups are as follows:
 23.27   17.19 
= Y1 =  Y2   (1.8)
 12.79   NA 
Substituting the estimates into Equation 1.6 gives the following test statistic:

  23.27   20.31  ′  27.27 −13.80    23.27   20.31  


TL =
141 ×   −       −  
  12.79   14.29    −13.80 36.15    12.79   14.29   (1.9)
(17.19 − 20.31)2
+ 134 × =
98.27
27.27

If an unsystematic process generated the data, this test statistic should approximate a
chi-square statistic with Σvg − V = (2 + 1) − 2 = 1 degree of freedom. The test is statistically significant, TL(1) = 98.27, p < .001, indicating that the MCAR mechanism is not
plausible for these data. In a multivariate application with more than two variables, a
significant test statistic indicates that two or more patterns differ, but the test gives no
indication about which variables might be responsible.
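The mechanics of Little's test can be sketched in a few lines of Python. This is a minimal illustration (the helper names are mine, and the example handles only the one- and two-variable patterns needed here); because the printed estimates are rounded to two decimals, the result only approximates the TL = 98.27 reported in the text.

```python
import math

def invert_2x2(m):
    """Invert a 2x2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(diff, s_inv):
    """Compute diff' * s_inv * diff for a small vector and matrix."""
    k = len(diff)
    return sum(diff[i] * s_inv[i][j] * diff[j]
               for i in range(k) for j in range(k))

def littles_T(patterns, mu_hat, sigma_hat):
    """Sum n_g * (ybar_g - mu_g)' Sigma_g^{-1} (ybar_g - mu_g) over patterns.

    Each pattern is (n_g, observed_variable_indices, pattern_means);
    mu_hat and sigma_hat are the ML grand mean vector and covariance matrix.
    """
    T = 0.0
    for n_g, obs, ybar_g in patterns:
        diff = [ybar_g[i] - mu_hat[j] for i, j in enumerate(obs)]
        sub = [[sigma_hat[r][c] for c in obs] for r in obs]
        s_inv = [[1.0 / sub[0][0]]] if len(obs) == 1 else invert_2x2(sub)
        T += n_g * quad_form(diff, s_inv)
    return T

# Rounded estimates from the chapter's bivariate example
mu_hat = [20.31, 14.29]
sigma_hat = [[27.27, -13.80], [-13.80, 36.15]]
patterns = [(141, [0, 1], [23.27, 12.79]),  # complete cases
            (134, [0], [17.19])]            # depression missing
T_L = littles_T(patterns, mu_hat, sigma_hat)
# df = (2 + 1) - 2 = 1; for 1 df the chi-square p-value is erfc(sqrt(T/2))
p = math.erfc(math.sqrt(T_L / 2.0))
```

Running the sketch on the rounded inputs gives a test statistic in the general vicinity of, but not identical to, the printed value; either way, p < .001 and the MCAR hypothesis is rejected.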

1.5 AUXILIARY VARIABLES

A conditionally MAR mechanism will be our default assumption until Chapter 9. To


refresh, this process stipulates that the would-be scores in Y(mis) are unrelated to whether
a participant has missing values after conditioning on the observed data. There are at least
two ways this assumption could be violated. First, the unseen scores themselves might
predict missingness above and beyond the observed data, as in Figure 1.3c. The only
way to counteract nonresponse bias in this scenario is to fit a specialized model that
pairs the focal analysis with a nuisance model for missingness (e.g., a selection model or
pattern mixture model). Alternatively, the unseen scores may be associated with miss-
ingness, because the missing data-­handling procedure simply failed to condition on
certain variables. In this situation, the MAR assumption could be satisfied by controlling
for additional or different variables. This scenario is not hard to imagine in practice, as
real-world data sets often have hundreds of variables, and conditioning on every one of
them is infeasible. For lack of a better term, I refer to this situation as MNAR by omission.
To illustrate an MNAR-by-­omission process, suppose that the focal analysis model
is the linear regression of Y on X:

Yi = β0 + β1 X i + ε i (1.10)

Moreover, suppose that the outcome is missing due to another measured variable A that
also correlates with Y. Figure 1.8a shows a directed acyclic graph that depicts theoretical
associations among the three variables and the missing data indicator, MY. As before, Y
represents the hypothetically complete variable, and Y * represents realized values of Y
(i.e., Y * equals Y when the missing data indicator MY = 0 and is missing whenever MY =
1). Graphing rules imply that Y is potentially related to missingness via two pathways:
MY ← X → Y and MY ← A → Y.
As explained previously, directed acyclic graphs clarify that conditioning on or
controlling for the middle variable in a path eliminates the dependency between the two
outer variables. The regression model conditions on X and therefore eliminates part of
the association between Y and MY by closing the MY ← X → Y path. However, Y and MY
are still related via the MY ← A → Y path, so the analysis induces an MNAR-by-­omission
process, because it fails to condition on A. Whether the open path introduces substantial
bias depends on the magnitudes of the associations between A and MY and between A and Y (Collins et
al., 2001), but the analysis is nevertheless at odds with the MAR assumption.
Perhaps the simplest way to condition on A is to simply include it as an additional
covariate in the analysis model as follows:

Yi = β0 + β1 X i + β2 Ai + ε i (1.11)

This analysis is consistent with an MAR process, because it eliminates all sources of
dependency between Y and MY. However, the model achieves this desirable status by

[Figure 1.8: three directed acyclic graphs, panels (a) Category A, (b) Category B, and (c) Category C, each relating X, Y, Y *, A, and MY.]

FIGURE 1.8. Directed acyclic graphs that depict missing data processes involving one com-
plete variable, X, one incomplete variable, Y, a binary missing data indicator, MY, and an auxiliary
variable A. The white circle labeled Y represents the hypothetically complete variable, and the
circle labeled Y * denotes the realized values of Y.

modifying the meaning of a focal parameter—­the β1 coefficient is now a partial slope
that reflects the net influence of X above and beyond that of A, a variable that wasn’t
slated to appear in the analysis had the data been complete. Chapters 3 and 5 describe
better ways to condition on A that don’t involve modifying the focal analysis model, but
this example nevertheless highlights the importance of conditioning on variables that
may not be part of the original analysis plan.
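A small simulation makes the consequences of MNAR by omission concrete. The sketch below is entirely hypothetical (the data-generating values and variable names are mine): Y depends on X and an auxiliary variable A, and Y goes missing whenever A is high. The complete-case mean of Y is then biased, while an estimate that conditions on A recovers the truth.

```python
import random
import statistics

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

random.seed(2022)

# Hypothetical population model: Y depends on X and on A
N = 20000
X = [random.gauss(0, 1) for _ in range(N)]
A = [random.gauss(0, 1) for _ in range(N)]
Y = [5.0 + 0.5 * x + 0.7 * a + random.gauss(0, 1) for x, a in zip(X, A)]

# Y is missing whenever A > 0, so missingness is MAR given A,
# but the process is MNAR by omission for any analysis that ignores A
keep = [a <= 0 for a in A]
y_obs = [y for y, k in zip(Y, keep) if k]
a_obs = [a for a, k in zip(A, keep) if k]

# Complete-case mean of Y is biased: the high-A, high-Y cases are gone
cc_mean = statistics.mean(y_obs)

# Conditioning on A: regress Y on A among the observed cases, then
# average the predicted values over every case in the sample
slope = cov(a_obs, y_obs) / statistics.variance(a_obs)
intercept = statistics.mean(y_obs) - slope * statistics.mean(a_obs)
adj_mean = statistics.mean(intercept + slope * a for a in A)
```

With the true mean of Y equal to 5.0, the complete-case estimate lands well below it, whereas the A-conditioned estimate sits close to the target; this is the logic behind bringing auxiliary variables into the missing data handling.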

Inclusive Analysis Strategy


The possibility of an MNAR-by-­omission process has prompted methodologists to rec-
ommend a so-­called inclusive analysis strategy that introduces auxiliary variables into
the focal analysis model or into the imputation process (Collins et al., 2001; Rubin,
1996; Schafer, 1997; Schafer & Graham, 2002). An auxiliary variable is an extraneous
variable that carries important information for missing data handling but is not part of
the focal analysis (or analyses). Conditioning on such variables can fine-tune a missing
data analysis, either by reducing nonresponse bias or improving precision. Collins et al.
(2001) categorize candidate variables into three buckets: variables that (1) correlate with
an analysis variable Y and its missingness indicator MY, (2) correlate with an analysis
variable but not its missingness indicator, and (3) correlate with the missing data indica-
tor but not the analysis variable. The directed acyclic graphs in Figure 1.8 depict these
patterns of associations.
The number of variables in many data sets is often so large that an overinclusive
strategy is not viable. Reducing a large set of candidate auxiliary variables into one or
two principal components is one way to attack this problem (Howard, Rhemtulla, &
Little, 2015), but a more tailored approach that selects a small handful of variables often
works just as well. Conditioning on category A variables like the one in Figure 1.8a is
the top priority, because doing so can improve power and reduce nonresponse bias that
results from an MNAR-by-­omission process. Moreover, preference should be given to
auxiliary variables with the strongest semipartial correlations, as variables that account
for unique variation in the missing variables have the most to offer. Next, condition-
ing on category B auxiliary variables does not affect bias, but it can improve power by
leveraging additional sources of correlation. Again, an auxiliary variable’s semipartial
association with the incomplete variables is more important than its bivariate correla-
tion. Finally, conditioning on category C auxiliary variables offers no benefits at all. It
might seem counterintuitive that ignoring a correlate of missingness (e.g., a variable that
exhibits a pattern mean difference) doesn’t introduce bias, but the directed acyclic graph
in Figure 1.8c clarifies that an MNAR-by-­omission process isn’t possible, because A is
not located on a path that connects Y to MY. The figure reinforces my earlier statement
that a pattern mean difference doesn’t necessarily signal a source of nonresponse bias.
The utility of an auxiliary variable ultimately boils down to the magnitude of its
semipartial correlations with the incomplete analysis variables, as failing to condition
on extra variables with weak correlations is unlikely to introduce bias, nor is including
such variables going to replace a meaningful amount of missing information. Raykov
and West (2015) described a latent variable modeling approach to estimating semipartial
correlations with a set of candidate auxiliary variables. Of course, any general-­purpose
statistical software application can estimate these associations, but most do so after
discarding incomplete data records. The advantage of Raykov and West’s approach is
that it leverages maximum likelihood missing data handling (or alternatively, Bayesian
estimation). Again, we don’t know how maximum likelihood estimation works yet, but
for now it is sufficient to know that the estimator leverages the entire sample’s observed
data without discarding any information.
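To fix ideas, a semipartial (part) correlation can be computed by residualizing the auxiliary variable on the analysis variables and correlating the residuals with the incomplete variable. The sketch below is a bare-bones complete-data version with a single covariate, not Raykov and West's latent variable model; the function names are mine.

```python
import statistics

def pearson(u, v):
    """Pearson correlation of two equal-length score lists."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) *
           sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

def residualize(v, x):
    """Residuals from the simple regression of v on x."""
    mx, mv = statistics.mean(x), statistics.mean(v)
    slope = (sum((a - mx) * (b - mv) for a, b in zip(x, v)) /
             sum((a - mx) ** 2 for a in x))
    intercept = mv - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, v)]

def semipartial(aux, y, x):
    """Correlate y with the part of aux not explained by x."""
    return pearson(y, residualize(aux, x))
```

An auxiliary variable whose semipartial correlation with an incomplete analysis variable approaches the ±.30 guideline discussed below would be a strong candidate for inclusion.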
A respectable semipartial correlation signals an auxiliary variable that contains
unique information about the missing values above and beyond that already contained
in the analysis. How large does this correlation need to be in order to reap the benefits
of conditioning on the additional variable or suffering the consequences of ignoring
it? Simulation studies in Collins et al. (2001) provide some insights. The Collins et al.
article examined auxiliary variables with semipartial correlations equal to .32 and .72.
Not surprisingly, failing to condition on a variable with a very strong correlation usu-
ally produced a bias-­inducing MNAR-by-­omission process. In contrast, ignoring a vari-
able with the smaller correlation often gave acceptable parameter estimates with little
to no bias. Based on these results, it seems reasonable to focus on auxiliary variables
with semipartial correlations at least as strong as Cohen’s (1988) medium effect size
benchmark of ±0.30. Fortunately, we don’t need to be too discerning about this cutoff,
because these simulations showed no serious consequences of overfitting with a large
set of uncorrelated variables. Nevertheless, limiting the number of auxiliary variables
is often necessary in practice, because modeling strategies for introducing these extra
variables can be prone to convergence failures (e.g., the saturated correlates model; Gra-
ham, 2003).
Finally, although the literature has long favored an inclusive strategy (Collins et al.,
2001; Rubin, 1996; Schafer, 1997; Schafer & Graham, 2002), it is hypothetically possible
that conditioning on an auxiliary variable could enhance rather than reduce nonre-
sponse bias. This could happen, for example, if an auxiliary variable’s correlation with
an analysis variable and its missingness indicator is fully explained by an unmeasured
latent variable. It is unclear how often the constellation of associations needed to cause
this problem actually occurs in practice, but interested readers can find an illustration
of this phenomenon in Thoemmes and Rose (2014).

1.6 ANALYSIS EXAMPLE: PREPARING FOR MISSING DATA HANDLING

In practice, assuming a conditionally MAR process is usually a good starting point,


because this mechanism is more realistic than a purely unsystematic one. Moreover,
the three pillars of this book—­maximum likelihood, Bayesian estimation, and mul-
tiple imputation—­naturally leverage this assumption by default. This section serves as a
bookend that integrates previous ideas and illustrates two steps to prepare for an MAR-
based missing data analysis: comparing participants with and without missing data, and
identifying potential auxiliary variables.
To provide a substantive context, I use the chronic pain data on the companion
website to illustrate a regression analysis with missing data. The data set includes psy-
chological correlates of pain severity (e.g., depression, pain interference with daily life,
perceived control) for a sample of N = 275 individuals with chronic pain. Because the
missing data mechanism is an assumption for a specific analysis, I build the example
around a linear regression model where depressions scores are a function of pain inter-
ference with daily life activities and a binary severe pain indicator (0 = no, little, or mod-
erate pain, 1 = severe pain).

DEPRESSi = β0 + β1 ( INTERFEREi ) + β2 ( PAIN i ) + ε i (1.12)

Approximately 7.3% of the binary pain ratings are missing, and the missing data rates
for the depression and pain interference scales are 13.5 and 10.5%, respectively. I use
these same variables in Chapter 10 to illustrate missing data handling for a mediation
analysis, and I incorporate auxiliary variables from this illustration.

Identifying Correlates of Missingness


Researchers routinely use the pattern mean difference approach to explore whether
cases with missing values differ from those with observed data. To illustrate the pro-
cedure, I created three missing data indicators that code whether the analysis variables
are missing. As before, each dummy code equals 0 if a score is observed and 1 if it is
missing. There is no need to examine pattern mean differences for variables already in
the analysis, because contemporary missing data-­handling approaches automatically
condition on this information. Instead, I focus on six continuous variables outside the
analysis model: age, exercise frequency, anxiety, stress, perceived control over pain,
and psychosocial disability (a construct capturing pain’s impact on emotional behaviors
such as psychological autonomy and communication, emotional stability, etc.). Three of
the candidate variables also have missing data, but incomplete auxiliary variables can
still be beneficial as long as their scores are mostly observed whenever the analysis vari-
ables are missing (Enders, 2008).
Statistical significance tests are not that valuable for this application, because they
lack power due to the highly unbalanced group sizes (e.g., based on n(obs) = 238 and
n(mis) = 37, the depression scale requires a standardized mean difference effect size of
nearly 0.50 to achieve .80 power). Instead, Table 1.2 gives the standardized mean dif-
ference effect size for each indicator and auxiliary variable. The pain severity indicator
produced three comparisons that exceeded Cohen’s (1988) small effect size benchmark
of ±0.20 (exercise frequency, anxiety, and stress), and the depression indicator produced
a single difference of this magnitude (anxiety). Researchers often use logistic regression
to predict missingness indicators from study variables, so I also applied this procedure
to the example. The logistic analyses further revealed that the set of auxiliary variables
explained about 3–4% of the variation in the severe pain and depression indicators, with
the anxiety scale producing the largest partial slope.
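The standardized mean differences reported in Table 1.2 can be computed with a pooled standard deviation, as in Cohen's d. A minimal sketch follows (the helper names are mine, and None marks a missing score in this illustration):

```python
import statistics

def cohens_d(observed_group, missing_group):
    """Standardized mean difference with a pooled standard deviation."""
    n0, n1 = len(observed_group), len(missing_group)
    v0 = statistics.variance(observed_group)
    v1 = statistics.variance(missing_group)
    pooled_sd = (((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)) ** 0.5
    return (statistics.mean(missing_group) -
            statistics.mean(observed_group)) / pooled_sd

def pattern_mean_difference(aux_scores, analysis_scores):
    """Split an auxiliary variable by whether the analysis variable is
    missing (None), then return the standardized mean difference."""
    observed = [a for a, y in zip(aux_scores, analysis_scores)
                if y is not None and a is not None]
    missing = [a for a, y in zip(aux_scores, analysis_scores)
               if y is None and a is not None]
    return cohens_d(observed, missing)
```

Applied to the chronic pain data, this comparison would split, say, the anxiety scores by the depression missingness indicator and return an effect size like the 0.33 shown in Table 1.2.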
Before going further, it is useful to step back and take stock of what we can and can-
not learn from mean comparisons. First, we can conclude that an unsystematic MCAR
process is not plausible for the linear regression analysis—­it may be reasonable for a dif-
ferent analysis with a different configuration of variables, but not for this model. Second,

TABLE 1.2. Standardized Mean Differences Comparing Observed and Missing Cases on Six Auxiliary Variables
Missing data indicators
Auxiliary variable Pain Pain interference Depression
Age 0.03 –0.07 0.00
Exercise Frequency 0.30 –0.16 –0.13
Anxiety 0.43 0.11 0.33
Stress 0.24 0.00 –0.08
Control 0.07 –0.06 0.09
Disability –0.14 –0.10 –0.01

univariate mean differences do not condition on the focal variables, so the effect sizes
in Table 1.2 do not say whether a given auxiliary variable predicts missingness above
and beyond the variables already in the analysis. Finally, mean differences alone do not
signal a problem, as a bias-­inducing MNAR-by-­omission process also requires salient
semipartial correlations with the analysis variables.

Identifying Correlates of Incomplete Variables


Next, I used Raykov and West’s (2015) latent variable model to estimate the semipartial
correlations between the auxiliary variables and the three analysis variables (the same
analysis could be performed in standard statistical software using pairwise deletion).
Table 1.3 gives the semipartial correlations and their significance tests. As suggested
previously, semipartial correlations in the neighborhood of Cohen’s (1988) medium
effect size benchmark of ±0.30 are good candidates for auxiliary variables, as ignoring
such variables could create a bias-­inducing MNAR-by-­omission process if the missing
data rates are large enough (Collins et al., 2001). This rule of thumb selects three vari-
ables: anxiety, stress, and perceived control over pain. Following Collins et al.’s typol-
ogy, the anxiety scale is a “category A” auxiliary variable, because it predicts missing-
ness and uniquely correlates with depression scores. Stress and perceived control over
pain can be considered “category B” variables, because they correlate with the analysis
variables but do not predict their missingness. Note that these classifications are not
perfect, because the patterns of correlations differ across variables (e.g., anxiety is a
“category C” variable for the severe pain dummy code, because it predicts missingness
but does not uniquely correlate with pain severity ratings).
Considered as a whole, the analysis results in this section offer a simple prescrip-
tion: Estimate the regression model in a way that conditions on three extraneous vari-
ables that would not have appeared in the analysis had the data been complete. Doing so
makes the conditionally MAR process more plausible and could improve power. Select-
ing additional variables based on their semipartial correlations could identify more vari-
ables than are necessary, because these bivariate associations ignore collinearity among
candidate auxiliary variables. With few exceptions (e.g., an excessively large number

TABLE 1.3. Semipartial Correlations between Analysis Variables and Six Candidate Auxiliary Variables
Variable Est. SE z p
Depression|Interference and Severe Pain
Age –.20 .06 –3.31 < .001
Exercise Frequency –.12 .05 –2.35 .02
Anxiety .52 .05 10.84 < .001
Stress .46 .05 9.44 < .001
Control –.23 .05 –4.27 < .001
Disability .32 .06 5.61 < .001

Interference|Depression and Severe Pain


Age .04 .06 0.73 .47
Exercise Frequency –.20 .06 –3.59 < .001
Anxiety .06 .05 1.16 .25
Stress .02 .05 0.31 .75
Control –.30 .05 –5.64 < .001
Disability .12 .06 2.00 .05

Severe Pain|Depression and Interference


Age .05 .06 0.86 .39
Exercise Frequency –.13 .05 –2.53 .01
Anxiety –.05 .06 –0.85 .39
Stress .04 .06 0.73 .46
Control .01 .05 0.14 .89
Disability .09 .06 1.59 .11

of auxiliary variables, a peculiar pattern of associations; Hardt, Herke, & Leonhart,


2012; Thoemmes & Rose, 2014), there is usually no harm in casting a broad net and
being overly inclusive, but you may need to restrict the size of the auxiliary set if the
number of candidate variables is very large (as mentioned previously, some methods for
introducing auxiliary variables are prone to convergence failures). My own experience
suggests the payoff for adopting an inclusive analysis strategy is somewhat variable;
leveraging additional variables sometimes produces noticeable changes in the estimates
and standard errors, and other times it doesn’t.

1.7 OLDER MISSING DATA METHODS

I’ve repeatedly referenced the analytic trio that forms the basis of this book: maximum
likelihood, Bayesian estimation, and multiple imputation. These methods have been the
“state of the art” for some time (Schafer & Graham, 2002), because they are capable of
producing valid estimates and inferences in a wide range of applications. The literature
describes numerous other approaches to missing data problems, some of which have
enjoyed widespread use, while others are now little more than a historical footnote.
This section describes a small collection of strategies you may still encounter in pub-
lished research articles or statistical software packages: listwise and pairwise deletion,
arithmetic mean imputation, regression imputation, stochastic regression imputation,
and last observation carried forward imputation. These methods deal with missing data
either by removing cases or by filling in the missing values with a single set of replace-
ment scores (a process known as single imputation). Except for stochastic regression
imputation, these methods are potentially problematic, because they invoke restrictive
assumptions about the missing data process or introduce bias regardless of mechanism.
In contrast, stochastic regression imputation gives valid estimates with a conditionally
MAR process, but it inappropriately shrinks standard errors. I return to the artificial
data in Figure 1.5 to illustrate these older approaches. To refresh, the scatterplot depicts
a conditionally MAR process where participants with low perceived control over their
pain were more likely to have missing depression scores.

Listwise and Pairwise Deletion


Listwise deletion (also known as complete-­case analysis) discards the data for any case
that has one or more missing values. The primary benefit of this approach is conve-
nience, as restricting analyses to the complete cases eliminates the need for special-
ized software. In contrast, pairwise deletion (also known as available-­case analysis)
mitigates the loss of data by eliminating data records on an analysis-­by-­analysis basis;
a prototypical example is a correlation matrix with each of its elements estimated from
a different subsample of cases. Reviews of published research articles suggest that dele-
tion methods are quite common (Bodner, 2006; Jeličić, Phelps, & Lerner, 2009; Peugh
& Enders, 2004; Wood, White, & Thompson, 2004), despite being characterized as
being “among the worst methods available for practical applications” (Wilkinson &
Task Force on Statistical Inference, 1999, p. 598).
Deletion methods have two important shortcomings: They reduce power and require
an unsystematic MCAR mechanism where missingness is unrelated to the data. To illus-
trate the impact of the missing data process, reconsider the artificial data in Figure 1.5.
The black crosshairs denote partial data records with perceived control scores but no
depression values. Figure 1.9 shows the scatterplot after removing the observa-
tions with missing depression scores. The gray contour rings convey the perspective of
a drone hovering over the peak of the bivariate normal population data. As you can see,
the complete score pairs are not dispersed throughout the entire range of the contour
rings, and the data overrepresent the lower right quadrant of the population distribution
and underrepresent the upper left quadrant. As a result, the mean of the complete cases
(the black dot at the center of the data) is too high along the horizontal axis (perceived
control over pain) and too low along the vertical axis (depression). Not surprisingly, the
systematic absence of scores from one area of the contour plot also restricts variation
and distorts measures of association.
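The operational difference between the two deletion schemes is easy to see with a toy data set (the scores below are invented; None marks a missing value). Listwise deletion estimates everything from the complete records, while pairwise deletion lets each estimate use whatever cases are available for it:

```python
import statistics

# Toy records: (perceived_control, depression); None marks a missing score
records = [(10.0, 20.0), (40.0, None), (20.0, 14.0),
           (50.0, None), (30.0, 8.0)]

# Listwise deletion: drop any record with one or more missing values
complete = [(x, y) for x, y in records if x is not None and y is not None]
listwise_x = statistics.mean(x for x, _ in complete)
listwise_y = statistics.mean(y for _, y in complete)

# Pairwise deletion: each estimate uses its own available cases,
# so the mean of x is based on all five records here
pairwise_x = statistics.mean(x for x, _ in records if x is not None)
pairwise_y = statistics.mean(y for _, y in records if y is not None)
```

In this toy example the two approaches give different means for perceived control (20.0 versus 30.0) because listwise deletion throws away two perfectly good control scores; neither approach, of course, corrects the selection bias illustrated in Figure 1.9.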
While the literature generally derides deletion methods, there are a few situations
where a complete-­case analysis is ideal. One such scenario occurs with linear regression
models where missing values are relegated to the outcome and missingness is due to the

FIGURE 1.9. Scatterplot showing data points that remain after applying listwise deletion to
an MAR process where 50% of the depression scores are missing for participants with lower
perceived control over pain. The black circle denotes the means of the complete observations.

predictors, in which case deleting incomplete data records gives the optimal maximum
likelihood estimates (Glynn & Laird, 1986; Little, 1992; von Hippel, 2007). The situa-
tion is more complicated with incomplete predictors, but deletion generally works well
if missingness is unrelated to the dependent variables. This includes an MAR process
where a covariate is missing as a function of another predictor, as well as an MNAR
mechanism where missingness is related to the would-be values of a covariate (White
& Carlin, 2010). A complete-case analysis can also provide optimal estimates of logistic
regression slope coefficients in a more limited number of scenarios (Vach, 1994; van
Buuren, 2012, p. 48).

Arithmetic Mean Imputation


Arithmetic mean imputation (also known as mean substitution) is a single imputation
approach that fills in a variable’s missing values with the average of its complete scores.
This method has no theoretical justification and distorts parameter estimates under any
missing data process. To illustrate why this is the case, Figure 1.10 shows the scatter-
plot of the artificial data after filling in the missing depression scores with an average
of the observed scores. The gray circles in the plot represent the complete data, and the
black crosshairs along a horizontal line denote score pairs with imputed data. Mean

FIGURE 1.10. Scatterplot showing the data that result from applying arithmetic mean impu-
tation to an MAR process where 50% of the depression scores are missing for participants with
lower perceived control over pain. The black crosshairs denote data records with perceived con-
trol scores and imputed depression values.

imputation recoups the full set of perceived control scores, but it does a terrible job of
preserving the depression distribution. As you might expect, imputing missing scores
with values at the center of the distribution artificially reduces variability and attenu-
ates measures of association (mathematically, each missing value contributes a zero to
the sum of squares and sum of cross-products terms). If you focus on just the imputed
score pairs, you’ll notice that their correlation necessarily equals 0, because depression
scores are constant. As such, you can think of mean imputation as filling in the data
with scores that have no variation and no correlation with other variables. If you were
going to be stranded on a desert island with only one missing data-handling procedure
in your analytic suitcase, this is not the one you’d choose for your 3-hour tour.
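The variance shrinkage is simple to verify numerically. In this minimal sketch (invented scores, with None marking a missing value), filling in the mean leaves the average unchanged but collapses the spread:

```python
import statistics

def mean_impute(scores):
    """Replace None entries with the arithmetic mean of the observed scores."""
    observed = [s for s in scores if s is not None]
    fill = statistics.mean(observed)
    return [fill if s is None else s for s in scores]

depression = [1.0, 5.0, None, None]
filled = mean_impute(depression)                  # [1.0, 5.0, 3.0, 3.0]
var_observed = statistics.variance([1.0, 5.0])    # 8.0
var_filled = statistics.variance(filled)          # 8/3: spread shrinks
```

Each imputed value sits at the center of the distribution and contributes nothing to the sum of squares, which is exactly why the estimated variance (and any covariance involving the imputed variable) is attenuated.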
A popular variation of mean imputation appears with questionnaire data where
multiple items tap into different aspects of the same construct. For example, the contin-
uous depression scores in the previous scatterplots result from summing item responses
measuring sadness, lack of motivation, sleep difficulties, feelings of low self-worth, and
so on. A common way to deal with item-level missing data is to compute a prorated
scale score that averages the available item responses. For example, if a participant
answered four out of six depression items, the prorated scale score would be the aver-
age of just four responses. The missing data literature often describes this procedure as
person mean imputation, because it is equivalent to imputing missing item responses


with the average of each participant’s observed scores (Huisman, 2000; Peyre, Leplege,
& Coste, 2011; Roth, Switzer, & Switzer, 1999; Sijtsma & van der Ark, 2003). Like its
between-­person counterpart, within-­person mean imputation has serious limitations
that should deter researchers from using it. In particular, the method assumes an unsys-
tematic missingness process and requires that all intrascale means and correlations are
the same (Graham, 2009; Mazza, Enders, & Ruehlman, 2015; Schafer & Graham, 2002).
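The proration procedure itself amounts to averaging whatever item responses are present. A minimal sketch follows (the min_items threshold is my addition for illustration; None marks an unanswered item):

```python
def prorated_scale_score(item_responses, min_items=1):
    """Average the available item responses (person mean imputation).
    Returns None when fewer than min_items items were answered."""
    answered = [r for r in item_responses if r is not None]
    if len(answered) < min_items:
        return None
    return sum(answered) / len(answered)
```

For a participant who answered four of six depression items, the prorated score is the mean of those four responses; the method's implicit assumption that all intrascale means and correlations are equal is what makes it hazardous.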

Regression Imputation
Regression imputation (also known as conditional mean imputation) replaces miss-
ing values with predicted scores from a regression equation. Regression imputation has
a long history that dates back more than 60 years (Buck, 1960), and the basic idea is
intuitively appealing: Variables tend to be correlated, so replacing missing values with
predicted scores borrows important information from the observed data. Although this
idea makes good sense, the resulting imputations can introduce substantial bias. The
nature and magnitude of these biases depend on the missing data mechanism and vary
across different estimands.
Regression imputation requires regression models that predict the incomplete vari-
ables from the complete variables. A complete-­case analysis can generate the necessary
estimates, as can maximum likelihood estimation (e.g., so-­called “EM imputation”; von
Hippel, 2004). Returning to the artificial data in Figure 1.5, imputation requires the
regression of depression on perceived control. The following equation generates the pre-
dicted scores that serve as imputations:

DEPRESSi(mis) = γ̂0 + γ̂1 ( CONTROLi ) (1.13)

I use the γ symbol throughout the book to reference coefficients that are not part of the
focal analysis, and the γ’s in this equation are meant to emphasize that the regression
model is a device for imputing the data. The focal analysis could be something entirely
different (e.g., a correlation; the regression of perceived control on depression). The logic
of regression imputation is largely the same with multivariate data, but the procedure
is more cumbersome to implement, because each missing data pattern requires its own
regression equation.
Figure 1.11 shows the scatterplot of the artificial data after filling in the missing
depression scores with predicted values, with gray circles again representing the complete
cases and black crosshairs denoting score pairs with imputed data. As you can see, the
procedure recoups the full data set, but it does a subpar job of preserving the depression
distribution. In particular, the imputed values lack variation, because they fall directly on
the regression line. This feature also implies that the imputed score pairs have a correla-
tion equal to 1. In effect, regression imputation suffers from the opposite problem as mean
imputation, because it replaces missing values with perfectly correlated scores.
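The bivariate case of Equation 1.13 can be sketched in a few lines (invented scores, helper names mine): fit the imputation regression to the complete cases, then fill each missing outcome with its predicted value. The imputed values land exactly on the regression line, which is the source of the problem.

```python
import statistics

def fit_simple_regression(x, y):
    """Complete-case estimates for the imputation model y = g0 + g1 * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    g1 = (sum((a - mx) * (b - my) for a, b in zip(x, y)) /
          sum((a - mx) ** 2 for a in x))
    g0 = my - g1 * mx
    return g0, g1

def regression_impute(x, y):
    """Fill missing y values (None) with predicted scores from the line."""
    complete = [(a, b) for a, b in zip(x, y) if b is not None]
    g0, g1 = fit_simple_regression([a for a, _ in complete],
                                   [b for _, b in complete])
    return [g0 + g1 * a if b is None else b for a, b in zip(x, y)]

control = [1.0, 2.0, 3.0, 4.0, 5.0]
depress = [2.0, 4.0, 6.0, None, None]
imputed = regression_impute(control, depress)  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

Because every imputed score is an exact linear function of the predictor, the filled-in pairs correlate perfectly and contribute no residual variation, overstating associations and understating spread.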
As mentioned previously, a complete-­case analysis or maximum likelihood estima-
tion can generate the coefficients for regression imputation. The latter option warrants a
brief discussion, because it often confuses researchers into thinking they are applying a
more sophisticated procedure than they are. This so-­called “EM imputation” procedure

FIGURE 1.11. Scatterplot showing the data that result from applying regression imputation
to an MAR process where 50% of the depression scores are missing for participants with lower
perceived control over pain. The black crosshairs denote data records with perceived control
scores and imputed depression values.

first uses maximum likelihood estimation (via the expectation maximization, or EM


algorithm) to estimate the mean vector and covariance matrix. So far, so good, as these
estimates are accurate if scores are conditionally MAR. The problem arises in the next
step, where the procedure uses elements in μ̂ and Σ̂ to construct regression equations
that replace the missing observations with predicted values like those in Figure 1.11.
Researchers sometimes characterize this method as maximum likelihood estimation
when all they are really doing is using maximum likelihood to get an accurate regres-
sion equation with which to destroy the data. Interested readers can consult von Hippel
(2004) for a thorough take-down of this approach, which is available in the SPSS Miss-
ing Values Analysis module, among others.

Stochastic Regression Imputation


Stochastic regression imputation also uses regression equations to predict incomplete
variables from complete variables, but it takes the additional step of augmenting each
predicted score with a random noise term from a normal distribution. Adding these
residuals to the predicted values restores lost variability to the data and effectively elimi-
nates the biases associated with standard regression imputation schemes. In fact, sto-
chastic regression imputation is the only procedure in this section that is generally capa-
ble of producing unbiased parameter estimates when scores are conditionally MAR. As
you will see later in the book, the core idea behind stochastic regression imputation—an
imputation equals predicted value plus noise—resurfaces with Bayesian estimation and
multiple imputation. These procedures use iterative algorithms to generate imputations
over many alternate estimates of regression model parameters, but they are fundamen-
tally sophisticated relatives of stochastic regression imputation.
Applying stochastic regression imputation to the bivariate data in Figure 1.6 again
requires the regression of depression on perceived control. The residual variance from
this regression plays an important role, because it defines the spread of the random
noise terms. As before, substituting a participant’s observed data into the right side of
a regression equation gives the predicted value of the missing data point. Next, Monte
Carlo computer simulation creates a synthetic residual term by drawing a random num-
ber from a normal distribution with a mean equal to 0 and spread equal to the residual
variance estimate. Each imputation is then the sum of a predicted value and random
noise term.

$$\mathit{DEPRESS}_{i(\mathrm{mis})} = \hat{\gamma}_0 + \hat{\gamma}_1\left(\mathit{CONTROL}_i\right) + \dot{\varepsilon}_i \tag{1.14}$$

$$\dot{\varepsilon}_i \sim N_1\!\left(0,\, \hat{\sigma}^2_{\varepsilon}\right)$$
The bottom row of the expression says that residuals are sampled from a univariate
normal curve, and the dot accent on ε̇i indicates that this is a synthetic value created by
Monte Carlo computer simulation.
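The two steps in Equation 1.14 can be sketched in a few lines of NumPy. This is a minimal illustration, not production imputation software; the function and variable names are my own, not from the book.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def stochastic_regression_impute(x, y, rng):
    """Fill missing y entries with a predicted value plus random noise.

    x is fully observed; missing y entries are marked with np.nan.
    """
    obs = ~np.isnan(y)

    # Estimate the regression of y on x from cases with observed y.
    slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
    residuals = y[obs] - (intercept + slope * x[obs])
    resid_var = residuals.var(ddof=2)  # residual variance, df = n - 2

    # Imputation = predicted value + draw from N(0, residual variance).
    y_filled = y.copy()
    noise = rng.normal(0.0, np.sqrt(resid_var), size=(~obs).sum())
    y_filled[~obs] = intercept + slope * x[~obs] + noise
    return y_filled
```

Only the `np.nan` entries change; observed scores pass through untouched, and the added noise restores the lost variability described above.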
I previously introduced the possibility of drawing replacement scores from a nor-
mal curve, and Figure 1.6 shows the distribution of plausible imputations at three values
of perceived control over pain. Candidate imputations fall exactly on the vertical hash-
marks, but I added horizontal jitter to emphasize that more scores are located at higher
contours near the regression line. Randomly selecting one of the circles from each dis-
tribution would generate an imputed depression score (technically, imputations are not
restricted to the circles displayed in the graph and could be selected from anywhere in
the normal distribution).
Figure 1.12 shows the scatterplot of the artificial data after filling in the miss-
ing depression scores with stochastic regression imputes. As before, the gray contour
rings convey the location and elevation of the bivariate normal population distribution.
Unlike the other approaches in this section, stochastic regression imputation disperses
imputations throughout the entire contour plot and doesn’t over- or underrepresent cer-
tain areas of the distribution. Comparing the plot to the hypothetically complete data
set in Figure 1.5, the filled-­in values look like good surrogates, because they preserve the
center and spread of the depression scores, as well as their correlation with perceived
control over pain. Although analyzing a stochastically imputed data set can provide
accurate parameter estimates if values are MAR, doing so artificially shrinks standard
errors and distorts significance tests; statistical software applications incorrectly treat
imputes as real data when computing measures of uncertainty, such that standard errors
reflect the hypothetical sampling variation that would have resulted had the data been
complete. Pairing stochastic regression imputation with bootstrap resampling (Efron,
1987; Efron & Gong, 1983; Efron & Tibshirani, 1993) is one option for estimating mea-
sures of uncertainty (see Chapter 2), and generating and analyzing multiple sets of impu-
tations is another (see Chapter 7).

FIGURE 1.12. Scatterplot showing the data that result from applying stochastic regression
imputation to an MAR process where 50% of the depression scores are missing for participants
with lower perceived control over pain. The black crosshairs denote data records with perceived
control scores and imputed depression values.

Last Observation Carried Forward


Last observation carried forward is a missing data technique for longitudinal designs
with incomplete repeated measurements. The procedure is relatively rare in the behav-
ioral and social sciences and is more common in medical studies and clinical tri-
als (Wood et al., 2004). As its name implies, last observation carried forward imputes
repeated measurements with scores from the prior measurement occasion. For example,
if a participant drops out after the fifth week of an 8-week study, the fifth week’s score
replaces all subsequent observations. To illustrate, Table 1.4 shows four waves of hypo-
thetical depression scores for five participants, with imputed scores shown in bold type-
face. As you can see, the prior measurement occasions “carry forward” regardless of
whether a participant permanently attrits (e.g., the first and third data records) or has
intermittent missing values (e.g., the fourth data record).
TABLE 1.4. Imputed Data from Last Observation Carried Forward
ID Wave 1 Wave 2 Wave 3 Wave 4
Observed data
1 25 28 — —
2 22 21 24 26
3 18 — — —
4 30 — 31 34
5 20 20 22 21

Imputed data
1 25 28 28 28
2 22 21 24 26
3 18 18 18 18
4 30 30 31 34
5 20 20 22 21
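Table 1.4's carry-forward rule is a one-line operation in pandas. The sketch below reproduces the table; the column names are illustrative, not from the book.

```python
import pandas as pd

# Observed scores from Table 1.4; None marks a missing repeated measurement.
observed = pd.DataFrame(
    {"wave1": [25, 22, 18, 30, 20],
     "wave2": [28, 21, None, None, 20],
     "wave3": [None, 24, None, 31, 22],
     "wave4": [None, 26, None, 34, 21]},
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

# Carry each participant's last observed score forward across waves.
imputed = observed.ffill(axis=1)
print(imputed)
```

A forward fill along the rows handles both permanent attrition (record 3) and intermittent missingness (record 4), exactly as in the bottom panel of the table.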

Last observation carried forward effectively assumes no change after the final obser-
vation or during the intermittent period where scores are missing. The conventional wis-
dom is that imputing the data with stable scores yields a conservative estimate of treat-
ment group differences at the end of a study. However, empirical research shows that this
isn’t necessarily true, as the method can also exaggerate group differences (Cook, Zeng,
& Yi, 2004; Liu & Gould, 2002; Mallinckrodt, Clark, & David, 2001; Molenberghs et al.,
2004). The direction and magnitude of the bias depend on specific characteristics of the
data, but the approach is likely to produce distorted parameter estimates, even with an
unsystematic missingness process (Molenberghs et al., 2004). Suffice it to say, there are
much better strategies for dealing with longitudinal missing data.

1.8 COMPARING MISSING DATA METHODS VIA SIMULATION

The previous scatterplots suggest that older missing data methods can misrepresent
distributions in ways that almost certainly introduce bias. Monte Carlo computer simu-
lations can reveal how the tendencies depicted in the graphs unfold over many different
samples and across different estimands. To this end, I used a series of simulation studies
to compare listwise deletion, arithmetic mean imputation, regression imputation, and
stochastic regression imputation to a “gold standard” maximum likelihood estimator
for missing data. As mentioned previously, maximum likelihood missing data handling
leverages the entire sample’s observed data without discarding any information. The
other “gold standards,” Bayesian estimation and multiple imputation, are equivalent in
this case (Collins et al., 2001; Schafer, 2003).
The first step of a computer simulation is to specify a set of hypothetical parameter
values. Recycling the parameters that created the artificial depression and perceived con-
trol over pain data in the previous scatterplots helps visualize the procedure. Returning
to Figure 1.2, the contour rings convey the perspective of a drone hovering over the peak
of the bivariate normal population distribution, and the gray circles are an artificial
sample of hypothetically complete data. The next step generates many artificial data
sets from the population. Researchers often ask whether contemporary approaches like
maximum likelihood can be used with small samples or large amounts of missing data.
To examine this issue, I programmed a simulation that created 1,000 random samples of
N = 100 from the bivariate normal population, and I deleted 50% of the artificial depres-
sion scores following one of the missing data mechanisms. The missing completely at
random process mimicked Figure 1.4, the conditionally MAR mechanism followed Fig-
ure 1.5, and the MNAR process mirrored Figure 1.7. After deleting scores, I used different
missing data-­handling methods to estimate three sets of parameters: the mean vector and
variance–­covariance matrix, coefficients from the regression of Y on X (e.g., perceived
control over pain predicting depression), and coefficients from the regression of X on Y
(e.g., depression predicting perceived control over pain). Any discrepancy between the
average estimates and their true values reflects systematic nonresponse bias.
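The simulation loop just described can be sketched in a few lines of Python. This is a simplified stand-in for the full study (200 replications rather than 1,000, and only two of the methods), using the population parameters from the scatterplot examples.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Population parameters for perceived control (X) and depression (Y).
mean = [20.0, 15.0]
cov = [[25.0, -12.65],
       [-12.65, 40.0]]

n, n_reps = 100, 200
listwise_means, mean_impute_corrs = [], []

for _ in range(n_reps):
    x, y = rng.multivariate_normal(mean, cov, size=n).T

    # MCAR: delete 50% of the Y scores completely at random.
    miss = rng.random(n) < 0.5

    # Listwise deletion estimate of the Y mean.
    listwise_means.append(y[~miss].mean())

    # Arithmetic mean imputation, then the X-Y correlation.
    y_imp = np.where(miss, y[~miss].mean(), y)
    mean_impute_corrs.append(np.corrcoef(x, y_imp)[0, 1])

# Averaging over replications approximates the long-run expectation.
print(round(float(np.mean(listwise_means)), 2))    # unbiased: near 15
print(round(float(np.mean(mean_impute_corrs)), 2)) # about -.28, as in Table 1.5
```

Even this miniature version reproduces the key pattern: deletion recovers the mean under MCAR, while mean imputation attenuates the correlation toward zero.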

Missing Completely at Random


The first simulation modeled a missing (always) completely at random mechanism
where missingness on Y (e.g., depression) was independent of the data. Table 1.5 shows
the average parameter estimates for each method along with their true values. The esti-
mates in bold typeface differ from their true values by more than 10%. Missing data

TABLE 1.5. Average Parameter Estimates from the MCAR Computer Simulation
Parameter True value LWD AMI RI SRI FIML
Means, variances, covariances
μX 20.00 20.02 20.02 20.02 20.02 20.02
μY 15.00 14.98 14.98 14.99 15.01 14.99
σX2 25.00 24.86 24.91 24.91 24.91 24.66
σX⋅Y –12.65 –12.83 –6.33 –12.87 –12.88 –12.74
σY2 40.00 40.44 19.95 23.75 40.40 39.75
ρX⋅Y –.40 –.40 –.28 –.52 –.40 –.40

Regression of Y on X
β0 25.12 25.31 20.06 25.31 25.36 25.31
β1 (X) –0.51 –0.52 –0.25 –0.52 –0.52 –0.52
σε2 33.60 33.75 18.30 16.48 33.12 32.39

Regression of X on Y
γ0 24.74 24.76 24.76 28.06 24.80 24.82
γ1 (Y) –0.32 –0.32 –0.32 –0.54 –0.32 –0.32
σr2 21.00 20.77 22.92 17.69 20.56 20.19

Note. LWD, listwise deletion; AMI, arithmetic mean imputation; RI, regression imputation;
SRI, stochastic regression imputation; FIML, full-information maximum likelihood.
theory predicts that listwise deletion, stochastic regression imputation, and maximum
likelihood estimation are unbiased in large samples. The simulation bears this out, as
the average estimates are effectively identical to the true population parameters, even
with a small sample size and 50% missing data. As you might expect, mean imputation
and regression imputation were prone to substantial biases. To illustrate, the solid curve
in Figure 1.13 shows the sampling distribution of the correlation estimates for regres-
sion imputation, and the dashed curve shows the corresponding distribution for mean
imputation. Neither method did a good job of recovering the population correlation,
as the true value (the vertical line) was in the tails of both distributions. Although the
presence and magnitude of the biases varied across estimands, the simulation results
provide no support for these approaches on balance.
Although deletion appears to be just as good as maximum likelihood, leveraging the
full sample’s observed data generates estimates that are more precise, with less variation
across samples. The precision difference is dramatic for some estimands and modest for
others. To illustrate, the solid curve in Figure 1.14 is a kernel density plot displaying
the sampling distribution of the maximum likelihood mean estimates, and the dashed
curve shows the corresponding distribution for listwise deletion. As you can see, both
distributions are centered at the true value of 20, but the maximum likelihood estimates
are substantially closer to the truth, on average (e.g., the peak of the solid curve is higher
at the true value and its tails are less thick). As a second example, Figure 1.15 shows the
sampling distributions of the covariance. Maximum likelihood is again more precise,
but the difference is quite modest.
FIGURE 1.13. Kernel density plots of the correlation estimates from the MCAR computer sim-
ulation. The solid curve shows the sampling distribution of the regression imputation estimates,
and the dashed curve shows the corresponding mean imputation estimates. Neither distribution
is centered at the true value of –.40, indicating substantial nonresponse bias.
FIGURE 1.14. Kernel density plots of the X mean estimates from the MCAR computer simula-
tion. The solid curve shows the sampling distribution of the maximum likelihood estimates, and
the dashed curve shows the corresponding deletion estimates. Both distributions are centered at
the true value of 20, but the maximum likelihood estimates are substantially closer to the true
value, on average.
FIGURE 1.15. Kernel density plots of the covariance estimates from the MCAR computer
simulation. The solid curve shows the sampling distribution of the maximum likelihood esti-
mates, and the dashed curve shows the corresponding deletion estimates. Both distributions are
centered at the true value of –12.65, but the maximum likelihood estimates are slightly closer to
the true value, on average.


Missing at Random
The second simulation, which mimicked Figure 1.5, modeled a missing (always) at ran-
dom mechanism where the probability of a missing Y score increased as the value of X
decreased (e.g., depression scores were more likely to be missing for participants with
low perceived control over pain). Table 1.6 shows the average parameter estimates for
each method, along with their true values. Following the first simulation, mean impu-
tation and regression imputation estimates were prone to bias, and the results offer no
support for these procedures. A systematic missingness process was generally detrimen-
tal to the listwise deletion estimates as well. The notable exception was the regression
of Y on X, where complete-­case analysis gives optimal estimates when missingness does
not depend on the outcome variable (Glynn & Laird, 1986; Little, 1992; von Hippel,
2007; White & Carlin, 2010). Finally, missing data theory again predicts that maximum
likelihood estimation and stochastic regression imputation should be unbiased in large
samples, and they are virtually so here. These results are consistent with published sim-
ulation studies showing that the percentage of missing data is not a strong determinant
of bias provided that the presumed mechanism is correct (Madley-­Dowd, Hughes, Tilling,
& Heron, 2019). Although stochastic regression imputation gave point estimates equiva-
lent to those from maximum likelihood, its standard errors and significance tests are untrust-
worthy without corrective procedures like the bootstrap.

TABLE 1.6. Average Parameter Estimates from the MAR Computer Simulation
Parameter True values LWD AMI RI SRI FIML
Means, variances, covariances
μX 20.00 22.51 19.99 19.99 19.99 19.99
μY 15.00 13.74 13.74 15.01 15.02 15.01
σX2 25.00 18.64 25.11 25.11 25.11 24.86
σX⋅Y –12.65 –9.42 –4.66 –12.65 –12.66 –12.52
σY2 40.00 38.44 19.03 23.62 40.05 39.57
ρX⋅Y –.40 –.35 –.21 –.51 –.40 –.40

Regression of Y on X
β0 25.12 25.09 17.46 25.09 25.10 25.09
β1 (X) –0.51 –0.50 –0.19 –0.50 –0.50 –0.50
σε2 33.60 33.76 18.20 16.54 32.98 32.40

Regression of X on Y
γ0 24.74 25.89 23.37 27.99 24.79 24.77
γ1 (Y) –0.32 –0.25 –0.25 –0.53 –0.32 –0.32
σr2 21.00 16.32 24.04 18.03 20.79 20.46

Note. LWD, listwise deletion; AMI, arithmetic mean imputation; RI, regression imputation;
SRI, stochastic regression imputation; FIML, full-information maximum likelihood.
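A quick simulation illustrates the exception noted above: when missingness depends only on X, complete-case analysis biases the means but leaves the regression of Y on X intact. This sketch uses a deterministic median-split deletion rule rather than the probabilistic rule behind Table 1.6, so the exact bias values differ.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
mean = [20.0, 15.0]
cov = [[25.0, -12.65],
       [-12.65, 40.0]]

x_means, slopes = [], []
for _ in range(500):
    x, y = rng.multivariate_normal(mean, cov, size=100).T

    # MAR: delete Y for the half of the sample with the lowest X scores
    # (a deterministic stand-in for the probabilistic rule in the text).
    miss = x < np.median(x)
    x_c, y_c = x[~miss], y[~miss]

    x_means.append(x_c.mean())                 # complete-case mean of X
    slopes.append(np.polyfit(x_c, y_c, 1)[0])  # complete-case slope of Y on X

print(round(float(np.mean(x_means)), 2))  # biased well above the true 20
print(round(float(np.mean(slopes)), 2))   # close to the true -0.51
```

Selecting cases on the predictor distorts the marginal distribution of X but not the conditional distribution of Y given X, which is why the slope survives.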
Missing Not at Random


The final simulation, which mirrored Figure 1.7, modeled a missing (always) not at
random mechanism where the probability of a missing Y score increased as the value
of Y itself increased (e.g., depression scores were more likely to be missing for partici-
pants with high levels of depression). Table 1.7 shows the average parameter estimates
for each method, along with their true values. As you can see, all methods produced
biased estimates of one or more estimands. Consistent with the MAR simulation, dele-
tion gave accurate estimates of the regression of X on Y, because missingness did not
depend on the outcome (White & Carlin, 2010). Maximum likelihood and stochastic
regression imputation estimates were similarly accurate for that model but exhibited
predictable biases in other analyses. Conditioning on auxiliary variables could improve
the situation a little bit, but the only way to counteract nonresponse bias from a focused
MNAR process like this one is to adopt a specialized analysis that introduces a nui-
sance model for missingness (e.g., a selection model or pattern mixture model). To
illustrate one such approach, I used maximum likelihood estimation to fit a selection
model that introduces an additional regression, with Y predicting its own missingness
(the true data-­generating model). The rightmost column of Table 1.7 shows that a selec-
tion model can effectively eliminate bias, but achieving that payoff requires a correctly
specified nuisance model. Chapter 9 describes analysis models for MNAR processes in
more detail.

TABLE 1.7. Average Parameter Estimates from the MNAR Computer Simulation
Parameter True values LWD AMI RI SRI FIML FIML selection
Means, variances, covariances
μX 20.00 20.00 20.02 20.02 20.02 20.02 20.02
μY 15.00 14.97 14.97 14.97 14.97 14.97 14.87
σX2 25.00 24.17 25.19 25.19 25.19 24.94 24.94
σX⋅Y –12.65 –9.61 –4.77 –10.06 –10.10 –9.96 –13.17
σY2 40.00 30.04 14.93 17.35 30.17 29.71 42.06
ρX⋅Y –.40 –.36 –.25 –.48 –.36 –.36 –.39

Regression of Y on X
β0 25.12 22.97 18.77 22.97 23.00 22.97 25.47
β1 (X) –0.51 –0.40 –0.19 –0.40 –0.40 –0.40 –0.53
σε2 33.60 26.23 14.02 12.90 25.67 25.18 34.64

Regression of X on Y
γ0 24.74 24.80 24.82 28.61 25.03 25.04 24.94
γ1 (Y) –0.32 –0.32 –0.32 –0.58 –0.34 –0.34 –0.33
σr2 21.00 21.14 23.71 19.08 21.55 21.21 20.35

Note. LWD, listwise deletion; AMI, arithmetic mean imputation; RI, regression imputation; SRI, stochastic regression
imputation; FIML, full-information maximum likelihood.
1.9 PLANNED MISSING DATA

The remainder of the chapter describes planned missing data designs that introduce
intentional missing values as a device for reducing respondent burden or lowering
research costs. The thought of intentionally creating missing values might seem odd at
first, but you are probably already familiar with the idea. For example, in a randomized
study with two treatment conditions, everyone has a hypothetical score from both con-
ditions, but participants only provide a response to their assigned condition. The unob-
served response to the other condition—­the potential outcome or counterfactual—­is
missing completely at random. Viewing randomized experiments as a missing data
problem is popular in the statistics literature and is a key component of Rubin’s causal
inference framework (Rubin, 1974; West & Thoemmes, 2010). The fractional factorial
(Montgomery, 2020) is another research design that yields MCAR values. With this
design, you purposefully select a subset of experimental conditions from a full facto-
rial scheme and randomly assign participants to a restricted combination of conditions.
Carefully omitting certain design cells saves resources by eliminating higher-­order
effects that are unlikely to be present in the data. Finally, planned missingness designs
have long been a staple in educational testing applications, where examinees are admin-
istered a subset of test questions from a larger item bank (Johnson, 1992; Lord, 1962).
You likely encountered a variant of this approach if you took the Graduate Record Exam.
The advent of sophisticated missing data-­handling methods prompted the devel-
opment of planned missingness designs that use intentional missing values to address
logistical and budgetary constraints (Graham, Taylor, & Cumsille, 2001; Graham
et al., 2006; Little & Rhemtulla, 2013; Raghunathan & Grizzle, 1995; Rhemtulla &
­Hancock, 2016; Rhemtulla & Little, 2012; Silvia, Kwapil, Walsh, & Myin-­Germeys,
2014). I describe three such designs in this section: multiform designs for questionnaire
data, wave missing data designs for longitudinal studies, and two-­method measurement
designs that pair expensive and inexpensive measures of a construct. Importantly, these
designs cannot introduce bias, because they create patterns of unsystematically missing
values. Of course, introducing missing data necessarily reduces power, but the loss of
precision is surprisingly low in many cases.

Multiform Designs
Multiform planned missingness designs are most often associated with studies that use
lengthy surveys that comprise several questionnaires and many items. Respondent bur-
den is a major concern in these settings, because the number of items that people can
reasonably answer in a single sitting is limited. A multiform design addresses this issue
by administering multiple questionnaire forms that comprise different subsets of vari-
ables. For example, the classic three-form design (Graham et al., 1996, 2006) distributes
variables into four blocks (X, A, B, and C) that are allocated across three different ques-
tionnaire forms. Each form includes the X set and is missing the A, B, or C set. Table 1.8
shows the distribution of four blocks across the three forms, with O’s denoting observa-
tions and M’s indicating missing values, and Figure 1.1d shows a graphical schematic
of the design. Supposing that each variable set contains 25 questionnaire items, then
TABLE 1.8. Three‑Form Design
Variable set
Form X A B C
1 O M O O
2 O O M O
3 O O O M

Note. O, observed; M, missing.

survey length is reduced by 25% and participants respond to 75 rather than 100 ques-
tions. Multiform designs readily extend to include additional variable sets as needed.
For example, Table 1.9 shows a six-form design from Rhemtulla and Little (2012) where
respondents provide data on three out of five blocks, and Raghunathan and Grizzle
(1995) and Graham et al. (2006) describe designs with even more forms.
The main downside to multiform designs (and planned missingness designs in gen-
eral) is a reduction in statistical power. The impact of missing data on power and preci-
sion is complex and depends on the type of model and parameter being estimated (e.g.,
models with latent vs. manifest variables; correlations vs. regression slopes), as well as
the effect sizes within and between blocks (Rhemtulla, Savalei, & Little, 2016). Looking
at the percentage of observed responses for each variable or variable pair (sometimes
called covariance coverage) provides some insight. To illustrate, Table 1.10 shows the
covariance coverage rates for a three-form design with eight variables distributed equally
across four blocks. The cell percentages reflect three tiers of precision. All things being
equal, tests involving members of the X set (e.g., Y1 and Y2) have the most power, because
these variables are complete. Variable pairs with 33% missing data introduce a second,
lower tier of precision and power. This tier includes between-­set associations involving
a member of the X set (e.g., Y1 and Y3) and within-­set associations between variables in
the A, B, or C blocks (e.g., Y3 and Y4). Finally, the greatest reductions in power occur
when testing associations between variable pairs with 66% missing data. This includes
all between-­set associations involving members of A, B, or C (e.g., Y3 and Y5).
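The coverage percentages in Table 1.10 follow directly from the form layout in Table 1.8. A short sketch, assuming equal numbers of respondents per form:

```python
import numpy as np

# Three-form design from Table 1.8: rows are forms, columns are the
# X, A, B, and C variable sets (1 = administered, 0 = missing).
forms = np.array([[1, 0, 1, 1],   # Form 1 omits set A
                  [1, 1, 0, 1],   # Form 2 omits set B
                  [1, 1, 1, 0]])  # Form 3 omits set C

# With equal numbers of respondents per form, covariance coverage for a
# pair of sets is the fraction of forms on which both are observed.
coverage = (forms.T @ forms) / forms.shape[0]
print(np.round(coverage, 2))
```

The matrix product counts the forms on which each pair of sets is jointly observed, reproducing the three tiers in the table: 100% for X with itself, 66% for pairs involving X or within A, B, or C, and 33% for between-set pairs among A, B, and C.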

TABLE 1.9. Six‑Form Design
Variable set
Form X A B C D
1 O M M O O
2 O M O M O
3 O O M M O
4 O M O O M
5 O O M O M
6 O O O M M

Note. O, observed; M, missing.



TABLE 1.10. Percentage of Responses within and between Blocks of a Three‑Form Design

            X           A           B           C
        Y1    Y2    Y3    Y4    Y5    Y6    Y7    Y8
X   Y1  100%  100%
    Y2  100%  100%
A   Y3   66%   66%   66%   66%
    Y4   66%   66%   66%   66%
B   Y5   66%   66%   33%   33%   66%   66%
    Y6   66%   66%   33%   33%   66%   66%
C   Y7   66%   66%   33%   33%   33%   33%   66%   66%
    Y8   66%   66%   33%   33%   33%   33%   66%   66%
With these percentages in mind, we can devise strategies for distributing variables
to blocks in a way that mitigates rather than exacerbates the design’s natural inefficien-
cies. First, pairs of variables with strong associations should appear in different blocks
(Raghunathan & Grizzle, 1995; Rhemtulla & Little, 2012; Rhemtulla et al., 2016). This
makes intuitive sense, because a large effect size introduces redundancy that offsets
a lack of observations. This principle has important implications for studies that use
multiple-­item scales to measure complex constructs, where items from the same scale
tend to have much stronger correlations than items belonging to different scales. Dis-
tributing a scale’s items across different sets maximizes power (Graham et al., 1996,
2006; Rhemtulla & Hancock, 2016; Rhemtulla & Little, 2012), especially when using
a latent variable model to examine associations among constructs (Rhemtulla et al.,
2016).
Pairs of variables with weak associations are good candidates for the fully complete
X set, because small effect sizes naturally require more data to achieve adequate power.
Additionally, Graham et al. (2006) recommend assigning key outcome variables to the
X set, as doing so maximizes power to test a study’s main substantive hypotheses. Ana-
lytic work from Rhemtulla et al. (2016) supports this recommendation, as the strategy
maximizes power to detect non-zero regression slopes. Including outcome variables in
the X set also ensures that two-way interaction effects are estimable (Enders, 2010).
Finally, the X set could also include potential determinants or correlates of unplanned
missing data, as conditioning on such variables is necessary to satisfy the MAR assump-
tion (Rhemtulla & Little, 2012). The power analyses in the next section highlight some
of these principles.

Longitudinal Designs
Respondent burden and budgetary constraints can be particularly acute in longitudi-
nal studies where researchers administer assessments repeatedly over time. Extending
the logic of the three-form design, Graham et al. (2001) described a number of wave
missing data designs where each participant provides data at a subset of measurement
occasions. Table 1.11 shows one such design that features seven random subgroups,
six of which have intentional missing data at one wave. Longitudinal planned missing-
ness designs can be especially efficient relative to their complete-­data counterparts. For
example, applying the design in the table to the group-by-time interaction effect from
a linear growth curve model, Graham and colleagues showed that power was 94% as
large as that of a complete-­data analysis. Other designs produce comparable power with
even fewer data points. In situations where the total number of assessments is fixed
(e.g., a grant budget can accommodate 1,000 assessments, each costing $100), Graham’s
chapter further showed that wave missing data designs can achieve higher power than
a corresponding complete-­data design; that is, collecting incomplete data from 300 par-
ticipants can achieve higher power than collecting complete data from 250 participants.
Myriad configurations of patterns are possible with wave missing designs, not all of
which are nearly as beneficial as the ones described earlier. Computer simulation stud-
ies provide details on a few possibilities (e.g., Graham et al., 2001; Mistler & Enders,
2011), and methodologists have outlined general strategies for identifying designs that
maximize efficiency in a particular scenario. Wu, Jia, Rhemtulla, and Little (2016)
developed a computer simulation tool for this purpose called SEEDMC (SEarch for Effi-
cient Designs using Monte Carlo Simulation). Their algorithm creates a design pool con-
taining all possible planned missingness designs with a given number of measurement
occasions, and it uses Monte Carlo computer simulations to create many artificial data
sets for each member of the pool. Fitting a longitudinal model to each artificial data set
and computing the sampling variation of the resulting estimates identifies designs with
the highest relative efficiency (i.e., lowest possible sampling variation). More recently,
Brandmaier, Ghisletta, and von Oertzen (2020) developed an analytic approach that
estimates the measurement error of the individual change rates from a given configura-
tion of measurement occasions. Their method selects the same optimal designs as Monte
Carlo computer simulations, but it does so without intensive computations. I illustrate a
combination of these strategies in Section 10.11.
Wave missing data designs are particularly useful for studies that examine change
following an intervention or a treatment. However, many researchers are interested in

TABLE 1.11. Wave Missing Data Design for a Longitudinal Study
Group % sample Wave 1 Wave 2 Wave 3 Wave 4 Wave 5
1 16.7 O O O O O
2 16.7 M O O O O
3 16.7 O M O O O
4 16.7 O O M O O
5 16.7 O O O M O
6 16.7 O O O O M

Note. O, observed; M, missing.



TABLE 1.12. Cross‑Sequential Design for a Developmental Study
Cohort 12 13 14 15 16 17
12 O O O M M M
13 M O O O M M
14 M M O O O M
15 M M M O O O

Note. O, observed; M, missing.

developmental processes that involve age-­related change (e.g., the development of read-
ing skills in early elementary school, the development of behavioral problems during
the teenage years). Cohort-­sequential (Duncan, Duncan, & Hops, 1996; Nesselroade
& Baltes, 1979) or cross-­sequential designs (Little, 2013; Little & Rhemtulla, 2013)
are ideally suited for this type of research question. This design requires multiple age
cohorts, each of which is followed over a fixed period. These shorter longitudinal stud-
ies combine to produce a much longer developmental span. To illustrate, Table 1.12
shows a cross-­sequential design from a 3-year study with four age cohorts: 12, 13, 14,
and 15. Notice that each cohort has three waves of intentional missing data (e.g., the
12-year-olds have missing data at ages 15, 16, and 17; the 13-year-olds have missing data
at ages 12, 16, and 17; and so on).
The four 3-year studies combine to create a longitudinal design spanning 6 years,
but you must be careful analyzing the data, because several bivariate associations are
inestimable. For example, there are no data with which to estimate the correlation
between scores at ages 12 and 15, 13 and 16, 14 and 17, and so on. This feature rules out
popular multiple imputation procedures that array repeated measurements in columns
(e.g., Schafer, 1997; van Buuren, 2007). However, you can readily use maximum likeli-
hood or Bayesian estimation to fit growth models to the data, and multilevel imputation
schemes that nest repeated measurements within individuals are another possibility
(see Chapter 8).
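The inestimable associations are easy to enumerate by hand, but a small script makes the logic explicit: a covariance between two ages is estimable only if at least one cohort observes both. This sketch simply encodes Table 1.12 as a 0/1 matrix.

```python
import numpy as np

# Cross-sequential design from Table 1.12: rows are the four age cohorts,
# columns are assessment ages 12 through 17 (1 = observed, 0 = missing).
design = np.array([[1, 1, 1, 0, 0, 0],
                   [0, 1, 1, 1, 0, 0],
                   [0, 0, 1, 1, 1, 0],
                   [0, 0, 0, 1, 1, 1]])

# joint[i, j] counts the cohorts that observe both age i and age j; a
# zero means the corresponding covariance cannot be estimated.
joint = design.T @ design
ages = list(range(12, 18))
for i in range(len(ages)):
    for j in range(i + 1, len(ages)):
        if joint[i, j] == 0:
            print(f"ages {ages[i]} and {ages[j]}: no joint coverage")
```

Running the loop flags the pairs named in the text (12 and 15, 13 and 16, 14 and 17) along with the longer lags (12 and 16, 12 and 17, 13 and 17).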

Two‑Method Measurement Designs


The two-­method measurement design (Graham et al., 2006) was developed for situa-
tions in which a researcher has the choice between two measures of a construct, one
of which is expensive and valid (i.e., a “gold standard” measure), the other of which is
inexpensive but less valid. The basic idea is to collect the inexpensive measure from the
entire sample and restrict the expensive measure to a random subset of participants.
Graham et al. give an example from cigarette smoking research where self-­reports with
dubious validity are obtained from the entire sample and “gold standard” biochemical
validators are collected from a smaller subsample. The two-­method design could also
be beneficial with brain imaging studies, where functional magnetic resonance imaging
(fMRI) data are difficult and costly to obtain, but behavioral measures are inexpensive
and easy to collect from a much larger sample.
There are at least two ways to analyze data from a two-­method measurement design.
One approach is to cast the “gold standard” measure in the focal analysis model and use
the inexpensive measure as an auxiliary variable. As a preview, Figure 1.16a shows a
path diagram of the so-­called extra dependent variable model (Graham, 2003) that
features the auxiliary variable (the inexpensive measure) as an additional outcome. The
idea is that the inexpensive measure transmits information to the expensive measure
(and thus enhances the power) via its mutual association with the predictor and a cor-
related residual term (the double-­headed curved arrow connecting the residuals). If the
two measures can be cast as multiple indicators of the same construct, a second option
is to analyze the data with a latent variable model similar to the one in Figure 1.16b.
­Graham et al. (2006) refer to this diagram as a bias-­reduction model, because the cor-
related residual between the two inexpensive measures removes extraneous sources of

[Figure 1.16 about here: (a) Extra Dependent Variable Model (predictor, expensive measure, inexpensive measure); (b) Bias-Reduction Model (predictor, latent factor, two inexpensive measures, and the expensive measure).]

FIGURE 1.16. The top panel shows a path diagram of the extra dependent variable model, and
the bottom panel is a diagram of a bias-reduction model for a two-method measurement design
where inexpensive and expensive measures are indicators of a latent factor.
Introduction to Missing Data 43

correlation that result from invalidity, thus improving the accuracy of the structural
regression coefficient connecting the covariate to the latent outcome. Graham et al.
(2006) and Rhemtulla and Little (2012) provide guidelines for determining the optimal
sample size ratio for the expensive measure, and Monte Carlo computer simulations are
also ideally suited for this task.

1.10 POWER ANALYSES FOR PLANNED MISSINGNESS DESIGNS

This final section illustrates a power analysis for a three-form design. Section 10.10
presents a similar power study for a longitudinal growth curve model with wave missing
data and unplanned missingness. I use computer simulations for this purpose, because
they are relatively easy to implement and are generally applicable to virtually any analy-
sis model. The goal of a computer simulation is to generate many artificial data sets with
known population parameters and examine the distributions of the estimates across
those many samples. In a power analysis, the focus shifts to significance tests, where the
simulation-­based power estimate is the proportion of artificial data sets that produced
a significant test statistic.
The first step of a simulation is to specify hypothetical values for the population
parameters. This is especially important when planning a three-form design, because the
expected effect sizes dictate the assignment of variables to the four sets (e.g., variables
with strong associations can be exposed to large amounts of missingness). I take a some-
what different tack that holds effect size constant to illustrate the design’s natural tenden-
cies and highlight previous findings from the literature. To this end, I considered four nor-
mally distributed variables (one variable per set) with uniformly moderate correlations
equal to .30. The simulation created 5,000 random samples of N = 250 from this popula-
tion, and I subsequently deleted data according to the three-form design in Table 1.8.
Power depends, in part, on the type of parameter being estimated (e.g., the covari-
ance between two variables has different power than a regression slope). To illustrate
this point, I fit two models to each artificial data set: a saturated model consisting of a
mean vector and variance–­covariance matrix, and a three-­predictor linear regression
model with one of the variables arbitrarily designated as the outcome. The assignment
of the outcome variable to the four sets is an important consideration, so I further exam-
ined two design configurations: one with a complete predictor in the X set, and the other
with a complete outcome in the X set. Figure 1.17 shows path diagrams of the four possi-
bilities, with shaded rectangles representing blocks with missing data. I used maximum
likelihood estimation to fit the analysis models to the artificial data sets, and I recorded
the proportion of the 5,000 samples that produced statistically significant estimates.
This proportion is a simulation-­based estimate of the probability of rejecting a false null
hypothesis. Maximum likelihood is the focus of the next two chapters, but for now it is
sufficient to know that the estimator leverages the full sample’s observed data without
discarding any information. Simulation scripts are available on the companion website.
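The simulation logic can be sketched in a few lines. The code below is a hedged illustration, not the companion-website script: it assumes the Table 1.8 layout in which every form receives the X set and each of the three forms omits one of the A, B, or C sets, and it substitutes a simple pairwise-complete correlation test for the maximum likelihood analysis described here, so its power estimates are only approximate.

```python
# Hedged sketch of a simulation-based power analysis for a three-form
# design. Assumes each form omits one of the A, B, or C sets; the test
# uses pairwise-complete cases rather than full maximum likelihood.
import numpy as np

rng = np.random.default_rng(1)
N, RHO, REPS = 250, 0.30, 2000
CRIT = 1.96  # approximate two-tailed critical value

# Population: four standard normal variables (one per set), r = .30
corr = np.full((4, 4), RHO)
np.fill_diagonal(corr, 1.0)

significant = 0
for _ in range(REPS):
    data = rng.multivariate_normal(np.zeros(4), corr, size=N)
    # Columns: X set, A set, B set, C set
    form = rng.integers(0, 3, size=N)        # random form assignment
    observed = np.ones((N, 4), dtype=bool)
    observed[form == 0, 1] = False           # form 1 omits the A set
    observed[form == 1, 2] = False           # form 2 omits the B set
    observed[form == 2, 3] = False           # form 3 omits the C set
    # Correlation between the A- and B-set variables; both are observed
    # only on the form that omits C, about one-third of the sample
    keep = observed[:, 1] & observed[:, 2]
    n = int(keep.sum())
    r = np.corrcoef(data[keep, 1], data[keep, 2])[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r ** 2))  # t test for a correlation
    significant += abs(t) > CRIT
power = significant / REPS
print(round(power, 2))  # roughly .80, in line with the X2-X3 cell of Table 1.13
```

Because the test uses only the roughly one-third of cases with both scores observed, its power should sit near the .82 reported for the X2 ↔ X3 correlation in Table 1.13; a full maximum likelihood analysis would do slightly better by borrowing information from the other variables.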
Table 1.13 gives power estimates for each correlation and regression slope along
with the corresponding power values for optimal analyses with no missing data. To
facilitate interpretation, the power ratios reflect complete-­data power relative to that of

[Figure 1.17 about here: (a) Model 1: Incomplete Outcome; (b) Model 1: Complete Outcome; (c) Model 2: Incomplete Outcome; (d) Model 2: Complete Outcome. Variables are labeled by set membership, e.g., X1 (X set), X2 (A set), X3 (B set), Y (C set).]

FIGURE 1.17. Path diagrams depicting two analysis models and two configurations of planned
missing data. The four sets of the three-form design are color coded, with shaded rectangles rep-
resenting blocks with missing data.

a planned missingness design (e.g., 1.20 means that a complete-­data analysis has 20%
more power). Table 1.13 illustrates several important points, all of which echo findings
from the literature. First, notice that power estimates differ by estimand, with regres-
sion slopes exhibiting lower power than correlations. This isn’t necessarily surprising
given that the coefficients reflect partial associations, but it nevertheless highlights the
importance of considering different analyses that will be performed on the incomplete
data. Second, correlations involving a complete variable in the X set (e.g., the association
in the first row of the table) experienced virtually no reduction in power, even though

TABLE 1.13. Simulation‑Based Power Estimates for a Three‑Form Design

                                    X1 complete              Y complete
Parameter      Optimal power    Power   Power ratio      Power   Power ratio
Correlations
Y ↔ X1 1.00 .98 1.02 .98 1.02
Y ↔ X2 1.00 .83 1.21 .98 1.02
Y ↔ X3 1.00 .84 1.19 .98 1.02
X1 ↔ X2 1.00 .98 1.02 .82 1.22
X1 ↔ X3 1.00 .98 1.02 .82 1.21
X2 ↔ X3 1.00 .82 1.22 .82 1.22

Regression slopes
X1 → Y .85 .61 1.40 .62 1.37
X2 → Y .86 .40 2.12 .63 1.36
X3 → Y .86 .40 2.14 .64 1.35

33% of the other variable’s scores were missing (e.g., the power advantage of a complete-­
data analysis was only about 2%). Third, correlations involving variable sets AB, AC, or
BC (e.g., the correlation between X2 and X3) still had sufficient power, with values above .80,
even though only 33% of score pairs were complete (see Table 1.10). Finally, the bottom
section of Table 1.13 illustrates that assigning the outcome variable to the complete X set
uniformly improves the power of all regression slopes, whereas assigning a predictor to
the X set benefits only that covariate’s slopes. As noted previously, assigning outcomes
to the X set also ensures that all two-way interactions are estimable.

1.11 SUMMARY AND RECOMMENDED READINGS

This chapter described the theoretical underpinnings for missing data analyses, as out-
lined by Rubin and colleagues (Little & Rubin, 1987; Mealli & Rubin, 2016; Rubin,
1976). This work classifies missing data problems according to three different processes
that link missingness to the data: an unsystematic or haphazard missing completely at
random (MCAR) mechanism, a systematic conditionally missing at random (CMAR)
process where missingness relates only to the observed data, and a systematic missing
not at random (MNAR) mechanism where unseen score values determine missingness.
From a practical perspective, these mechanisms function as statistical assumptions for
a missing data analysis, and they also help us understand why not to use older methods
like deletion and single imputation with a mean or predicted value.
Looking forward, most of the book is devoted to methods that naturally require a
conditionally MAR assumption—­maximum likelihood, Bayesian estimation, and mul-
tiple imputation. This mechanism is reasonable for many applications, and flexible soft-
ware options abound. Chapter 9 describes how to modify these methods to model different
MNAR processes. In the near term, maximum likelihood estimation is the next topic
on the docket. Chapter 2 describes the full information estimator for complete data,
and Chapter 3 applies the method to missing data problems. Finally, I recommend the
following articles for readers who want additional details on topics from this chapter:

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological Methods, 6, 330–351.

Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data
designs in psychological research. Psychological Methods, 11, 323–343.

Madley-­Dowd, P., Hughes, R., Tilling, K., & Heron, J. (2019). The proportion of missing data
should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiol-
ogy, 110, 63–73.

Olinsky, A., Chen, S., & Harlow, L. (2003). The comparative efficacy of imputation methods for
missing data in structural equation modeling. European Journal of Operational Research,
151, 53–79.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychologi-
cal Methods, 7, 147–177.
2

Maximum Likelihood Estimation

2.1 CHAPTER OVERVIEW

Maximum likelihood is the go-to estimator for many common statistical models, and
it is one of the three major pillars of this book. As its name implies, the estimator
identifies the population parameters that are most likely responsible for a particular
sample of data. I spend most of the chapter unpacking this statement for analyses with
normally distributed outcomes. Not only are such models exceedingly common across
many different substantive disciplines, but the normal curve also appears prominently
throughout the book as a distribution for missing values. As such, this chapter sets up
a lot of later material. For now, I focus on complete-­data maximum likelihood analyses,
but all the major ideas readily generalize to missing data, and much of Chapter 3 tweaks
concepts from this chapter.
The chapter begins with a simple univariate example that illustrates the mechanics
of estimation and builds to multiple regression. As you will see, maximum likelihood
estimates are equivalent to those of ordinary least squares, as both approaches iden-
tify estimates that minimize squared distances to the data points, albeit in different
ways. After describing significance tests and corrective procedures for non-­normal data,
I illustrate estimation for a mean vector and variance–­covariance matrix. This multi-
variate analysis lays the groundwork for missing data handling in models with general
missing data patterns. Although I mostly discuss models with analytic solutions for the
estimates, I introduce iterative optimization algorithms in this chapter, as they will be
the norm with missing data.

2.2 PROBABILITY DISTRIBUTIONS


VERSUS LIKELIHOOD FUNCTIONS

Probability distributions and likelihood functions play a prominent role throughout the
book, so it is important to introduce these concepts early and establish some recurring
notation. A binary outcome with score values of 0 and 1 provides a simple platform for
exploring some key ideas. As the name implies, a probability distribution is a math-
ematical function that describes the relative frequency of different score values. The
Bernoulli distribution below describes the probability of the two scores:

(1−Yi )  π if Yi =1
f ( Yi | π ) = πYi (1 − π ) = (2.1)
1 − π if Yi =0

The function on the left side of the equation says that the probability of a particular
score value depends on the unknown population proportion π to the right of the vertical
pipe (the pipe means “conditional on” or “depends on”). The right side of the equation
gives the rules for computing the two probabilities.
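For readers who like to verify formulas numerically, Equation 2.1 reduces to a one-line function. This is a sketch, and the function name is mine:

```python
# Sketch of the Bernoulli probability in Equation 2.1:
# f(Y | pi) = pi^Y * (1 - pi)^(1 - Y) for Y in {0, 1}
def bernoulli_prob(y, pi):
    """Relative probability of a 0/1 score given the proportion pi."""
    return pi ** y * (1 - pi) ** (1 - y)

# With pi = .45, the two probabilities are .45 and .55, and they sum to 1
print(round(bernoulli_prob(1, 0.45), 2))  # 0.45
print(round(bernoulli_prob(0, 0.45), 2))  # 0.55
```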
To provide a substantive context, I use the math achievement data on the com-
panion website. Among other things, the data set includes pretest and posttest math
achievement scores and academic-­related variables (e.g., math self-­efficacy, standard-
ized reading scores, sociodemographic variables) for a sample of N = 250 students (see
Appendix). One of the variables in the data is a binary indicator that measures whether
a student is eligible for free or reduced-­price lunch (0 = no assistance, 1 = eligible for
free or reduced-­price lunch). Hypothetically, suppose we knew that the true proportion
of eligible students in the population is π = .45. Figure 2.1 displays the probability dis-
tribution as a bar graph, and its mathematical description is f(Yi|π = .45). I use generic
function notation f(∙) throughout the book to represent the height of a distribution or
curve at some value on its horizontal axis, so “f of something” always refers to vertical
elevation. In this example, f(Yi|π = .45) is just a fancy way of referencing the vertical
height of the bars in Figure 2.1.
The figure and previous equation highlight the defining feature of a probability
distribution: Probabilities must sum to 1. The same is true for continuous probability
distributions like the normal curve, where the area under the curve must equal 1. We
will encounter many different curves and functions throughout the book, not all of
which are probability distributions. The likelihood is one important example. Returning
to Equation 2.1, the function on the left side of the expression has two inputs inside the
parentheses: data values and a parameter. The ordering of the two inputs implies that
the data values vary, but the parameter to the right of the vertical pipe (the “conditional
on” symbol) functions as a known constant; that is, the probability distribution says
how likely certain scores are given an assumed value for π.
After collecting data, the function is “reversed” by treating scores as known and
varying the parameter π. The resulting likelihood function describes the relative fre-
quency of different parameter values given the observed data. For example, suppose that
we collect data from a single student who is eligible for free or reduced-­price lunch (i.e.,
Y = 1). Reversing the role of the data and the parameter in the function gives the follow-
ing likelihood expression:

L_i(\pi \mid Y_i = 1) = \pi^{1}(1 - \pi)^{0} = \pi \qquad (2.2)

[Figure 2.1 about here: bar graph of the two response proportions (0 = No, 1 = Yes).]

FIGURE 2.1. The probability distribution for a binary variable that measures whether a stu-
dent is eligible for free or reduced-price lunch. The bar graph corresponds to a distribution
where the true proportion π = .45.

The left side of the equation now says that the likelihood of a particular parameter value
depends on the observed data. Consistent with the previous function notation, “L of
something” references the height of the distribution at a particular value on the horizon-
tal axis, but the abscissa now reflects all possible values of π between 0 and 1.
To illustrate the effect of reversing the function’s arguments, Figure 2.2 graphs the
likelihood in Equation 2.2 across the entire range of π. The height of the graph—the
likelihood of the parameter given the observed data— quantifies the data’s support
for every possible value of π. Two points are worth highlighting. First, the probabil-
ity distribution of the data is discrete, but the likelihood function is a continuous
distribution. Second, notice that the function defines a triangle with an area equal to
0.50. Thus, by the previous definition, the likelihood is not a probability distribution,
because the area under the function does not equal 1. This distinction is important,
because it is incorrect to say that Li(π|Yi) describes the probability of the parameter
given the data—that interpretation is reserved for a Bayesian analysis. Rather, you
should view likelihood as a function that describes the data’s evidence or support for
different parameter values. As you will see later in the chapter, the likelihood function
provides the mathematical machinery for identifying parameter values that maximize
fit to the observed data.

[Figure 2.2 about here: the likelihood function L(π | Y = 1) plotted across the 0–1 range.]

FIGURE 2.2. The likelihood function describing the relative frequency of different parameter
values given a single observation where Y = 1. The height of the graph—the likelihood of the
parameter given the observed data—­quantifies the data’s support for every possible value of π.

2.3 THE UNIVARIATE NORMAL DISTRIBUTION

The applications of maximum likelihood estimation in this chapter primarily lever-


age the normal distribution. The normal curve is a reasonable approximation for many
continuous variables that we encounter in the behavioral and social sciences, and it
also appears prominently later in the book as a latent response distribution for cat-
egorical variables (Albert & Chib, 1993; Johnson & Albert, 1999). A univariate analysis
example is a useful starting point, because the basic estimation principles from this
simple context readily generalize to more complicated analyses. Continuing with the
math achievement data, I use the math posttest scores to illustrate how to estimate the
mean and variance with maximum likelihood. As you will see, the mechanics of this
simple example readily extend to more complex analyses.
To begin, the probability distribution for a normally distributed variable is

f(Y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(Y_i - \mu)^2}{\sigma^2}\right) \qquad (2.3)
where Yi is the outcome score for participant i (e.g., a student’s math posttest score), and
μ and σ2 are the population mean and variance. To reiterate some important notation,
the function on the left side of the equation can be read as “the relative probability of
a score given assumed values for the parameters.” Visually, “f of Y” is the height of the
normal curve at a particular score value on the horizontal axis. Dissecting the right side
of the expression, the kernel inside the exponential function defines the curve’s shape.
Notice that the main component is a squared z-score that quantifies the standardized
distance between a score and the mean. Finally, the fraction to the left of the exponential
function is a scaling term that ensures that the area under the curve sums or integrates
to 1. This scaling term makes the function a probability distribution.
From the previous section, you know that a probability distribution treats scores as
variable and parameters as known constants. To illustrate, assume that the true popula-
tion parameters are μ = 56.79 and σ2 = 87.72 (these happen to be the maximum likeli-
hood estimates). Next, consider two math scores, Y1 = 53 and Y2 = 45. Substituting these
scores and the parameter values into Equation 2.3 gives f(Y = 53|μ, σ2) = 0.039 and
f(Y = 45|μ, σ2) = 0.019. As seen in Figure 2.3, “f of something” refers to the height of the
normal curve at a particular score value on the horizontal axis. Although these verti-
cal coordinates look like probabilities, they are not—the probability of any one score is
effectively 0, because the horizontal axis can be sliced into a countless number of infini-
tesimally small intervals. Rather, the height coordinates represent relative probabilities.
[Figure 2.3 about here: a normal curve with the two score heights marked as black dots.]

FIGURE 2.3. A normal distribution with parameters μ = 56.79 and σ2 = 87.72. The black dots
are the relative probabilities for two math scores: f(Y = 53|μ, σ2) = 0.039 and f(Y = 45|μ, σ2) =
0.019.

For example, it is incorrect to say that 3.9% of all students from this population have a
test score of 53, but you can say that a score of 53 is about twice as likely as a score of 45,
because its vertical elevation is twice as high.
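The two height coordinates are simple to reproduce from Equation 2.3. A sketch, with a function name of my own choosing:

```python
# Sketch: relative probabilities (normal curve heights) for the two
# math scores in Figure 2.3, using the Equation 2.3 density with
# mu = 56.79 and sigma^2 = 87.72.
import math

def normal_density(y, mu, sigma2):
    """Height of the normal curve at score y (Equation 2.3)."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

f53 = normal_density(53, 56.79, 87.72)
f45 = normal_density(45, 56.79, 87.72)
print(round(f53, 3), round(f45, 3))  # 0.039 0.019
print(round(f53 / f45, 2))           # a score of 53 is about twice as likely
```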

The Likelihood and Log‑Likelihood Functions


The goal of maximum likelihood estimation is to identify the population parameter
values most likely to have produced a particular sample of data. After collecting data,
the function is “reversed” by treating scores as known and varying the parameters. The
likelihood expression for a single observation is

L_i(\mu, \sigma^2 \mid Y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(Y_i - \mu)^2}{\sigma^2}\right) \qquad (2.4)
where Li represents one observation’s support for a particular combination of the mean
and variance. The likelihood expression might seem like a notational sleight of hand
since the right side of the expression is identical to Equation 2.3. However, the notation
on the left side of the equation signals an important shift: The probability distribu-
tion views scores as hypothetical and parameters as known, whereas likelihood views
parameters as hypothetical and scores as known. Applied to the math achievement data,
Equation 2.4 quantifies the degree to which one observation from this sample supports
different values of μ and σ2.
Identifying the maximum likelihood estimates requires a summary measure that
quantifies the entire sample’s evidence about the unknown parameter values. From
probability theory, the product of individual probabilities describes the joint occurrence
of a set of independent events. For example, the probability of flipping a fair coin twice
and observing two heads in a row is .50 × .50 = .25. Applying this rule to the individual
likelihood expressions gives the sample likelihood function.
L(\mu, \sigma^2 \mid \text{data}) = \prod_{i=1}^{N} L_i(\mu, \sigma^2 \mid Y_i) \qquad (2.5)

Extending previous ideas, the likelihood quantifies a particular sample’s support for dif-
ferent values of μ and σ2. Visually, the likelihood function describes a three-­dimensional
surface with the population mean and variance on the horizontal and depth axes and
L as the height of the surface at each unique combination of the two parameters. It
is important to reiterate that the likelihood function is not a probability distribution,
because the area under the surface does not equal 1.
Applying Equation 2.5 to the math data involves multiplying 250 very small num-
bers, each of which requires many decimals to achieve good precision. As you can imag-
ine, the resulting product is infinitesimally small. Taking the natural logarithm of the
relative probabilities provides a more tractable metric. This transformation maps prob-
abilities onto the negative side of the number line, with higher probabilities taking on
“less negative” values than lower probabilities. To illustrate, reconsider the pair of math
scores and the parameters from the previous example: Y1 = 53, Y2 = 45, μ = 56.79, and σ2 =
Maximum Likelihood Estimation 53

87.72. Transforming the relative probabilities to the logarithmic scale gives ln(0.039) =
–3.24 and ln(0.019) = –3.96. Figure 2.4 shows that –3.24 and –3.96 also represent height
coordinates, but the log transformation has changed the normal curve to a parabola.
Nevertheless, the conclusion is the same: A score of 53 is more likely than a score of 45.
Working with logarithms changes the structure of the likelihood, because the loga-
rithm product rule says to add rather than multiply the transformed likelihood values
(i.e., ln(A × B) = ln(A) + ln(B)). Applying the product rule gives the log-­likelihood func-
tion below.

 1  1 ( Y − μ )2 
) ∑ ( ( )) ∑
N N
2
LL μ, σ= | data ( ln Li μ, σ
= 2
| Yi ln 
 2
exp  −
 2
i

σ2


=i 1 =i 1  2πσ   (2.6)
N
N N
( ) ( ) ∑ (Y − μ )
1 −1
ln ( 2π ) − ln σ2 − σ2
2
=− i
2 2 2 i =1

Visually, the log-­likelihood function defines a three-­dimensional surface, the height of


which represents the data’s support for a unique combination of the parameters. Figure
2.5 shows the likelihood surface for a range of different parameter combinations. To get
a better look at the surface, Figure 2.6 is a contour plot that conveys the perspective of a
[Figure 2.4 about here: the natural log of the normal curve, a parabola, with the two score heights marked as black dots.]

FIGURE 2.4. Natural logarithm of a normal distribution with parameters μ = 56.79 and σ2 =
87.72. The black dots represent the natural log of two relative probabilities: ln(.039) = –3.24 and
ln(.019) = –3.96.
drone hovering over the peak of the log-­likelihood surface, with smaller contours denot-
ing higher elevation (and vice versa). The data’s support for the parameters increases as
the contours get smaller, and the maximum likelihood estimates are located at the peak
of the surface, shown as a black dot. The goal of estimation is to identify the parameter values
at that coordinate.
As you might have surmised, the log-­likelihood value will always be a large nega-
tive number, because it sums individual fit values that are themselves usually negative
numbers. For example, the peak of the function in the previous figures has a vertical
elevation of LL = –913.999, and the log-­likelihood values decrease (i.e., become more
negative) as μ and σ2 move away from their optimal values for the data. Several factors
influence the log-­likelihood value (e.g., the sample size, the number of variables, the
amount of missing data), and there is no cutoff that determines good or bad fit to the
data. However, we can use the log-­likelihood to make relative judgments about different
candidate parameter values. These relative fit assessments are an integral part of estima-
tion and hypothesis testing.
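As a numerical check on Equation 2.6, the sum of the individual log densities should match the expanded closed form exactly. A sketch with simulated scores standing in for the math achievement data (which are on the companion website):

```python
# Sketch: the two sides of Equation 2.6 agree. Summing the log of each
# case's normal density (left side) matches the expanded closed form
# (right side). Simulated scores stand in for the math posttest data.
import math
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(56.79, math.sqrt(87.72), size=250)

def loglik_sum(mu, sigma2, y):
    """Left side of Equation 2.6: sum the log of each case's density."""
    dens = np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return float(np.sum(np.log(dens)))

def loglik_closed(mu, sigma2, y):
    """Right side of Equation 2.6: the expanded closed form."""
    n = len(y)
    return (-n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2)
            - float(np.sum((y - mu) ** 2)) / (2 * sigma2))

print(math.isclose(loglik_sum(56.79, 87.72, scores),
                   loglik_closed(56.79, 87.72, scores), rel_tol=1e-9))  # True
```

With the real data, evaluating either form at μ = 56.79 and σ2 = 87.72 would return the peak elevation of roughly –914 reported in the text.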

[Figure 2.5 about here: three-dimensional log-likelihood surface plotted over the population mean and population variance.]

FIGURE 2.5. Bivariate log-­likelihood surface for different values of μ and σ2. The height of the
surface represents the data’s support for different combinations of the mean and the variance.
Note that the floor of the function is located well below the minimum value on the vertical axis.
[Figure 2.6 about here: contour plot of the log-likelihood surface over the population mean and population variance.]

FIGURE 2.6. Contours of the log-­likelihood surface at different values of μ and σ2. The plot
conveys the perspective of a drone hovering over the peak of the log-­likelihood surface, with
smaller contours denoting higher elevation (and vice versa). The height of the surface represents
the data’s support for different combinations of parameter values, and the maximum likelihood
estimates are located at the peak of the surface (shown as a black dot).

2.4 ESTIMATING UNKNOWN PARAMETERS

The key take-home message thus far is that “reversing” a probability distribution by
treating the observed data as known constants defines a log-­likelihood function that
measures the data’s support for different candidate parameter values. The goal of esti-
mation is to identify the parameter values that maximize the log-­likelihood function,
as these are the values that garner the most support from the data. Visually, this corre-
sponds to finding the peak of the three-­dimensional surface in Figures 2.5 and 2.6. The
resulting estimates are optimal in the sense that they minimize the sum of the squared
z-scores in the normal distribution function. There are three main ways to find the max-
imum likelihood estimates: (a) a grid search that computes the log-­likelihood value for
each unique combination of the parameter values, (b) an analytic solution that provides
an equation for solving the estimates, and (c) an iterative optimization algorithm. The
first approach is usually too unwieldly and inefficient for practical applications, but it is
a good starting point for this simple example, because it illustrates important concepts.
I describe analytic solutions and optimization algorithms later in the chapter.

To illustrate the mechanics of a grid search, Table 2.1 shows individual and sample
log-­likelihood values at five different estimates of the population mean (to keep the
illustration simple, I held the variance constant at its maximum likelihood estimate). As
you might expect, an individual’s contribution to the log-­likelihood differs across the
five estimates, because a given score offers more support for some parameter values than
others (i.e., the standardized distances from the scores to the center of the normal curve
change with different values of μ). The summary log-­likelihood values in the bottom
row of Table 2.1 similarly fluctuate as a function of the population mean. As explained
previously, the log-­likelihood summarizes the data’s support for a particular combina-
tion of parameter values, such that higher (i.e., less negative) values reflect better fit to
the data. If the five means in the table were our only options, we would choose μ̂ = 57
as the maximum likelihood estimate, because this parameter value maximizes fit to the
sample data (i.e., minimizes the sum of the squared z-scores).
Next, I conducted a comprehensive grid search that varied the population mean
in tiny increments of 0.01 and plotted the resulting log-­likelihood values in Figure 2.7.
As you can see, the function resembles a hill or a parabola, with the optimal parameter
value located at its peak. This brute-force estimation process revealed that the curve’s
highest elevation, LL = –913.999, is located at μ = 56.79, and no other value of the mean
has more support from the data. As such, μ̂ = 56.79 is the maximum likelihood estimate
of the mean, or the population parameter with the highest probability of producing this
sample of 250 math scores. I applied the same grid search procedure to the variance after
fixing the mean at its maximum likelihood estimate. Figure 2.8 shows the resulting log-­
likelihood function. Although the function looks very different—­the right skew owes

TABLE 2.1. Individual and Sample Log‑Likelihoods at Five Values of μ


Y μ = 53 μ = 55 μ = 57 μ = 59 μ = 61
63 –3.72601 –3.52081 –3.36120 –3.24720 –3.17880
53 –3.15599 –3.17880 –3.24720 –3.36120 –3.52081
71 –5.00285 –4.61524 –4.27323 –3.97682 –3.72601
53 –3.15599 –3.17880 –3.24720 –3.36120 –3.52081
57 –3.24720 –3.17880 –3.15599 –3.17880 –3.24720
55 –3.17880 –3.15599 –3.17880 –3.24720 –3.36120
59 –3.36120 –3.24720 –3.17880 –3.15599 –3.17880
... ... ... ... ... ...
54 –3.16170 –3.16170 –3.20730 –3.29850 –3.43530
71 –5.00285 –4.61524 –4.27323 –3.97682 –3.72601
49 –3.24720 –3.36120 –3.52081 –3.72601 –3.97682
54 –3.16170 –3.16170 –3.20730 –3.29850 –3.43530
61 –3.52081 –3.36120 –3.24720 –3.17880 –3.15599
51 –3.17880 –3.24720 –3.36120 –3.52081 –3.72601
38 –4.43853 –4.80334 –5.21375 –5.66977 –6.17138

Sums –934.4897 –918.5749 –914.0604 –920.9462 –939.2323



[Figure 2.7 about here: log-likelihood plotted against candidate values of the population mean.]

FIGURE 2.7. Likelihood function with respect to the mean, holding the variance constant
at its sample estimate. The log-likelihood on the vertical axis represents the data’s support for a
particular parameter value. The peak of the function is the maximum likelihood estimate of the
mean.
[Figure 2.8 about here: log-likelihood plotted against candidate values of the population variance.]

FIGURE 2.8. Likelihood function with respect to the variance, holding the mean constant at
its maximum likelihood estimate. The log-likelihood on the vertical axis represents the data’s
support for a particular parameter value. The maximum likelihood estimate of the variance is
located at the peak of the function.

to the fact that the variance is bounded at 0 on the low end—the graph nevertheless
displays the data’s support for different parameter values. The brute-force grid search
revealed that σ̂2 = 87.72 is the maximum likelihood estimate of the variance.
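A minimal version of the grid search is easy to script. The sketch below uses simulated scores in place of the math posttest data, so the recovered estimate is the simulated sample's mean rather than 56.79:

```python
# Sketch of the grid search from Section 2.4: evaluate the log-
# likelihood over a fine grid of candidate means (variance held fixed)
# and keep the value with the highest elevation. With the variance
# fixed, the winning grid point should sit at the sample mean.
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(56.79, np.sqrt(87.72), size=250)
sigma2 = y.var()  # hold the variance constant, as in Figure 2.7

def loglik(mu):
    """Equation 2.6 evaluated at a candidate mean."""
    n = len(y)
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2)
            - np.sum((y - mu) ** 2) / (2 * sigma2))

grid = np.arange(40.0, 75.0, 0.01)  # candidate means in increments of .01
mu_hat = grid[np.argmax([loglik(mu) for mu in grid])]
print(abs(mu_hat - y.mean()) <= 0.01)  # True: the peak sits at the sample mean
```

The same brute-force loop over candidate variances (with the mean fixed) reproduces the right-skewed curve in Figure 2.8.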

2.5 GETTING AN ANALYTIC SOLUTION

You can imagine that a grid search quickly becomes impractical as the number of model
parameters increases. A second approach is to derive an equation that gives an ana-
lytic solution for the maximum likelihood estimates. Although this strategy has limited
applications, the mechanics of getting the solution—­in particular, leveraging calculus
derivatives—­sets the stage for the iterative optimization algorithms that I discuss later
in the chapter.
To begin, a first derivative is a slope coefficient. Returning to Figure 2.7, the log-­
likelihood function is a parabolic curve. Imagine using a magnifying glass to zoom in
on the log-­likelihood function within a very narrow slice along the horizontal axis.
Although the entire function has substantial curvature, magnifying the log-­likelihood
at a particular point on the curve would reveal a straight line. Thus, you can think of
the curved function in Figure 2.7 as stringing together a sequence of very tiny straight
lines, the direction and magnitude of which vary as you move from left to right on the
horizontal axis. These linear slopes are the first derivatives of the function. To infuse a
bit more precision, the first derivative is the slope of a line that is tangent to the function
at a particular value on the horizontal axis. To illustrate, Figure 2.9 shows the deriva-
tives at five values of μ. I refer to these slopes as the first derivatives of the log-­likelihood
function with respect to the mean, because the variance (the other unknown quantity in
the function) is held constant. First derivatives are central to finding an equation for
the maximum likelihood estimates, and they also appear prominently in the iterative
optimization algorithms I discuss later in the chapter.
Moving from left to right across Figure 2.9, the derivatives decrease in magnitude
(i.e., the slopes flatten) as elevation rises, and the slope is exactly 0 at the function’s
peak. The fact that the first derivative is 0 at the point on the function directly above the
maximum likelihood estimate suggests that we can set the derivative expression to 0
and solve for the unknown parameter. First, we need the derivative equations. I give the
expressions below, and introductory calculus resources catalog the differential calculus
rules for getting the first derivatives of a function. To begin, the first derivative of the
log-­likelihood function with respect to the mean (i.e., the linear slopes in Figure 2.9) is
as follows:
\frac{\partial LL}{\partial \mu} = \left(\sigma^2\right)^{-1} \sum_{i=1}^{N} \left(Y_i - \mu\right) \qquad (2.7)

In words, the left side of the expression reads “the first derivative of the log-­likelihood
function with respect to the mean,” where ∂ is a common symbol for a derivative, and
the fraction denotes the differential operator. Setting the right side of the equation equal
to 0 and solving for μ gives the maximum likelihood estimate of the mean.

[Figure: log-likelihood (vertical axis, –1200 to –900) plotted against the population mean (horizontal axis, 40 to 75), with tangent lines at five points.]

FIGURE 2.9. Likelihood function with respect to the mean, holding the variance constant at its maximum likelihood estimate. The dashed lines represent first derivatives, or slopes of lines tangent to the function at each black dot.

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \bar{Y} \qquad (2.8)

Notice that μ̂ is identical to the familiar formula for the arithmetic mean. Consistent
with the previous grid search, applying the expression to the math posttest scores gives
a maximum likelihood estimate of μ̂ = 56.79.
Differentiating the log-likelihood function with respect to the variance gives the
slopes of tangent lines at different points on the function in Figure 2.8.
\frac{\partial LL}{\partial \sigma^2} = -\frac{N}{2}\left(\sigma^2\right)^{-1} + \frac{1}{2}\left(\sigma^2\right)^{-2} \sum_{i=1}^{N} \left(Y_i - \mu\right)^2 \qquad (2.9)

Again, setting the right side of the equation equal to 0 and solving for σ² gives the maximum
likelihood estimate of the variance.
\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - \hat{\mu}\right)^2 \qquad (2.10)

Notice that the maximum likelihood solution has N rather than N – 1 in the denomi-
nator. We know that applying the equation for the population variance to sample data

gives negatively biased estimates that underrepresent population-­level variation, and


the same is true for maximum likelihood variance estimates. This bias is an issue in
small samples but quickly becomes negligible as N increases. Such is the case with the
math achievement data, where the maximum likelihood and unbiased estimates are
very similar: σ̂2 = 87.72 and s2 = 88.07.
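The analytic solution is equally simple to verify numerically. The following Python sketch (simulated data standing in for the math posttest scores; the book's companion programs are in R) applies Equations 2.8 and 2.10 and confirms that the first derivative from Equation 2.7 is zero at the solution; the names are illustrative.

```python
import math
import random

# Simulated stand-in for the math posttest scores
random.seed(2)
y = [random.gauss(56.79, math.sqrt(87.72)) for _ in range(250)]
N = len(y)

# Equation 2.8: the ML estimate of the mean is the arithmetic average
mu_hat = sum(y) / N

# Equation 2.10 vs. the unbiased formula: N versus N - 1 in the denominator
ss = sum((yi - mu_hat) ** 2 for yi in y)
var_ml = ss / N
var_unbiased = ss / (N - 1)

# Equation 2.7: the first derivative equals zero at the ML estimate
slope_at_peak = (1 / var_ml) * sum(yi - mu_hat for yi in y)

print(round(var_ml, 2), round(var_unbiased, 2))
```

With N = 250 the two variance estimates differ by less than half a percent, consistent with the point that the bias of the ML variance is negligible except in small samples.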

2.6 ESTIMATING STANDARD ERRORS

The log-­likelihood function provides a mechanism for estimating standard errors, and
this, too, relies on calculus derivatives. The process lends itself well to graphical dis-
plays, so I interleave a conceptual description with the technical details. To set the stage,
Figure 2.10 shows the log-­likelihood functions for two data sets with the same mean but
different variance. The solid curve, which is identical to Figure 2.7, corresponds to the
math posttest data, and the flatter dashed function comes from a data set with 50% more
variance (i.e., σ̂2 = 131.58 vs. 87.72).
The curvature of the log-­likelihood function (i.e., its steepness or flatness) deter-
mines the precision of the maximum likelihood estimate at its peak. To understand why
this is the case, recall that the log-­likelihood quantifies the data’s evidence for different
candidate parameter values. Looking at the solid curve in Figure 2.10, you can see that the
[Figure: log-likelihood (vertical axis, –1200 to –900) plotted against the population mean (horizontal axis, 40 to 75) for both data sets.]

FIGURE 2.10. Log-likelihood functions for two data sets with the same mean but different variance. The solid curve, which is identical to Figure 2.7, corresponds to the math posttest data, and the flatter dashed function comes from a data set with 50% more variance.

data’s support for competing parameter values decreases rapidly as μ moves away from its
optimal value in either direction. In contrast, the dashed curve is much flatter, meaning
that the data provide similar support for a range of parameter values near the peak. As
such, the steeper function reflects a more precise estimate with a smaller standard error.
This makes intuitive sense if you think about estimation as a hiker trying to climb to the
highest possible elevation on a mountain. A climber standing at the top of a steep peak
would be very certain about reaching the exact summit, because elevation drops quickly
in every direction, whereas a climber standing on a flatter plateau would be less confident
about the summit’s precise location. To apply this idea to data, we need to figure out how
to quantify curvature of the log-likelihood and translate that into a standard error.

Second Derivatives
Measuring curvature and computing standard errors requires the second derivatives of
the log-likelihood function. These second derivatives, which are also slope coefficients,
have an intuitive visual interpretation. To illustrate, Figure 2.11 displays the first deriva-
tives of the two log-likelihood functions from Figure 2.10. Moving from left to right,
the linear slopes along the steep curve vary substantially, changing from large positive
values on the left to large negative values on the right. Conversely, the slopes along the
[Figure: log-likelihood (vertical axis, –1200 to –900) plotted against the population mean (horizontal axis, 40 to 75), with tangent lines on both curves.]

FIGURE 2.11. Log-likelihood functions for two data sets with the same mean but different variance. The straight lines represent first derivatives. The steep function has rapidly changing first derivatives and thus a large second derivative, whereas the flatter function has a smaller second derivative, because its slopes don't change as much near the peak.

flatter curve exhibit less variability, ranging from moderately positive to moderately
negative. Mathematically, a second derivative captures the rate at which the first deriva-
tive slopes change across the log-­likelihood function. For example, the steep function
in Figure 2.11 has rapidly changing first derivatives and thus a large second derivative.
Conversely, the flatter function has a smaller second derivative, because its slopes don’t
change as much near the function’s peak.
Second derivatives can be confusing, because they are metaquantities that capture
the rate of change in the linear slopes; that is, they are equations that give the slope of the
slopes. A regression analogy is useful for sorting this out. Returning to Figure 2.9, you can
think of the curve as a nonlinear regression line that predicts the log-­likelihood at different
values of the parameter (i.e., the parameter is the predictor variable, and the log-­likelihood
is the outcome). The linear term from this regression, which is the first derivative, tells us
how much the log-­likelihood changes for an infinitesimally small increase in the parame-
ter. The second derivative is also a slope from a regression, but that regression now predicts
the first derivatives at different values of the parameter (i.e., the parameter is the predictor
variable, and the first derivative is the outcome). Because the linear slopes change at a
constant rate across the parabolic function, the second derivative reflects the change in the
slope for each one-unit increase in the population mean. The regression analogy highlights
that first and second derivatives are just the same concept applied to different variables.
To illustrate second derivatives more concretely, reconsider the first derivative slope
expression from Equation 2.7. We know that substituting μ̂ = 56.79 into the formula (i.e.,
evaluating the function at the maximum likelihood estimate) returns a slope coefficient
of 0. Next, we can use the expression to compute the first derivative after increasing
or decreasing the mean by 1 point. Starting with the steep curve in Figure 2.11, sub-
stituting μ = 55.79 and 57.79 into the equation gives first derivatives equal to +2.85 and
–2.85, respectively. Thus, we can verify that a one-unit increase in the population mean
changes the first derivative (i.e., the slope of the log-­likelihood at a particular point) by
–2.85. This value is the second derivative! Moving to the flatter function, substituting
the same two estimates into the equation gives first derivatives equal to +1.90 and –1.90,
respectively. A one-unit increase in the population mean now induces smaller changes
in the linear slopes, because the log-­likelihood function is less peaked. As you can see,
larger second derivatives (in absolute value) reflect greater curvature and more preci-
sion, whereas smaller second derivatives imply less curvature.
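This numerical check is easy to replicate. The Python sketch below (simulated stand-in data, illustrative names) evaluates Equation 2.7 one point below and one point above the estimate and recovers the second derivative, which should equal –N/σ̂² from Equation 2.11; with the book's values (N = 250, σ̂² = 87.72), that ratio is –2.85.

```python
import math
import random

# Simulated stand-in data; with the book's values (N = 250, variance 87.72)
# the second derivative is -250 / 87.72 = -2.85
random.seed(3)
y = [random.gauss(56.79, math.sqrt(87.72)) for _ in range(250)]
N = len(y)
mu_hat = sum(y) / N
var_hat = sum((yi - mu_hat) ** 2 for yi in y) / N

def first_deriv_mu(mu):
    """Equation 2.7: slope of the log-likelihood at a candidate mean."""
    return (1 / var_hat) * sum(yi - mu for yi in y)

# Slopes one point below and one point above the ML estimate
slope_below = first_deriv_mu(mu_hat - 1)
slope_above = first_deriv_mu(mu_hat + 1)

# Change in the slope per one-unit increase in the mean = second derivative
second_deriv = (slope_above - slope_below) / 2
print(round(second_deriv, 3), round(-N / var_hat, 3))
```

Because the first derivative in Equation 2.7 is a linear function of μ, the difference quotient matches the analytic second derivative exactly, not just approximately.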
I previously explained that second derivatives are the same concept as a first deriva-
tive but applied to a different dependent variable (a function of the original function).
As such, getting the second derivatives involves applying differential calculus rules to
the slope equations from Equations 2.7 and 2.9. To begin, the second derivative of the
log-­likelihood function with respect to the mean (i.e., the curvature of the function in
Figure 2.7) is as follows:

\frac{\partial^2 LL}{\partial \mu^2} = -\frac{N}{\sigma^2} \qquad (2.11)

Substituting σ̂2 = 87.72 (the maximum likelihood estimate) and N = 250 into the expres-
sion verifies the earlier conclusion that the second derivative equals –2.85. The second

derivative of the log-­likelihood function with respect to the variance (i.e., the curvature
of the function in Figure 2.8) is as follows:
\frac{\partial^2 LL}{\partial \left(\sigma^2\right)^2} = \frac{N}{2}\left(\sigma^2\right)^{-2} - \left(\sigma^2\right)^{-3} \sum_{i=1}^{N} \left(Y_i - \mu\right)^2 \qquad (2.12)

Substituting σ̂2 = 87.72 and N = 250 in the expression gives a second derivative equal to
–.016. Because the log-­likelihood function in Figure 2.8 has multiple bends, the rate of
change in the linear slopes is no longer constant going from left to right. Thus, we need
to view the second derivative as curvature at the function’s peak. Again, you can think
of this number (in absolute value) as the estimate’s precision.
You probably noticed that the values of the second derivative were both negative.
In fact, this is not a coincidence, as the sign of the second derivative signals whether a
solution corresponds to the maximum or the minimum of a function. To understand
why this is the case, imagine a U-­shaped log-­likelihood function that is a mirror image
of the parabola in Figure 2.7. When applied to a U-­shaped function, the first derivative
takes on a value of 0 at the lowest point on the curve (i.e., the bottom of a valley instead
of the peak of a hill). The sign of the second derivative differentiates the minimum and
maximum of a function and thus tells us whether an estimate is located at the bottom
of a trough or the peak of a hill. To illustrate, imagine traversing a U-­shaped function
moving from left to right. Contrary to the derivatives displayed in Figure 2.9, the linear
slopes from an inverted function change from large negative values to large positive
values; that is, a one-unit increase to the parameter increases rather than decreases the
slopes, thus giving a positive second derivative. Consequently, the fact that the second
derivatives were negative is important, because it signals that the estimates are, in fact,
located at the peak of the surface.

From Second Derivatives to Standard Errors


With second derivatives in hand, we can now compute standard errors. This process
involves three steps: (1) Multiply each derivative by –1, (2) compute its reciprocal, and
(3) take the square root. To begin, multiplying the second derivative by –1 gives a quan-
tity known as information or Fisher information (after statistician Ronald Fisher). This
step rescales the derivative so that large positive values reflect greater precision or confi-
dence in the estimate. Second, computing the reciprocal or inverse of information gives
the sampling variance, or the expected squared difference between the estimate and the
true population parameter. Applying these first two steps to the mean and variance gives
the following expressions for their sampling variances:
\mathrm{var}\left(\hat{\mu}\right) = -\left(-\frac{N}{\hat{\sigma}^2}\right)^{-1} = \frac{\hat{\sigma}^2}{N} \qquad (2.13)

\mathrm{var}\left(\hat{\sigma}^2\right) = -\left[\frac{N}{2}\left(\hat{\sigma}^2\right)^{-2} - \left(\hat{\sigma}^2\right)^{-3} \sum_{i=1}^{N} \left(Y_i - \hat{\mu}\right)^2\right]^{-1} = 2\left(\hat{\sigma}^2\right)^3 \left[-N\hat{\sigma}^2 + 2\sum_{i=1}^{N} \left(Y_i - \hat{\mu}\right)^2\right]^{-1} \qquad (2.14)

Finally, taking the square root of the sampling variance gives the standard error. Notice
that the square root of Equation 2.13 is the familiar formula for the standard error of
the mean, σ̂/√N.
To illustrate standard error computations, reconsider the two log-­likelihood func-
tions in Figure 2.11. The steeper curve corresponds to the math achievement data from
the companion website, which has a variance σ̂2 = 87.72. Substituting this estimate into
Equation 2.13 gives a sampling variance equal to var(μ̂) = 0.35 and a standard error
equal to SEμˆ = 0.59. Consistent with the usual interpretation of a standard error, 0.59 is
the expected difference between the maximum likelihood estimate and the true popula-
tion mean, or the standard deviation of estimates from many random samples of size
250. As a comparison, the dashed curve corresponds to a transformed data set with 50%
more variance. Substituting σ̂2 = 131.58 into Equation 2.13 returns a sampling variance
and standard error equal to var(μ̂) = 0.53 and SEμˆ = 0.73, respectively. These results rein-
force the previous conclusion that steeper functions with more curvature reflect greater
precision and smaller standard errors.
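The three-step recipe maps directly onto a few lines of code. The Python sketch below reproduces the two standard errors from this example using the book's reported values (N = 250, σ̂² = 87.72 and 131.58); the function name is illustrative.

```python
import math

# Book values: N = 250, with ML variances 87.72 (steep curve) and 131.58 (flat curve)
def se_of_mean(var_hat, n):
    """Three steps: negate the second derivative, invert, take the square root."""
    second_deriv = -n / var_hat      # Equation 2.11, evaluated at the estimates
    information = -second_deriv      # step 1: Fisher information
    sampling_var = 1 / information   # step 2: sampling variance (var_hat / n)
    return math.sqrt(sampling_var)   # step 3: standard error

print(round(se_of_mean(87.72, 250), 2))   # steep curve: 0.59
print(round(se_of_mean(131.58, 250), 2))  # flatter curve: 0.73
```

The output matches the standard errors reported in the text: the flatter log-likelihood (more variance, less curvature) produces the larger standard error.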

2.7 INFORMATION MATRIX AND PARAMETER


COVARIANCE MATRIX

The log-­likelihood function in Equation 2.6 varies as a function of two unknowns.


Although the univariate analysis allows us to consider each parameter separately, with-
out regard to the other, changes to one parameter generally correlate with changes to
another. Returning to the three-­dimensional surface in Figure 2.5, the presence of such
a correlation implies that curvature or elevation changes along one axis systematically
track with elevation changes along the other. Although the mean and variance happen
to be uncorrelated in this example, we need to establish a more generalizable recipe for
computing standard errors that accounts for potential linkages among the parameters.
Second derivatives are obtained by applying differential calculus rules to the first
derivative expressions (e.g., differentiating Equation 2.7 with respect to μ gives Equa-
tion 2.11). To get the association between two parameters, you differentiate the first
derivative expression for one parameter with respect to a different parameter. For exam-
ple, to get the covariance between μ and σ2, you differentiate the slope expression from
Equation 2.7 with respect to σ2 (or equivalently, differentiate the slope expression from
Equation 2.9 with respect to μ). The cross-­product derivative expression for this exam-
ple is as follows:
\frac{\partial^2 LL}{\partial \mu \, \partial \sigma^2} = \frac{\partial^2 LL}{\partial \sigma^2 \, \partial \mu} = -\left(\sigma^2\right)^{-2} \sum_{i=1}^{N} \left(Y_i - \mu\right) \qquad (2.15)

The left side of the equation reads “first differentiate the log-­likelihood with respect to
the mean, then differentiate the resulting expression with respect to the variance” (or
vice versa).
Next, the second derivatives and the cross-­product terms are stored in a symmetric
matrix known as the Hessian.

H_O(\theta) = \begin{bmatrix} -\dfrac{N}{\sigma^2} & -\left(\sigma^2\right)^{-2} \displaystyle\sum_{i=1}^{N} \left(Y_i - \mu\right) \\ -\left(\sigma^2\right)^{-2} \displaystyle\sum_{i=1}^{N} \left(Y_i - \mu\right) & \dfrac{N}{2}\left(\sigma^2\right)^{-2} - \left(\sigma^2\right)^{-3} \displaystyle\sum_{i=1}^{N} \left(Y_i - \mu\right)^2 \end{bmatrix} \qquad (2.16)

Notice that the diagonal elements contain the second derivatives from Equations 2.11
and 2.12, and the new addition from Equation 2.15 appears in the off-­diagonal elements.
The subscript on HO indicates that the derivative equations depend on the observed
data (an alternate approach described below replaces data values with the expectations
or averages), and θ denotes the parameter values. Substituting the maximum likelihood
estimates into the expressions gives HO(θ̂).
Computing standard errors involves the same three steps as before. First, multiply-
ing the matrix of second derivatives by –1 gives the observed information matrix.

I_O\left(\hat{\theta}\right) = -H_O\left(\hat{\theta}\right) \qquad (2.17)

As before, this step rescales the derivatives so that large positive values reflect greater
precision or confidence in the estimates. Second, taking the inverse of the information
matrix (the matrix analogue of a reciprocal) gives the variance–­covariance matrix of
the parameter estimates.

\hat{S}\left(\hat{\theta}\right) = I_O^{-1}\left(\hat{\theta}\right) \qquad (2.18)

The parameter covariance matrix for the univariate analysis has sampling variances on
the diagonal and the covariance between the two estimates in the off-­diagonal elements.

\hat{S}\left(\hat{\theta}\right) = \begin{bmatrix} \mathrm{var}\left(\hat{\mu}\right) & \mathrm{cov}\left(\hat{\mu}, \hat{\sigma}^2\right) \\ \mathrm{cov}\left(\hat{\sigma}^2, \hat{\mu}\right) & \mathrm{var}\left(\hat{\sigma}^2\right) \end{bmatrix} = \begin{bmatrix} 0.35 & 0 \\ 0 & 61.55 \end{bmatrix} \qquad (2.19)

You can see that the covariance between the mean and variance is 0, because the devia-
tion scores in the Hessian’s off-­diagonal sum to 0. The independence of the mean and
variance (or more generally, a model’s mean parameters and its variance–­covariance
parameters) is a well-known feature of maximum likelihood estimation. As you will
see in the next chapter, this independence doesn’t necessarily hold with missing data
(Kenward & Molenberghs, 1998; Savalei, 2010). Finally, taking the square root of the sampling variances on the diagonal of the variance–covariance matrix gives the standard errors (e.g., SE_μ̂ = √0.35 = 0.59 and SE_σ̂² = √61.55 = 7.85).
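The full matrix recipe can be sketched in code as well. The Python fragment below (simulated data in place of the posttest scores, plain lists instead of a matrix library) builds the 2 × 2 Hessian from Equations 2.11, 2.12, and 2.15, negates it, inverts it, and reads the standard errors off the diagonal; the variable names are illustrative.

```python
import math
import random

# Simulated stand-in data; plain lists are used instead of a matrix library
random.seed(5)
y = [random.gauss(56.79, math.sqrt(87.72)) for _ in range(250)]
N = len(y)
mu = sum(y) / N
var = sum((yi - mu) ** 2 for yi in y) / N

# Hessian elements from Equations 2.11, 2.12, and 2.15 at the ML estimates
h_mm = -N / var
h_vv = (N / 2) * var ** -2 - var ** -3 * sum((yi - mu) ** 2 for yi in y)
h_mv = -(var ** -2) * sum(yi - mu for yi in y)  # essentially zero at the peak

# Step 1 (Equation 2.17): negate the Hessian to get observed information
i_mm, i_vv, i_mv = -h_mm, -h_vv, -h_mv

# Step 2 (Equation 2.18): invert the 2-by-2 information matrix
det = i_mm * i_vv - i_mv ** 2
cov = [[i_vv / det, -i_mv / det],
       [-i_mv / det, i_mm / det]]

# Step 3: square roots of the diagonal elements are the standard errors
se_mu, se_var = math.sqrt(cov[0][0]), math.sqrt(cov[1][1])
print(round(se_mu, 2), round(se_var, 2))
```

Because the deviation scores in the off-diagonal sum to zero, the inversion reproduces the univariate results var(μ̂) = σ̂²/N and var(σ̂²) = 2(σ̂²)²/N.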

Standard Errors Based on Expected Information


The observed information matrix is so named, because individual elements of the Hes-
sian matrix include deviation scores that rely on observed data values. Although this
is usually the preferable way to compute standard errors, an alternative method based
on the expected information matrix warrants brief discussion. With complete data, the

observed and expected information are often equivalent and produce identical stan-
dard errors. However, the two approaches are not always the same with missing data
(­Kenward & Molenberghs, 1998; Savalei, 2010).
Revisiting the Hessian matrix in Equation 2.16, the second derivatives reflect sum-
mations across the N scores. To see how expected information works, it is useful to look
at a single observation’s contribution to these sums.

 −1 
N  −σ2 ( Yi − μ ) 
σ2
HO ( θ ) =∑   (2.20)
i =1  −σ 2 Y − μ
1 2
( ) ( ) 2
−2 2 −3
 ( i ) σ − σ ( Yi − μ ) 
 2 

The expected information matrix invokes a computational shortcut that replaces (Y_i − μ) and (Y_i − μ)² with their expectations or long-run averages.

E\left(Y_i - \mu\right) = 0 \qquad E\left(Y_i - \mu\right)^2 = \sigma^2 \qquad (2.21)

Substituting the expectations simplifies the Hessian as follows:

 1 
N  − σ2  0
HE ( θ) = ∑   (2.22)
i =1  0


− σ
2
( )
1 2 −2 

Substituting the maximum likelihood estimates into the Hessian and multiplying the
matrix by –1 gives the expected information matrix.

I_E\left(\hat{\theta}\right) = -H_E\left(\hat{\theta}\right) = \begin{bmatrix} \dfrac{N}{\hat{\sigma}^2} & 0 \\ 0 & \dfrac{N}{2}\left(\hat{\sigma}^2\right)^{-2} \end{bmatrix} \qquad (2.23)

Finally, taking the inverse of the information matrix gives the variance–­covariance
matrix of the estimates, the diagonal of which contains squared standard errors.
As you can see, the expected information is simpler to compute, because it does
not rely on the raw data. With complete data, standard errors based on the observed
and expected information are often indistinguishable, as they are in this example. This
equality doesn’t necessarily hold with missing data, as the expectations in Equation 2.21
require an MCAR process where missingness is unrelated to the data. In contrast, stan-
dard errors based on the observed information assume a less stringent MAR mechanism
where missingness depends on the observed data. Simulation results favor standard
errors based on observed information (Kenward & Molenberghs, 1998; Savalei, 2010),
so I strictly rely on this approach.
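A quick numerical comparison illustrates this equivalence with complete data. In the Python sketch below (simulated data, illustrative names), the observed information for the variance, computed from Equation 2.12, matches the expected information from Equation 2.23 when both are evaluated at the maximum likelihood estimates.

```python
import math
import random

# Simulated complete data; illustrative names
random.seed(6)
y = [random.gauss(56.79, math.sqrt(87.72)) for _ in range(250)]
N = len(y)
mu = sum(y) / N
var = sum((yi - mu) ** 2 for yi in y) / N

# Observed information for the variance: negate Equation 2.12 at the estimates
obs_info_var = -((N / 2) * var ** -2 - var ** -3 * sum((yi - mu) ** 2 for yi in y))

# Expected information for the variance (Equation 2.23): no raw data required
exp_info_var = (N / 2) * var ** -2

# With complete data evaluated at the ML estimates, the two coincide
print(round(obs_info_var, 6), round(exp_info_var, 6))
```

The equality arises because the sum of squared deviations equals Nσ̂² at the maximum likelihood estimates; with missing data, that substitution requires the MCAR expectations in Equation 2.21, which is why the two approaches can diverge.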

2.8 ALTERNATIVE APPROACHES TO ESTIMATING


STANDARD ERRORS

The normal curve plays an integral role in every phase of maximum likelihood estima-
tion, as its log-­likelihood function provides a basis for identifying the optimal estimates
for the data and computing standard errors. Of course, non-­normal data are exceedingly
common, and some authors argue that normality is the exception rather than the rule
(Micceri, 1989). Depending on the analysis model, maximum likelihood estimates may
still be consistent when normality is violated, meaning that they converge to their true
population values as the sample size increases (Yuan, 2009b; Yuan & Bentler, 2010).
However, standard errors and significance tests are almost certainly compromised.
This section describes two alternate (and very different) strategies for estimating
sampling variation when normality is violated: so-­called “robust” or sandwich estimator
standard errors (Freedman, 2006; Greene, 2017; White, 1980) and bootstrap resampling
(Efron, 1987; Efron & Gong, 1983; Efron & Tibshirani, 1993). These methods have a
long history in the literature and a substantial body of literature that generally supports
their use (Arminger & Sobel, 1990; Enders, 2001; Finch, West, & MacKinnon, 1997;
Gold & Bentler, 2000; Hancock & Liu, 2012; Rhemtulla, Brosseau-­Liard, & Savalei,
2012; Savalei & Falk, 2014; Yuan, 2009b; Yuan & Bentler, 2000, 2010; Yuan, Bentler,
& Zhang, 2005; Yuan, Yang-­Wallentin, & Bentler, 2012). I discuss analogous corrective
procedures for significance tests later in the chapter.

Robust Standard Errors


The previous standard error formulation assumes that the model—­ including the
assumed population distribution—­is correctly specified. We can and often do apply
maximum likelihood to non-­normal or heteroscedastic data, in which case the estima-
tion procedure is known as quasi-­maximum likelihood or pseudo maximum likeli-
hood estimation (Gourieroux, Monfort, & Trognon, 1984; Greene, 2017; White, 1996).
Depending on the analysis model, pseudo maximum likelihood estimation may still
provide consistent estimates that converge to the true population values as the sample
size gets larger (Yuan, 2009b; Yuan & Bentler, 2010), but the usual expressions for stan-
dard errors are invalid. Alternative standard error expressions for misspecified models
are widely referred to as robust standard errors or sandwich estimator standard errors.
Robust or sandwich estimator standard errors are a family of procedures that
attempt to adjust for different types of model misspecification. For example, the stan-
dard errors I outline below are designed for distributional misspecifications but do not
address independence violations resulting from clustered data (e.g., repeated measure-
ments nested in persons, students nested within schools); different types of misspecifi-
cations require different corrective procedures. I give a brief description of robust stan-
dard errors for non-­normal data in this section, and several good tutorial papers are
available to readers who want additional details (Freedman, 2006; Hayes & Cai, 2007;
Savalei, 2014).
The term sandwich estimator stems from the fact that the “robustified” parameter covariance matrix has a three-part structure that resembles a sandwich. The normal-theory

covariance matrix from Equation 2.18 forms the outer pieces of “bread,” and the “meat”
in the middle of the sandwich is a new matrix that captures deviations between the data
and the assumed normal distribution. The sandwich estimator covariance matrix is

\hat{S}\left(\hat{\theta}\right) = \text{bread} \times \text{meat} \times \text{bread} = I_O^{-1}\left(\hat{\theta}\right) \, \hat{S}_S\left(\hat{\theta}\right) \, I_O^{-1}\left(\hat{\theta}\right) \qquad (2.24)

where IO(θ) is the information matrix from Equation 2.17, and the meat in the middle
term is a new covariance matrix based on first derivatives (described below).
Revisiting Equations 2.7 and 2.9, the first derivative or slope expressions reflect
summations across the N scores. To illustrate the composition of the meat term, we need
to look at a single observation’s contribution to these equations. Arranging the terms in
an array gives the so-­called score vector for a single observation.


S_i(\theta) = \begin{bmatrix} \left(\sigma^2\right)^{-1} \left(Y_i - \mu\right) \\ -\dfrac{1}{2}\left(\sigma^2\right)^{-1} + \dfrac{1}{2}\left(\sigma^2\right)^{-2} \left(Y_i - \mu\right)^2 \end{bmatrix} \qquad (2.25)
The meat of the sandwich is the variance–covariance matrix of these score vectors evaluated at the maximum likelihood estimates (i.e., the Ŝ_S(θ̂) term in Equation 2.24). To understand how the formula works, you need to know that I_O(θ̂) and Ŝ_S(θ̂) both estimate the information matrix, albeit in different ways. When the data are normal, the two
matrices are equivalent and effectively cancel out when multiplying one by the inverse
of the other (the resulting product is an inert identity matrix), leaving only the normal-­
theory covariance matrix from Equation 2.18. In contrast, when the data are non-­normal,
the product of the two matrices has diagonal elements that reflect the relative magnitude
of the two information matrices, and this array serves to rescale the parameter covari-
ance matrix in a way that compensates for kurtosis. Returning to the score vector in
Equation 2.25, notice that the first derivative expressions include deviation scores. When
the data are leptokurtic, the thicker tails produce a higher proportion of large deviation
scores than a normal curve, and multiplying the first piece of bread by the meat returns
a matrix containing large diagonal values that inflate the parameter covariance matrix
(the rightmost piece of bread). In contrast, when the data are platykurtic, the distribution
has fewer extreme scores than a normal curve, and the bread × meat product returns a
matrix with fractional values that attenuate the covariance matrix elements.
Recall from the earlier example that the normal-theory standard errors for the mean and variance were SE_μ̂ = √0.35 = 0.59 and SE_σ̂² = √61.55 = 7.85, respectively. The sandwich estimator covariance matrix for the same data is as follows:


\hat{S}\left(\hat{\theta}\right) = \begin{bmatrix} \mathrm{var}\left(\hat{\mu}\right) & \mathrm{cov}\left(\hat{\mu}, \hat{\sigma}^2\right) \\ \mathrm{cov}\left(\hat{\sigma}^2, \hat{\mu}\right) & \mathrm{var}\left(\hat{\sigma}^2\right) \end{bmatrix} = \begin{bmatrix} 0.35 & 0.17 \\ 0.17 & 60.30 \end{bmatrix} \qquad (2.26)
Taking the square root of the diagonal elements gives SE_μ̂ = √0.35 = 0.59 and SE_σ̂² = √60.30 = 7.77. This example highlights two points. First, the standard error of the mean is the same in both cases, because this parameter is unaffected by the robustification process (White, 1982; Yuan et al., 2005). Second, the standard error of the variance barely
cess (White, 1982; Yuan et al., 2005). Second, the standard error of the variance barely
changes, because the data are essentially normal (as noted previously, the sandwich
estimator simplifies to the conventional covariance matrix in this case). More generally,
a divergence between the two covariance matrices would likely signal a model mis-
specification (e.g., the normal distribution is a poor approximation for the data; King &
Roberts, 2015; White, 1982).
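To make the sandwich logic concrete, the Python sketch below applies a scalar version of Equation 2.24 to the variance parameter alone (a simplification of the full matrix computation), using deliberately heavy-tailed simulated data so the correction is visible; the cubed normal draws are an illustrative choice, not the book's data.

```python
import math
import random

# Deliberately heavy-tailed simulated data (cubed normal draws, an illustrative
# choice) so the kurtosis correction is visible; not the book's data
random.seed(7)
y = [random.gauss(0, 1) ** 3 for _ in range(1000)]
N = len(y)
mu = sum(y) / N
var = sum((yi - mu) ** 2 for yi in y) / N

# Bread: normal-theory sampling variance of the ML variance, 2 * var^2 / N
bread = 2 * var ** 2 / N

# Meat: sum of squared score contributions for the variance
# (bottom row of Equation 2.25, evaluated at the estimates)
meat = sum((-0.5 / var + 0.5 * (yi - mu) ** 2 / var ** 2) ** 2 for yi in y)

# Equation 2.24, scalar version: bread * meat * bread
sandwich_se = math.sqrt(bread * meat * bread)
naive_se = math.sqrt(bread)

# Heavy tails inflate the robust standard error relative to the normal-theory one
print(round(naive_se, 3), round(sandwich_se, 3))
```

With normal data the meat term approximately cancels one piece of bread and the two standard errors coincide, mirroring the essentially normal posttest example above; with leptokurtic data the robust standard error is noticeably larger.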

Bootstrap Resampling
Bootstrap resampling (Efron, 1987; Efron & Gong, 1983; Efron & Tibshirani, 1993) is a
second approach to generating standard errors that are robust to normality violations.
The bootstrap uses Monte Carlo computer simulation to generate an empirical sampling
distribution of each parameter estimate, the standard deviation of which is the stan-
dard error. This section describes a so-­called “naive bootstrap” that generates standard
errors, and modifications to the basic procedure can also generate sampling distribu-
tions of test statistics (Beran & Srivastava, 1985; Bollen & Stine, 1992; Enders, 2002;
Hancock & Liu, 2012; Savalei & Yuan, 2009).
The basic idea behind the bootstrap is to treat the observed data as a surrogate
for the population and draw B samples of size N with replacement; that is, after being
selected for a bootstrap sample, each observation returns to the surrogate population
and is eligible to be chosen again. The sampling with replacement scheme ensures that
some data records appear more than once in each sample, whereas others do not appear
at all. To illustrate, Table 2.2 shows five bootstrap samples from a small toy data set with
10 observations. Drawing many bootstrap samples (e.g., B > 2,000) and fitting a model
to each data set gives an empirical sampling distribution of the estimates. The standard
deviation of the B estimates is the bootstrap standard error

SE_{\hat{\theta}} = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\theta}_b - \bar{\theta}\right)^2} = SD_{\hat{\theta}} \qquad (2.27)

TABLE 2.2. Five Bootstrap Samples Drawn with Replacement


Y Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
63 71 49 61 71 71
53 71 55 61 38 54
71 57 63 49 55 71
57 55 49 63 63 38
55 61 38 57 63 49
54 51 53 57 61 61
49 71 71 57 53 55
61 71 61 51 57 54
51 61 55 54 57 38
38 38 51 61 53 53

where θ̂_b is the maximum likelihood estimate from sample b, and θ̄ is the average estimate across the B samples. Finally, the 2.5 and 97.5% quantiles of the empirical distribution (i.e., the estimates that separate the most extreme 2.5% of the lower and upper tails
of the distribution) define a 95% confidence interval. Unlike their theoretical counter-
parts, bootstrap confidence intervals need not be symmetric around the average point
estimate.
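The resampling scheme takes only a few lines of code. The Python sketch below (simulated stand-in data; the book's companion programs are in R) draws B = 2,000 bootstrap samples, computes the mean of each, and applies Equation 2.27; the percentile interval at the end uses the 2.5% and 97.5% quantiles.

```python
import math
import random
import statistics

# Simulated stand-in for the math posttest scores
random.seed(8)
y = [random.gauss(56.79, math.sqrt(87.72)) for _ in range(250)]
N = len(y)
B = 2000

# Draw B samples of size N with replacement; save each sample's mean
boot_means = []
for _ in range(B):
    sample = random.choices(y, k=N)
    boot_means.append(sum(sample) / N)

# Equation 2.27: the standard deviation of the B estimates is the bootstrap SE
boot_se = statistics.stdev(boot_means)

# Normal-theory standard error of the mean for comparison
mu_hat = sum(y) / N
theory_se = math.sqrt(sum((yi - mu_hat) ** 2 for yi in y) / N / N)

# Percentile 95% interval: the 2.5% and 97.5% quantiles of the B estimates
boot_means.sort()
ci = (boot_means[int(0.025 * B)], boot_means[int(0.975 * B)])
print(round(boot_se, 3), round(theory_se, 3))
```

For these roughly normal data the bootstrap and normal-theory standard errors nearly coincide; the payoff of the bootstrap comes with non-normal data, where the empirical interval can be asymmetric around the point estimate.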

2.9 ITERATIVE OPTIMIZATION ALGORITHMS

This chapter focuses primarily on analyses with analytic solutions for the maximum
likelihood estimates. Beyond the univariate example, this includes linear regression
models and multivariate analyses involving a mean vector and covariance matrix.
Many, if not most, applications of maximum likelihood do not have analytic solutions, and even the tidy problems from this chapter become messy later with missing data. In those situations, iterative optimization algorithms locate the estimates numerically. I describe two such algorithms in this chapter, gradient ascent and Newton's method, and in Chapter 3, I describe the expectation maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977; Rubin, 1991).
Returning to the log-­likelihood function in Figure 2.7, an optimization algorithm
tasked with finding the maximum likelihood estimates is like a hiker trying to reach
the summit of a mountain. The hiker could start the trek at different trailheads, and that
starting point would dictate the direction of travel and rate of ascent. Similarly, optimi-
zation algorithms need initial guesses about the parameter values, and software defaults
could generate starting values on either side of the hill. The first derivative is like a compass
in the sense that its sign tells the algorithm the direction it needs to travel to reach the
curve’s maximum elevation. For example, starting the climb at μ = 45 requires positive
adjustments to the parameter, whereas starting at μ = 65 requires negative adjustments.
The starting coordinates also dictate the size of the hiker’s steps. If the trek begins far
from the peak, the hiker can take big steps without worrying about missing the summit.
In contrast, the surface flattens near the top where very tiny steps are needed to find the
exact location of the peak. The size of each step links to the magnitude of the derivatives
in Figure 2.9, with larger slopes inducing bigger steps, and slopes closer to 0 requiring
very small steps.

Gradient Ascent
Gradient ascent (or equivalently, gradient descent, if you invert the log-­likelihood func-
tion) is a good starting point for exploring iterative optimization, because it parallels
the hiking analogy. Starting with an initial guess about the parameter, the algorithm
takes repeated steps in the direction of the maximum until it finds the optimal estimate for
the data. The iterative recipe for gradient ascent is straightforward: At each iteration,
compute an updated estimate that equals the previous estimate plus some adjustment,
the size of which depends on the first derivative or slope. More formally, the updating
step is

new estimate = current estimate + step size    (2.28)

$$\theta^{(t+1)} = \theta^{(t)} + \left( \frac{\partial LL}{\partial \theta} \times \text{constant} \right)$$

where θ denotes the parameter of interest, t indexes the iterations, and the step size term
in parentheses is the first derivative (evaluated at the current estimate) times a small
constant, sometimes referred to as the learning rate.
To illustrate iterative optimization, I applied gradient ascent to the mean (to keep
the illustration simple, I held the variance at its maximum likelihood estimate). A cus-
tom R program is available on the companion website for readers interested in coding
the algorithm by hand. To begin, I initiated the process with a starting value of μ(0) = 0
and a constant learning rate of .25 (the constant is usually some small value between 0
and 1). Substituting the initial parameter value into the first derivative expression from
Equation 2.7 (i.e., evaluating the function at μ = 0) gives a slope equal to 161.86. The
huge positive slope implies a correspondingly large positive adjustment to the param-
eter. Multiplying the derivative by the learning rate gives a step size equal to 161.86 ×
.25 = 40.47 and an updated parameter value equal to μ(1) = 40.47. The new estimate is
closer to the peak, so the slope coefficient decreases in magnitude to 46.53. Repeating
the process gives a step size equal to 46.53 × .25 = 11.63 and an updated estimate equal
to μ(2) = 52.10.
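The iterative history in Table 2.3 is easy to replicate. The Python sketch below substitutes summary statistics for the raw data (the derivative depends only on the sample mean); the values N = 250, ȳ = 56.792, and σ² = 87.7167 are assumptions chosen to reproduce the chapter's numbers, and the companion website's R program remains the authoritative version.

```python
# Gradient ascent for the mean of a normal distribution, holding the variance
# fixed at its ML estimate. N, ybar, and sigma2 are assumed summary values
# that reproduce the slopes reported in the text (e.g., 161.86 at mu = 0).
N = 250
ybar = 56.792
sigma2 = 87.716736

def slope(mu):
    # First derivative of the log-likelihood: sum(y - mu) / sigma2 = N(ybar - mu) / sigma2
    return N * (ybar - mu) / sigma2

mu = 0.0                   # starting value mu(0) = 0
learning_rate = 0.25
while True:
    step = learning_rate * slope(mu)
    mu += step
    if abs(step) < 1e-6:   # stop when consecutive estimates barely change
        break

print(round(mu, 5))        # converges to the ML estimate, about 56.792
```

Each pass through the loop is one row of Table 2.3: a large derivative produces a large step, and the steps shrink geometrically as the hiker nears the summit.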
Table 2.3 gives the parameter updates, first derivatives, and log-­likelihood values
from 17 iterations. As you can see, the first few cycles produced steep slope coefficients

TABLE 2.3. Iterative Updates from a Gradient Ascent Algorithm


Iteration μ Slope Log-likelihood
0 0.0000000 161.8619279 –5510.230027773
1 40.4654820 46.5319355 –1293.850964489
2 52.0984659 13.3769630 –945.391338786
3 55.4427066   3.8455985 –916.593142769
4 56.4041062   1.1055295 –914.213136500
5 56.6804886   0.3178167 –914.016442587
6 56.7599428   0.0913657 –914.000186960
7 56.7827842   0.0262657 –913.998843525
8 56.7893507   0.0075509 –913.998732498
9 56.7912384   0.0021707 –913.998723322
10 56.7917810   0.0006240 –913.998722564
11 56.7919371   0.0001794 –913.998722501
12 56.7919819   0.0000516 –913.998722496
13 56.7919948   0.0000148 –913.998722496
14 56.7919985   0.0000043 –913.998722496
15 56.7919996   0.0000012 –913.998722496
16 56.7919999   0.0000004 –913.998722496

and large adjustments to the parameter. The vertical elevation of the log-­likelihood also
increased rapidly as the algorithm took large strides toward the peak. In contrast, the
final few iterations induced very small adjustments to the mean, and changes to the log-­
likelihood were in the 10th decimal. Continuing to iterate until the derivative equals 0 is
inefficient and unnecessary, because any additional improvement to the estimate would
be infinitesimally small (e.g., after 17 iterations, the estimate is changing in the seventh
decimal place). Instead, I terminated the iterations when the estimates from consecutive
steps differed by less than .000001, as changes of this magnitude effectively signal that
the algorithm has reached the summit.

Newton’s Algorithm
Gradient ascent is useful for establishing some intuition about iterative optimization,
but the simple variant I describe here can be slow to converge and may not converge
at all when variables have different scales. Newton’s algorithm (also known as the
Newton–­Raphson algorithm) similarly parallels the hiking analogy, but it uses a more
complex formulation for the step size that requires first and second derivatives. The
upside of this additional complexity is that the updating step naturally provides the
building blocks for computing standard errors after the final iteration. To illustrate the
basic ideas, reconsider the log-­likelihood function with respect to the variance in Figure
2.8. Although the log-­likelihood is a complex curve with multiple bends, magnifying a
graph of the function at its maximum would reveal a simpler curved line that resembles
a quadratic function (i.e., an inverted U, or a parabola). Leveraging this idea, Newton’s
algorithm uses the first and second derivative values (i.e., the linear slope and curvature
at a specific point on the function) to construct a parabolic curve that extends from the
current parameter value toward the log-­likelihood’s peak. The apex of each quadratic
function represents the algorithm’s best guess about the maximum likelihood estimate
at a particular iteration, and this temporary peak becomes the updated parameter value
for the next iteration.
Figure 2.12 shows the log-­likelihood function, with black dots denoting four con-
secutive parameter values. The three dashed lines are quadratic curves assembled from
the first and second derivative formulas. To illustrate the iterative updates, suppose that
the optimizer begins its ascent from a starting value of σ2(0) = 50. A black dot appears
on the log-­likelihood function at this coordinate, and the leftmost dashed curve (the
smallest of the three) is the parabolic function that projects from the starting value. The
dashed curve is trying to approximate what the log-­likelihood function looks like near
its summit, and the apex of the curve represents the parabola’s best guess about the
maximum likelihood estimate at the initial iteration. The peak of the quadratic curve,
located at σ2(1) = 65.03, becomes the new estimate for the next iteration. Repeating the
process, the algorithm substitutes the updated estimate into the first and second deriva-
tive expressions and uses the resulting quantities to project another quadratic function
from the new coordinate. The middle of the three dashed curves shows the parabola for
this step, the peak of which is located at σ2(2) = 78.40. Similarly, the rightmost dashed
curve shows the quadratic approximation at the third iteration, the maximum of which
corresponds to σ2(3) = 85.93. You can see that the dashed curves become wider and flatter as elevation increases, such that each successive update does an increasingly better job at approximating the shape of the log-likelihood function near its peak. After a few more iterations, the algorithm locates the summit.

FIGURE 2.12. The likelihood function with respect to the variance, holding the mean constant at its maximum likelihood estimate. The black dots represent four consecutive updates to the variance beginning at the starting value σ2(0) = 50. The three dashed lines are quadratic curves assembled from the first and second derivative formulas, and the peak of each parabola identifies the updated parameter value at the next iteration.
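The parabola peaks in Figure 2.12 can be reproduced with a few lines of code. In the Python sketch below, N = 250 and the sum of squared deviations are assumptions chosen so that the variance's ML estimate equals 87.72; each Newton step divides the first derivative by the second derivative, as described next in Equation 2.29.

```python
# Newton's method for the variance, holding the mean at its ML estimate.
# N and ss are assumed values that reproduce the updates in Figure 2.12.
N = 250
ss = N * 87.716736   # sum of squared deviations around the ML mean

def d1(var):
    # First derivative of the log-likelihood with respect to the variance
    return -N / (2 * var) + ss / (2 * var ** 2)

def d2(var):
    # Second derivative (the curvature of the log-likelihood)
    return N / (2 * var ** 2) - ss / var ** 3

var = 50.0           # starting value from Figure 2.12
history = [var]
while True:
    step = d1(var) / d2(var)
    var -= step      # Newton update: current estimate - (d2)^-1 * d1
    history.append(var)
    if abs(step) < 1e-6:
        break

print([round(v, 2) for v in history[:4]])  # [50.0, 65.03, 78.4, 85.93]
```

The first three updates land at the parabola peaks shown in the figure, and the algorithm reaches the summit a few iterations later.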
More formally, the jump from the current to the updated parameter value is as fol-
lows:

new estimate = current estimate + step size    (2.29)

$$\theta^{(t+1)} = \theta^{(t)} - \left( \frac{\partial^2 LL}{\partial \theta^2} \right)^{-1} \left( \frac{\partial LL}{\partial \theta} \right)$$
The step size, computed as the ratio of the first and second derivatives at the current
parameter value θt, corresponds to the horizontal distance between the current estimate
and the peak of the projected quadratic curve. In effect, Newton’s algorithm is breaking
the total vertical elevation into several smaller hikes, and the derivative terms function
as a wayfinder that plots the route to each intermediate peak. The updating step readily
extends to more complex models with multiple parameters. In this case, the multivari-
ate updating equation is

$$\theta^{(t+1)} = \theta^{(t)} - \mathbf{H}^{-1}\!\left(\theta^{(t)}\right) \mathbf{S}\!\left(\theta^{(t)}\right) \quad (2.30)$$

where θ is a vector of parameter values, t indexes the iterations, S(θ(t)) is the vector of first derivatives (i.e., the score vector, computed as the sum of Equation 2.25 across all N observations), and H(θ(t)) is the Hessian matrix of second derivatives.
To illustrate a multivariate optimization scheme, I used Newton’s algorithm to esti-
mate the mean and variance of the math posttest scores. A custom R program is available
on the companion website for readers interested in coding the algorithm by hand. In this
example, S(θ) is a vector containing the slope expressions from Equations 2.7 and 2.9,
and H(θ) is the second derivative matrix from Equation 2.16. The multivariate updating
scheme is virtually identical to the univariate scheme depicted in Figure 2.12, except
that each parameter’s parabolic approximation now accounts for the associations in the
Hessian’s off-­diagonal. Table 2.4 shows the iterative updates from a climb initiated at
(terrible) starting values of μ(0) = 0 and σ2(0) = 1. Notice that the algorithm immediately
locates the optimal estimate of the mean after the first update. Returning to Figure
2.7, the log-­likelihood with respect to the mean is itself a parabolic function, so the
optimizer can immediately predict the peak of the function from any starting value. In
contrast, the algorithm requires 17 iterations to locate the optimal value of the variance.
Consistent with gradient ascent, you can see that the optimizer makes large adjustments
at first and very small alterations as it approaches the peak.

TABLE 2.4. Iterative Updates from Newton’s Algorithm


Iteration μ σ2 Log-likelihood
0 0 1 –414360.734633301
1 56.79200000 1.49992453 –7590.507280671990
2 56.79200000 2.24341946 –5218.181032416420
3 56.79200000 3.35059911 –3653.304333842760
4 56.79200000 4.99327916 –2626.616256992080
5 56.79200000 7.41677626 –1958.552805664100
6 56.79200000 10.96146467 –1529.318174000000
7 56.79200000 16.07692601 –1258.915758159560
8 56.79200000 23.30441653 –1093.809160106400
9 56.79200000 33.17164090 –997.987688507726
10 56.79200000 45.89009316 –946.947360485552
11 56.79200000 60.70697190 –923.606989421236
12 56.79200000 74.99905776 –915.615473776683
13 56.79200000 84.49594080 –914.087289046441
14 56.79200000 87.48858990 –913.999146772176
15 56.79200000 87.71555229 –913.998722506935
16 56.79200000 87.71673597 –913.998722495553
17 56.79200000 87.71673600 –913.998722495553

2.10 LINEAR REGRESSION

This section extends maximum likelihood estimation to a multiple regression analysis.


As you will see, the previous concepts readily generalize to this analysis with virtu-
ally no modifications, because estimation still relies on the univariate normal curve. A
single-­predictor model is a useful starting point, because the log-­likelihood function for
the coefficients can be visualized in a three-­dimensional graph. The simple regression
model is

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = E(Y_i \mid X_i) + \varepsilon_i \quad (2.31)$$
$$Y_i \sim N_1\!\left( E(Y_i \mid X_i), \sigma^2_\varepsilon \right)$$
where E(Y|X) is a predicted value (i.e., the expected value or mean of Y given a particular
X score), the tilde means “distributed as,” N1 denotes the univariate normal distribution
function (i.e., the probability distribution in Equation 2.3), and the conditional mean
and residual variance inside the parentheses are the distribution’s two parameters. The
bottom row of the expression is simply stating our usual assumption that outcome scores
are normally distributed around a regression line with constant residual variation.
Switching gears to a different substantive context, I use the smoking data from
the companion website to illustrate multiple regression. The data set includes several
sociodemographic correlates of smoking intensity from a survey of N = 2,000 young
adults (e.g., age, whether a parent smoked, gender, income). To facilitate graphing, I start
with a simple regression model where the parental smoking indicator (0 = parents did not
smoke, 1 = one or both parents smoked) predicts smoking intensity (higher scores reflect
more cigarettes smoked per day):

INTENSITYi = β0 + β1 ( PARSMOKEi ) + ε i (2.32)

The intercept represents the expected smoking intensity score for a respondent whose
parents did not smoke, and the slope is the group mean difference. The analysis example
later in this section expands the model to include additional explanatory variables.

Probability Distribution and Log‑Likelihood


Linear regression leverages the univariate normal distribution function from Equation
2.3, and the only difference is that a predicted value and residual variance replace μ
and σ2, respectively. Using generic notation, the probability distribution (normal curve
equation) for the simple regression is as follows:

$$f\!\left(Y_i \mid \boldsymbol{\beta}, \sigma^2_\varepsilon, X_i\right) = \frac{1}{\sqrt{2\pi\sigma^2_\varepsilon}} \exp\left( -\frac{1}{2} \frac{\left( Y_i - E(Y_i \mid X_i) \right)^2}{\sigma^2_\varepsilon} \right) \quad (2.33)$$
To reiterate recurring notation, the function on the left side of the equation can be read
as “the relative probability of a score given assumed values for the model parameters.”

Visually, “f of Y” is the height of the conditional normal curve that describes the spread
of scores around a particular point on the regression line (e.g., the normal distribution
of smoking intensity scores for participants who share the same value of the parental
smoking indicator). The main component in the kernel is still a squared z-score, but that
quantity now represents the standardized distance between a score and its predicted
value. As before, the fraction to the left of the exponential function is a scaling term that
ensures the area under the probability distribution sums or integrates to 1. Finally, note
that explanatory variables function as fixed constants like the parameters. This feature
will change in Chapter 3, where incomplete predictors appear as variables in a prob-
ability distribution.
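To make the idea concrete, the height of the conditional normal curve is simple to compute. The snippet below is a direct translation of Equation 2.33; the predicted value and residual variance are hypothetical numbers, not output from the smoking analysis.

```python
import math

def normal_density(y, mu, var):
    # Height of a normal curve at y; for regression, mu is the predicted value
    # E(Y|X) and var is the residual variance (Equation 2.33).
    return (1 / math.sqrt(2 * math.pi * var)) * math.exp(-0.5 * (y - mu) ** 2 / var)

# Hypothetical case: a predicted smoking intensity of 12.0, residual variance
# of 17.15, and an observed score of 10 cigarettes per day
print(round(normal_density(10, 12.0, 17.15), 4))  # about 0.0857
```

Respondents whose scores fall closer to their predicted value get a larger density (a taller point on the conditional normal curve), which is what the likelihood rewards during estimation.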
As you know, maximum likelihood estimation reverses the probability distribu-
tion to get the likelihood of different combinations of population parameters given the
observed data. Taking the natural logarithm of each observation’s likelihood and sum-
ming the transformed probabilities gives a log-­likelihood function that summarizes the
data’s evidence about the coefficients and residual variance.
 
1 ( Yi − ( β0 + β1 X i ) )
N  2

LL ( β= ) ∑
, σ2ε | data ln  1
 2πσ2

exp −
 2 σ2ε

 
i =1
 ε  
N
N N
( ) ( ) ∑ ( Y − (β
1 −1 (2.34)
+ β1 X i ) )
2
=− ln ( 2π ) − ln σ2ε − σ2ε i 0
2 2 2 i =1
N N 1
( ) ( )
−1
=− ln ( 2π ) − ln σ2ε − σ2ε ( Y − Xβ )′ ( Y − Xβ )
2 2 2
The compact matrix expression in the bottom row stacks the N outcome scores into a
vector Y, and it uses X to denote a corresponding matrix that contains predictor vari-
ables and a column of ones for the intercept.
With only two coefficients, we can visualize the log-­likelihood surface of β0 and β1
in three dimensions. Figure 2.13 is a contour plot conveying the perspective of a drone
hovering over the peak of the log-­likelihood surface, with smaller contours denoting
higher elevation (and vice versa). The data’s support for the parameters increases as the
contours get smaller, and the maximum likelihood estimates of β0 and β1 are located at
the peak of the surface, shown as a black dot. The angle of the ellipses owes to the fact
that the intercept and slope coefficients are negatively correlated (i.e., the data’s support
for a larger mean difference requires concurrent support for a lower comparison group average). Identifying the optimal parameters for the data is again analogous to a hiker
climbing a mountain peak. Following the univariate example, we can derive an exact
solution or use an iterative optimization approach such as Newton’s algorithm.

Maximum Likelihood Estimates and Standard Errors


As before, the process of deriving maximum likelihood estimates and standard
errors requires the first and second derivatives of the log-likelihood function. Applying differential calculus rules to Equation 2.34 leads to the following first derivative expressions:

$$\frac{\partial LL}{\partial \boldsymbol{\beta}} = -\frac{1}{\sigma^2_\varepsilon}\left( -\mathbf{X}'\mathbf{Y} + \mathbf{X}'\mathbf{X}\boldsymbol{\beta} \right) \quad (2.35)$$

$$\frac{\partial LL}{\partial \sigma^2_\varepsilon} = -\frac{N}{2}\left(\sigma^2_\varepsilon\right)^{-1} + \frac{1}{2}\left(\sigma^2_\varepsilon\right)^{-2} (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \quad (2.36)$$

Setting these slope equations to 0 and solving for the unknown parameters at the peak of the log-likelihood surface gives the maximum likelihood estimates below.
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{Y} = \hat{\boldsymbol{\beta}}_{OLS} \quad (2.37)$$

$$\hat{\sigma}^2_\varepsilon = \frac{1}{N}\left(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)'\left(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right) \quad (2.38)$$

Notice that the coefficients are identical to those of ordinary least squares, but the resid-
ual variance differs, because the sample size is not adjusted for the number of estimates
in β̂. This matches the earlier result for the mean and variance.
From Section 2.6, you know that second derivatives quantify the curvature or steepness of the log-likelihood function near its peak (i.e., the rate at which the first-order slopes change across the range of parameter values). These second derivatives are obtained by applying differential calculus rules to Equations 2.35 and 2.36, and the Hessian collects these equations in a matrix.

FIGURE 2.13. Contour plot that conveys the perspective of a drone hovering over the peak of the log-likelihood surface for a simple regression model, with smaller contours denoting higher elevation (and vice versa). The maximum likelihood estimates of β0 and β1 are located at the peak of the surface (shown as a black dot).


$$\mathbf{H}_O(\theta) = \begin{bmatrix} -\left(\sigma^2_\varepsilon\right)^{-1}\mathbf{X}'\mathbf{X} & -\left(\sigma^2_\varepsilon\right)^{-2}\mathbf{X}'\left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right) \\ -\left(\sigma^2_\varepsilon\right)^{-2}\left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)'\mathbf{X} & \dfrac{N}{2}\left(\sigma^2_\varepsilon\right)^{-2} - \left(\sigma^2_\varepsilon\right)^{-3}\left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)'\left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right) \end{bmatrix} \quad (2.39)$$

Substituting the maximum likelihood estimates into the expression and multiplying
HO(θ̂) by –1 gives the observed information matrix, then taking its inverse (the matrix
analogue of a reciprocal) gives the variance–­covariance matrix of the parameter esti-
mates. Equations 2.17 and 2.18 depict these steps. The parameter covariance matrix
for the simple regression analysis is symmetric with three rows and columns, one per
parameter.

$$\hat{\mathbf{S}}_{\hat{\theta}} = \begin{bmatrix} \mathrm{var}(\hat{\beta}_0) & \mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1) & \mathrm{cov}(\hat{\beta}_0, \hat{\sigma}^2_\varepsilon) \\ \mathrm{cov}(\hat{\beta}_1, \hat{\beta}_0) & \mathrm{var}(\hat{\beta}_1) & \mathrm{cov}(\hat{\beta}_1, \hat{\sigma}^2_\varepsilon) \\ \mathrm{cov}(\hat{\sigma}^2_\varepsilon, \hat{\beta}_0) & \mathrm{cov}(\hat{\sigma}^2_\varepsilon, \hat{\beta}_1) & \mathrm{var}(\hat{\sigma}^2_\varepsilon) \end{bmatrix} = \begin{bmatrix} 0.018 & -0.018 & 0 \\ -0.018 & 0.039 & 0 \\ 0 & 0 & 0.365 \end{bmatrix} \quad (2.40)$$

Finally, taking the square root of the diagonal elements gives the standard errors (e.g., SEβ̂1 = √0.039 = 0.20). To establish further linkages to ordinary least squares, the expression in the upper left block of Equation 2.39 is a 2 × 2 matrix that contains derivatives with respect to the two coefficients. Multiplying this submatrix by –1 and taking its inverse gives an expression that is identical to a parameter covariance matrix from ordinary least squares regression.
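These computations are a one-liner once the parameter covariance matrix is in hand. The sketch below plugs in the numerical values from Equation 2.40:

```python
import math

# Parameter covariance matrix from Equation 2.40 (beta0, beta1, residual variance)
param_cov = [
    [0.018, -0.018, 0.000],
    [-0.018, 0.039, 0.000],
    [0.000, 0.000, 0.365],
]

# Standard errors are square roots of the diagonal elements
ses = [math.sqrt(param_cov[j][j]) for j in range(3)]
print([round(se, 2) for se in ses])  # [0.13, 0.2, 0.6]
```

The off-diagonal elements are unused here, but they reappear whenever a multiparameter Wald test needs the full covariance matrix.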

Analysis Example
To illustrate maximum likelihood estimation for multiple regression, I expanded the
previous analysis model to include age and income as predictors. I centered the addi-
tional variables at their grand means to maintain the intercept’s interpretation as the
expected smoking intensity score for a respondent whose parents did not smoke.

INTENSITYi = β0 + β1 ( PARSMOKEi ) + β2 ( AGEi − μ 2 ) + β3 ( INCOMEi − μ 3 ) + ε i (2.41)

Importantly, the smoking intensity distribution has substantial positive skewness and
kurtosis, so I used robust (sandwich estimator) standard errors and the bootstrap to
illustrate different corrective procedures. Analysis scripts are available on the compan-
ion website, including a custom R program for readers interested in coding Newton’s
algorithm by hand.
Table 2.5 shows the maximum likelihood estimates, along with ordinary least
squares results as a comparison. As expected, the two estimators produced identical

TABLE 2.5. Maximum Likelihood and Ordinary Least Squares Estimates


Maximum likelihood OLS
Parameter Est. SE RSE BSE Est. SE
β0 9.09 0.126 0.120 0.119 9.09 0.126
β1 (PARSMOKE) 2.91 0.187 0.186 0.183 2.91 0.187
β2 (AGE) 0.59 0.040 0.040 0.040 0.59 0.040
β3 (INCOME) –0.10 0.027 0.032 0.032 –0.10 0.027
σ2ε 17.15 0.542 1.673 1.685 17.18 —
R2 0.19 0.016 0.026 0.025 0.19 —

Note. RSE, robust standard errors; BSE, bootstrap standard errors.

coefficients, but the maximum likelihood residual variance is very slightly smaller,
because it does not subtract the four degrees of freedom spent estimating the coeffi-
cients. This slight difference aside, the estimates themselves have the same meaning. For
example, the intercept (β̂0 = 9.09, SE = .12) is the expected number of cigarettes smoked
per day for a respondent whose parents didn’t smoke, and the parental smoking indi-
cator slope (β̂1 = 2.91, SE = .19) is the mean difference, controlling for age and income.
The corrective procedures induced relatively minor changes to the coefficients’ standard
errors, but they had a dramatic impact on the standard error of the residual variance. As
is often the case with a reasonably large sample size, sandwich estimator and bootstrap
standard errors were effectively equivalent.

2.11 SIGNIFICANCE TESTS

Maximum likelihood estimation offers three significance testing options: the Wald test
(Wald, 1943), likelihood ratio statistic (Wilks, 1938), and the score test or Lagrange
multiplier (Rao, 1948). The latter is commonly referred to as the modification index
in structural equation modeling applications (Saris, Satorra, & Sörbom, 1987; Sörbom,
1989). I describe the first two approaches, because they are widely available in general-­
purpose software packages, and Buse (1982) provides a nice tutorial on this “trilogy of
tests” for readers who are interested in additional details.
The Wald test and likelihood ratio statistic can evaluate the same hypotheses, but
they do so in different ways. The Wald test compares the discrepancy between the esti-
mates and hypothesized parameter values (usually zeros) to sampling variation. The
simplest incarnation of the test statistic is just a z-score or chi-­square. In contrast, the
likelihood ratio statistic compares log-­likelihood values from two competing models,
the simpler of which aligns with the null hypothesis. The two tests are equivalent in
very large samples but can give markedly different answers in small to moderate samples
(Buse, 1982; Fears, Benichou, & Gail, 1996; Greene, 2017; Pawitan, 2000). These differ-
ences are sometimes attributable to the fact that the Wald test inappropriately assumes
that sampling distributions follow a normal curve, but discrepancies can arise for other

reasons that are more difficult to predict. Statistical issues aside, the likelihood ratio test
is somewhat less convenient to implement, because it requires two analyses, but this is
not a compelling disadvantage.

Wald Test
The simplest incarnation of the Wald test is the familiar z-statistic that compares the
difference between an estimate and hypothesized parameter value (e.g., θ0 = 0) to the
estimate’s standard error.
$$z = \frac{\hat{\theta} - \theta_0}{SE_{\hat{\theta}}} \quad (2.42)$$
Leveraging the large-­sample normality of maximum likelihood estimates, a standard
normal distribution generates a probability value for the test, and symmetrical confi-
dence interval limits are computed by multiplying the standard error by the appropriate
z critical value, then adding and subtracting that product (i.e., the margin of error or
half-width) to the estimate.

$$CI = \hat{\theta} \pm z_{CV} \times SE_{\hat{\theta}} \quad (2.43)$$

The z critical values for different alpha levels are available in textbooks and online (e.g.,
zCV = ±1.96 for α = .05).
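As a numerical sketch, the functions below implement Equations 2.42 and 2.43 and apply them to the parental smoking slope from Table 2.5 (estimate 2.91, standard error 0.187); the normal-curve p-value uses the error function from Python's standard library.

```python
import math

def wald_z(estimate, se, null_value=0.0):
    # Wald z-statistic (Equation 2.42) and two-tailed normal-curve p-value
    z = (estimate - null_value) / se
    p = 1 - math.erf(abs(z) / math.sqrt(2))  # equals 2 * (1 - standard normal CDF)
    return z, p

def wald_ci(estimate, se, z_cv=1.96):
    # Symmetric interval: estimate plus/minus critical value times SE (Equation 2.43)
    half_width = z_cv * se
    return estimate - half_width, estimate + half_width

z, p = wald_z(2.91, 0.187)    # parental smoking slope from Table 2.5
lo, hi = wald_ci(2.91, 0.187)
print(round(z, 2), round(lo, 2), round(hi, 2))  # 15.56 2.54 3.28
```

A z-statistic of roughly 15.6 sits far in the tail of the standard normal curve, so the p-value is effectively 0 and the interval excludes the null value.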
Squaring the z-score gives an alternative expression for the Wald statistic that
instead follows a chi-­square distribution with a single degree of freedom.

$$T_W = \left( \frac{\hat{\theta} - \theta_0}{SE_{\hat{\theta}}} \right)^2 = \frac{\left(\hat{\theta} - \theta_0\right)\left(\hat{\theta} - \theta_0\right)}{\mathrm{var}\!\left(\hat{\theta}\right)} \quad (2.44)$$

The chi-­square formulation readily generalizes to multiple parameters:

$$T_W = \left( \hat{\boldsymbol{\theta}}_Q - \boldsymbol{\theta}_0 \right)' \hat{\mathbf{S}}_{\hat{\theta}_Q}^{-1} \left( \hat{\boldsymbol{\theta}}_Q - \boldsymbol{\theta}_0 \right) \quad (2.45)$$

where θ̂Q is a vector of Q estimates, θ0 is the corresponding vector of hypothesized values (typically zeros), and Ŝθ̂Q is a variance–covariance matrix that contains Q rows and columns from the full parameter covariance matrix (or its robustified counterpart).
The numerical value of TW is the sum of squared standardized differences between the
estimates and their hypothesized values. If the null hypothesis is true, the test statis-
tic follows a central chi-­square distribution with Q degrees of freedom, and statistical
significance implies that one or more of the estimates in θ̂Q are different from their
hypothesized values.

Likelihood Ratio Test


The likelihood ratio statistic evaluates the relative fit of two nested models. Nested
models can take a variety of forms, but a common application compares the substantive

analysis to a more restrictive version of the model that fixes a subset of parameters to 0.
Returning to the earlier regression analysis, we could use the likelihood ratio statistic
to evaluate the null hypothesis that R2 = 0 by comparing the fit of the analysis model
from Equation 2.41 to that of an empty model that constrains the slope coefficients to
0. A slightly different application of the likelihood ratio test occurs in structural equa-
tion modeling analyses in which a researcher compares the fit of a saturated model
(i.e., a model that places no restrictions on the mean vector and covariance matrix) to
that of a more parsimonious analysis model that imposes a structure on the data (e.g., a
confirmatory factor analysis model). In either scenario, the simpler model with Q fewer
parameters aligns with the null hypothesis, so I denote the restricted model’s param-
eters as θ0 and the full model’s parameters as θ.
The likelihood ratio statistic is

$$T_{LR} = -2\left( LL(\hat{\theta}_0 \mid \text{data}) - LL(\hat{\theta} \mid \text{data}) \right) \quad (2.46)$$

where LL(θ̂0|data) is the sample log-­likelihood value for the restricted model (e.g., an
empty regression model with only an intercept), and LL(θ̂|data) is the log-­likelihood
for the more complex model (e.g., the full regression model). The more complex model
with additional parameters will always achieve better fit and a higher log-­likelihood,
but that improvement should be very small when the null hypothesis is true. If the two
models are equivalent in the population, the likelihood ratio statistic follows a central
chi-­square distribution with Q degrees of freedom, which in this case is the difference
between the number of parameters in the two models. A significant test statistic indi-
cates that the data provide more support for the full model than the restricted model
(e.g., one or more parameters are significantly different from zero).
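The computation itself is trivial once the two log-likelihoods are in hand. The values below are hypothetical (an intercept-only model versus a model with Q = 3 slopes); 7.81 is the familiar .05 critical value for a chi-square with 3 degrees of freedom.

```python
# Likelihood ratio test (Equation 2.46) with hypothetical log-likelihoods
ll_restricted = -5663.88   # intercept-only model (aligns with H0: R-squared = 0)
ll_full = -5457.73         # full model with Q = 3 additional slopes
Q = 3

t_lr = -2 * (ll_restricted - ll_full)
chi2_cv = 7.81             # .05 critical value for a chi-square with 3 df
print(round(t_lr, 2), t_lr > chi2_cv)  # 412.3 True -> reject the null hypothesis
```

The sign convention guarantees a positive statistic: the restricted model can never fit better than the full model, so the difference inside the parentheses is never positive.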

Robust Test Statistics


As discussed in Section 2.8, non-­normal data may or may not compromise point esti-
mates, but they certainly distort standard errors. The same is true for significance tests,
as the Wald and likelihood ratio statistics no longer follow the optimal chi-­square dis-
tribution. The Wald test is easily robustified by substituting a sandwich estimator cova-
riance matrix into Equation 2.45 (or a robust standard error into Equation 2.42). The
likelihood ratio statistic can be rescaled to more closely approximate the correct chi-­
square distribution (Satorra & Bentler, 1988; Satorra & Bentler, 1994; Yuan & Bentler,
2000), or a p-value can be obtained by referencing the biased test statistic against a boot-
strap sampling distribution that honors the distribution of the data (Beran & Srivastava,
1985; Bollen & Stine, 1992; Enders, 2002; Savalei & Yuan, 2009). I describe these two
approaches below and illustrate their application in one of the later analysis examples.
Readers familiar with structural equation models are undoubtedly familiar with
the rescaled likelihood ratio statistic, which is commonly known as the Satorra–­Bentler
chi-­square (Satorra & Bentler, 1988; Satorra & Bentler, 1994). The general procedure for
comparing two nested models involves dividing the likelihood ratio statistic by a con-
stant scaling term that largely depends on the kurtosis of the data (Satorra & Bentler,
2001; Yuan et al., 2005). The rescaled test statistic is

$$T_{SB} = \frac{T_{LR}}{c_{LR}} \quad (2.47)$$
$$c_{LR} = \frac{P_0 c_0 - P_F c_F}{P_0 - P_F}$$
where TLR is the likelihood ratio statistic from Equation 2.46, and cLR is a scaling con-
stant that combines the number of parameters in the full and restricted models, PF and
P0, respectively, and model-­specific scaling terms, cF and c0.
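In code, the rescaling is a small arithmetic step once software reports the model-specific scaling terms. The inputs below are hypothetical values, purely to show the mechanics of Equation 2.47:

```python
def satorra_bentler(t_lr, p0, c0, pf, cf):
    # Rescaled likelihood ratio statistic (Equation 2.47): divide T_LR by a
    # scaling constant built from each model's parameter count and scaling term
    c_lr = (p0 * c0 - pf * cf) / (p0 - pf)
    return t_lr / c_lr

# Hypothetical example: restricted model with 2 parameters (c0 = 1.8),
# full model with 5 parameters (cf = 1.4), and T_LR = 24.6
print(round(satorra_bentler(t_lr=24.6, p0=2, c0=1.8, pf=5, cf=1.4), 2))  # 21.71
```

With leptokurtic data the scaling constant typically exceeds 1, so the rescaled statistic is smaller than the naive one and the test becomes less liberal.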
The scaling term can be understood by revisiting the sandwich estimator cova-
riance matrix in Equation 2.24. As explained previously, the “bread × meat” product
yields a matrix with diagonal elements that reflect the relative magnitude of two infor-
mation matrices, one of which is sensitive to outlier scores. When the data are normal,
the two matrices are equivalent and cancel out when multiplying one by the inverse of
the other (the resulting product is an inert identity matrix). In contrast, when the data
are non-­normal, the resulting product contains fractional diagonal terms that can be
smaller or larger than 1, depending on the kurtosis of the data. Multiplying this matrix
by the rightmost piece of “bread” inflates or deflates elements in parameter covariance
matrix accordingly.
The rescaling terms for the likelihood ratio test also leverage discrepancies between
the two information matrices. In the simplest possible univariate application (e.g., the
analysis from Section 2.8), the scaling term is a fraction that compares a single diago-
nal value from each information matrix (Yuan et al., 2005). More generally, cF and c0
pool the elements of the "bread × meat" product into a single scalar value that rescales the test statistic to have the same expected value or mean as its optimal central chi-square distribution (Satorra & Bentler, 1988, 1994, 2001). As such, referencing TSB to
a chi-­square distribution with Q degrees of freedom gives an approximate p-value,
and a significant test statistic indicates that the data provide more support for the full
model than the restricted model (e.g., one or more parameters are significantly different
from 0).
A second option for getting a robust significance test is to use the original TLR from
Equation 2.46 but reference the test statistic against a simulation-­based bootstrap sam-
pling distribution that honors the data’s shape. This is essentially the opposite tack of
rescaling, which fixes up the test statistic and leaves the theoretical sampling distribu-
tion intact. As explained in Section 2.8, the bootstrap procedure treats the observed
data as a surrogate for the population and draws many samples of size N with replace-
ment. Fitting the analysis model to each data set produces a collection of estimates
that form empirical sampling distributions, the standard deviations of which are robust
standard errors. A slight modification is needed to apply the bootstrap to test statistics.
As you know, a probability value reflects the likelihood that the observed test statistic
originated from a hypothetical population where the null hypothesis is exactly true. To
achieve this interpretation from the bootstrap, you need to first transform the observed
data to match the null hypothesis. Returning to the multiple regression model from
Equation 2.41, a null hypothesis that R2 = 0 implies that all regression slopes equal 0.
The estimated slopes will never be exactly 0, yet the sample data must be exactly consis-
tent with this condition for the bootstrap to work properly.
Maximum Likelihood Estimation 83

Beran and Srivastava (1985) and Bollen and Stine (1992) modified the bootstrap
procedure by first applying an algebraic transformation that aligns the mean and covari-
ance structure of the data to the null hypothesis (the procedure is sometimes referred
to as the model-based bootstrap). Importantly, this transformation does not modify dis-
tribution shapes, so drawing bootstrap samples from the rescaled data gives an empiri-
cal sampling distribution that reflects the natural variation of the test statistic with
non-­normal data. A robust p-value is then obtained by computing the proportion of
bootstrap samples that give a test statistic larger than TLR from the original analysis. The
transformation expression is

  Ỹi′ = (Yi − μ̂)′ Ŝ^(−.5) Ŝ₀^(.5) + μ̂₀′        (2.48)

where Y i is the transformed data for observation i, Yi is the corresponding vector of


observed scores, μ̂ and Ŝ are the mean vector and covariance matrix of the sample data,
and μ̂0 and Ŝ0 are model-­implied mean vector and covariance matrix from the restricted
model (i.e., the model that aligns with the null hypothesis). I use Y as a generic symbol
for the analysis variables, but this vector could include predictors and outcomes. The
equation essentially applies two transformations: The (Yi – μ̂)′ Ŝ^(–.5) part of the expres-
sion “erases” the mean and the covariance structure from the data by converting the
variables to uncorrelated z-scores, and postmultiplying by Ŝ₀^(.5) and adding μ̂₀′ rescales
the z-scores to match the associations implied by the null hypothesis. Returning to the
multiple regression model from Equation 2.41, a null hypothesis that R2 = 0 would induce a transformation
where explanatory variables are correlated with each other but mutually uncorrelated
with the outcome. Applying the bootstrap procedure to the rescaled data and collecting
the B test statistics creates an empirical sampling distribution, and the robust probabil-
ity value is then the proportion of these statistics that exceed TLR, the likelihood ratio
statistic from the raw data.
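The transformation in Equation 2.48 is straightforward to prototype. The sketch below is my own minimal numpy illustration (not the book's companion code): the function name and the simulated data are assumptions, and symmetric matrix powers are computed from an eigendecomposition. The final checks confirm that the rescaled data reproduce the null model's mean vector and covariance matrix exactly, while the distribution shapes of the raw scores are preserved.

```python
import numpy as np

def bollen_stine_transform(Y, mu0, Sigma0):
    """Rescale data so its mean and covariance match the null model (Equation 2.48)."""
    mu_hat = Y.mean(axis=0)
    S_hat = np.cov(Y, rowvar=False, bias=True)  # ML estimate (N in the denominator)

    def mat_power(A, p):
        # Power of a symmetric positive definite matrix via eigendecomposition
        vals, vecs = np.linalg.eigh(A)
        return vecs @ np.diag(vals ** p) @ vecs.T

    # (Yi - mu_hat)' S^(-.5) converts the scores to uncorrelated z-scores;
    # postmultiplying by Sigma0^(.5) and adding mu0' imposes the null structure
    Z = (Y - mu_hat) @ mat_power(S_hat, -0.5)
    return Z @ mat_power(Sigma0, 0.5) + mu0

rng = np.random.default_rng(1)
Y = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.5], [0.0, 1.0]]) + np.array([3.0, 1.0])
mu0 = np.array([0.0, 0.0])                   # hypothetical null-model mean vector
Sigma0 = np.array([[1.0, 0.3], [0.3, 1.0]])  # hypothetical null-model covariance matrix

Yt = bollen_stine_transform(Y, mu0, Sigma0)
print(np.allclose(Yt.mean(axis=0), mu0))                         # True
print(np.allclose(np.cov(Yt, rowvar=False, bias=True), Sigma0))  # True
```

Bootstrap samples would then be drawn with replacement from the rows of the transformed data, and the test statistic recomputed in each sample.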

Analysis Example
Returning to the multiple regression model from Equation 2.41, I use the Wald test and
likelihood ratio statistic to evaluate the null hypothesis that R2 = 0. Both tests func-
tion like the omnibus F test from ordinary least squares in this context. To begin, the
Wald test standardizes discrepancies between the estimates and null values against the
parameter covariance matrix. The full covariance matrix is a 5 × 5 matrix, but the test
uses only the elements related to the slope coefficients. The composition of the test sta-
tistic for this example is as follows:

         ( β̂1 − 0 )′ ( var(β̂1)      cov(β̂1, β̂2)   cov(β̂1, β̂3) )⁻¹ ( β̂1 − 0 )
  TW  =  ( β̂2 − 0 )  ( cov(β̂2, β̂1)  var(β̂2)       cov(β̂2, β̂3) )   ( β̂2 − 0 )        (2.49)
         ( β̂3 − 0 )  ( cov(β̂3, β̂1)  cov(β̂3, β̂2)   var(β̂3)     )   ( β̂3 − 0 )

The diagonal elements of the middle matrix are the sampling variances (i.e., squared
standard errors), and the off-­diagonal elements capture the degree to which the esti-
mates covary across repeated samples. Substituting the appropriate estimates into the
previous expression gives TW = 481.19, the value of which represents the sum of squared
standardized differences from zero. Referencing the test statistic to a chi-­square dis-
tribution with Q = 3 degrees of freedom gives p < .001; consistent with an analogous F
test, we can conclude that at least one of the slopes is nonzero. The sandwich estimator
(robust) test statistic was markedly lower at TW = 423.15 but gave the same conclusion.
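The quadratic form in Equation 2.49 takes only a few lines of numpy/scipy. The coefficient vector and parameter covariance matrix below are hypothetical placeholders, not the estimates from this analysis; the computational logic is what the sketch illustrates.

```python
import numpy as np
from scipy import stats

# Hypothetical slope estimates and their sampling covariance matrix
beta_hat = np.array([0.55, -0.12, 0.30])
V = np.array([[0.0025, 0.0003, 0.0001],
              [0.0003, 0.0016, 0.0002],
              [0.0001, 0.0002, 0.0040]])

# T_W = (beta_hat - 0)' V^(-1) (beta_hat - 0), as in Equation 2.49
T_W = beta_hat @ np.linalg.solve(V, beta_hat)

# Reference the statistic to a chi-square distribution with Q = 3 degrees of freedom
p = stats.chi2.sf(T_W, df=3)
print(T_W > stats.chi2.ppf(0.999, df=3), p < .001)  # True True
```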
The likelihood ratio statistic evaluates the same hypothesis but requires a nested
or restricted model that aligns with the null. This secondary model is an empty regres-
sion that fixes the three slope coefficients to zero. With complete data, you can get the
restricted model log-­likelihood by constraining the slope coefficients to zero during
estimation or by excluding the explanatory variables from the analysis. Although it
makes no difference here, explicitly constraining the slopes to zero as follows is prefer-
able, because it generalizes to missing data analyses.

INTENSITYi = β0 + ( 0 ) ( PARSMOKEi ) + 0 ( AGEi − μ 2 ) + 0 ( INCOMEi − μ 3 ) + ε i (2.50)

Fitting the two models and substituting the resulting log-­likelihood values into Equa-
tion 2.46 gives the following test statistic:

    TLR = −2( LL(θ̂₀ | data) − LL(θ̂ | data) ) = −2( (−5,895.145) − (−5,679.545) ) = 431.20        (2.51)

As you can see, fixing the slopes to zero substantially decreased the log-­likelihood from
–5,679.545 to –5,895.145, indicating that the restricted model’s parameters are located
at a much lower vertical elevation on the log-­likelihood surface. Referencing the test
statistic to a chi-­square distribution with Q = 3 degrees of freedom returns a probability
value of p < .001, which, again, indicates that one or more of the slope coefficients are
nonzero. The corresponding rescaled test statistic from Equation 2.47 was markedly
lower at TSB = 173.97 (cLR = 2.48) but gave the same conclusion. Although TW and TLR
produced the same substantive conclusion, their numerical values aren’t particularly
well calibrated. This is not unusual, as the tests often require a much larger sample size
to achieve equivalence.
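The likelihood ratio computation itself is one line. The snippet below reproduces Equation 2.51 from the two log-likelihood values reported above and gets the p-value from scipy's chi-square survival function.

```python
from scipy import stats

LL_full = -5679.545        # log-likelihood of the full model
LL_restricted = -5895.145  # log-likelihood with the slopes fixed at zero

# T_LR = -2(LL0 - LL1), as in Equations 2.46 and 2.51
T_LR = -2 * (LL_restricted - LL_full)
p = stats.chi2.sf(T_LR, df=3)
print(round(T_LR, 2), p < .001)  # 431.2 True
```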

2.12 MULTIVARIATE NORMAL DATA

The multivariate normal distribution plays an important role throughout the book, and
it appears prominently in Chapter 3, where it provides a flexible framework for miss-
ing data handling. To set the stage for missing data, this section uses the distribution
as a backdrop for estimating a mean vector and variance–­covariance matrix. As you
will see, the concepts we’ve already established readily generalize to multivariate data
with virtually no modifications (although some of the equations are messier). I use the
employee data from the companion website to provide a substantive context. The data
set includes several workplace-related variables (e.g., work satisfaction, turnover inten-
tion, employee–supervisor relationship quality) for a sample of N = 630 employees.


The illustration uses a 7-point work satisfaction rating (1 = extremely dissatisfied to 7
= extremely satisfied) and two composite scores that measure employee empowerment
and a construct known as leader–member exchange (the quality of an employee’s
relationship with his or her supervisor). I treat work satisfaction as a normally distrib-
uted variable, because it has a sufficient number of response options and a symmetric
distribution (Rhemtulla et al., 2012). The Appendix gives a description of the data set
and variable definitions.

Probability Distribution and Log‑Likelihood


To tie the multivariate normal distribution back to earlier material, it is useful to cast
the analysis as three empty regression models. Using generic notation, the models are
as follows:

         ( WORKSATi )   ( μ1 )   ( ε1i )
  Yi  =  ( EMPOWERi ) = ( μ2 ) + ( ε2i ) = μ + εi        (2.52)
         ( LMXi     )   ( μ3 )   ( ε3i )

  Yi ~ N3(μ, Σ)

The bottom equation is shorthand notation to reference data that follow a multivariate
normal distribution; N3 denotes a three-­dimensional normal distribution, and the first
and second terms in parentheses are the mean vector and variance–­covariance matrix
(the multivariate distribution’s parameters).
The multivariate normal distribution function generalizes the normal curve to mul-
tiple variables. In addition to a mean and variance for each variable, the distribution also
incorporates covariances among the variables (or alternatively, correlated residuals).
To illustrate, Figure 2.14 shows an idealized bivariate normal distribution for the employee
empowerment and leader–member exchange composite variables. The distribution retains its familiar
shape and looks like a bell-­shaped surface in three-­dimensional space. The probability
distribution function that describes the shape of the surface has the same basic struc-
ture as its univariate sibling in Equation 2.3, with vectors and matrices replacing scalar
quantities.
( −V ×.5)  1 
f ( Yi | μ, S ) =π
(2 ) exp  − ( Yi − μ )′ S −1 ( Yi − μ ) 
−.5
S (2.53)
 2 
The column vector Yi now contains the V observations for participant i, μ is the corre-
sponding vector of population means, and Σ is a variance–covariance matrix of the V
variables. As before, the function on the left side of the expression can be read as “the
relative probability of the V observations given assumed values for the model param-
eters.” Visually, the equation describes the height of the surface in Figure 2.14 at the
intersection of score values along the horizontal and depth axes. The term in the expo-
nential function, (Yi – μ)′ Σ–1(Yi – μ), is a key component that equals the sum of squared
standardized differences between the scores and the distribution’s center (a quantity
known as Mahalanobis distance). Finally, the terms to the left of the exponential func-
tion scale the distribution so the area under the surface sums or integrates to 1.
As you know, a probability distribution treats scores as variable and the parameters
as known constants. To illustrate the distribution function’s output, assume that the
true population parameters are as follows (these happen to be the maximum likelihood
estimates for the employee empowerment and leader–member exchange variables):

  μ = ( 28.61 )        Σ = ( 20.38   5.37 )        (2.54)
      (  9.59 )            (  5.37   9.10 )
The contour plot in Figure 2.15 shows the perspective of a drone hovering over the
peak of the bivariate normal distribution in Figure 2.14, with smaller contours denoting
higher elevation and larger relative probabilities (and vice versa). The overhead perspec-
tive better reveals the positive correlation between employee empowerment and leader–
member exchange. The black diamond corresponds to empowerment and leader–member
exchange scores of Y1 = (32.00, 13.18)′, and the black circle corresponds to Y2 = (33.25, 9)′.

FIGURE 2.14. An idealized bivariate normal probability distribution for the employee empow-
erment and leader–member exchange variables.

FIGURE 2.15. The contour plot shows the perspective of a drone hovering over the peak of
the bivariate normal distribution in Figure 2.14, with smaller contours denoting higher eleva-
tion and larger relative probabilities (and vice versa). The overhead perspective better reveals the
positive correlation between employee empowerment and leader–member exchange. The black
circle and diamond are two pairs of scores located at the same vertical elevation.

Substituting everything into
Equation 2.53 returns relative probability values of f(Y1|μ, Σ) = 0.006 and f(Y2|μ, Σ) =
0.006. The two pairs of scores have the same relative probability (i.e., are located at the
same vertical elevation), despite the fact that the straight line connecting Y1 to the center
of the distribution is noticeably shorter than the line connecting Y2 to the peak. This
result happens, because the positive correlation rotates the contours in such a way that
elevation drops rapidly directly above and below the distribution’s peak. This feature is
also apparent in Equation 2.53, where scaling the squared deviation scores relative to
the variance–­covariance matrix standardizes the distances in a way that accounts for
the correlations among the variables.
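You can verify these relative probabilities with scipy's multivariate normal density, using the parameters from Equation 2.54 (the only assumption here is that scipy is available):

```python
from scipy.stats import multivariate_normal

# Parameters from Equation 2.54: (empowerment, leader-member exchange)
mu = [28.61, 9.59]
Sigma = [[20.38, 5.37], [5.37, 9.10]]
dist = multivariate_normal(mean=mu, cov=Sigma)

f1 = dist.pdf([32.00, 13.18])  # black diamond
f2 = dist.pdf([33.25, 9.00])   # black circle
print(round(f1, 3), round(f2, 3))  # 0.006 0.006
```

Despite their different straight-line distances from the peak, the two score pairs sit at essentially the same height on the density surface.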
Following established concepts, estimation “reverses” the probability distribution’s
arguments to get the likelihood of different combinations of population parameters
given the observed data. Taking the natural logarithm gives the log-­likelihood contribu-
tion for a single observation:
  LLi(μ, Σ | data) = −(V/2) ln(2π) − ½ ln|Σ| − ½ (Yi − μ)′ Σ⁻¹ (Yi − μ)        (2.55)
and summing across the N observations gives the sample log-­likelihood.


  LL(μ, Σ | data) = −(NV/2) ln(2π) − (N/2) ln|Σ| − ½ ∑ᵢ₌₁ᴺ (Yi − μ)′ Σ⁻¹ (Yi − μ)        (2.56)

Numerically, the log-­likelihood is a large negative value that summarizes the data’s evi-
dence for a specific combination of parameter values in μ and Σ, with higher or less
negative numbers reflecting better fit (and vice versa). Visually, the log-­likelihood cor-
responds to the height of a multidimensional surface at specific values of μ and Σ. As
always, the goal of estimation is to identify the parameter values that maximize fit to
the observed data (or equivalently, minimize the sum of the squared z-scores in the
rightmost term).
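Equation 2.56 translates directly into code. The hand-rolled function below is a sketch with my own naming, and the last line checks it against scipy's multivariate normal log-density on simulated data.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_loglik(Y, mu, Sigma):
    """Sample log-likelihood from Equation 2.56."""
    N, V = Y.shape
    Sinv = np.linalg.inv(Sigma)
    resid = Y - mu
    # Mahalanobis distance for each row: (Yi - mu)' Sigma^(-1) (Yi - mu)
    mahal = np.einsum('ij,jk,ik->i', resid, Sinv, resid)
    return (-N * V / 2 * np.log(2 * np.pi)
            - N / 2 * np.log(np.linalg.det(Sigma))
            - 0.5 * mahal.sum())

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 3))
mu = np.zeros(3)
Sigma = np.eye(3)
ll = mvn_loglik(Y, mu, Sigma)
print(np.isclose(ll, multivariate_normal(mu, Sigma).logpdf(Y).sum()))  # True
```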

Maximum Likelihood Estimates and Standard Errors


Consistent with the previous examples, we can derive an exact solution for the mean
vector and covariance matrix or use an iterative optimization approach such as New-
ton’s algorithm. An exact solution requires first and second derivatives of the log-­
likelihood function. The underlying logic is the same as before—­solve for the param-
eters after setting the derivative expressions to 0—but getting the derivative expres-
sions is more complex and requires matrix calculus (Magnus & Neudecker, 1999).
Although most of the equations are not intuitive, I include them as a resource for
interested readers. Equations aside, you can still follow the gist of estimation, because
all quantities retain their previous meaning (e.g., a first derivative gives the slope at a
particular point on the log-­likelihood surface; a second derivative captures curvature
or steepness at the peak).
The first derivatives with respect to μ and Σ are as follows:
  ∂LL/∂μ = −N Σ⁻¹ μ + Σ⁻¹ ∑ᵢ₌₁ᴺ Yi        (2.57)

  ∂LL/∂Σ = −½ ∑ᵢ₌₁ᴺ [ Σ⁻¹ − Σ⁻¹ (Yi − μ)(Yi − μ)′ Σ⁻¹ ]        (2.58)

Setting these equations to 0 and solving for the parameters gives the following analytic
solutions for the maximum likelihood estimates:
  μ̂ = (1/N) ∑ᵢ₌₁ᴺ Yi        (2.59)

  Σ̂ = (1/N) ∑ᵢ₌₁ᴺ (Yi − μ̂)(Yi − μ̂)′        (2.60)
The analytic solutions highlight a recurring theme, which is that maximum likelihood
estimates of variances and covariances do not adjust for the degrees of freedom spent
estimating the means; as such, variance–­covariance estimates are biased in small sam-
ples but approach their true population values as sample size increases (i.e., the esti-
mates are said to be consistent).
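This N versus N − 1 distinction is exactly numpy's bias argument. A quick sketch with simulated data (not the employee data set):

```python
import numpy as np

rng = np.random.default_rng(42)
Y = rng.normal(loc=5.0, scale=2.0, size=(630, 3))
N = Y.shape[0]

S_ml = np.cov(Y, rowvar=False, bias=True)         # ML estimate: N in the denominator (Eq. 2.60)
S_unbiased = np.cov(Y, rowvar=False, bias=False)  # usual unbiased estimate: N - 1

# The ML estimate is the unbiased estimate shrunk by (N - 1)/N
print(np.allclose(S_ml, S_unbiased * (N - 1) / N))  # True
```

With N = 630 the shrinkage factor is 629/630, which is why the two panels of Table 2.6 differ only trivially.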
Second derivatives quantify the curvature or steepness of the log-­likelihood func-
tion near its peak (i.e., the rate at which the first-order slopes change across the range of
parameter values). Second derivatives are obtained by applying matrix calculus rules to
Equations 2.57 and 2.58, and the Hessian collects these equations in a symmetric matrix
with P rows and columns, where P is the number of unique parameters in μ and Σ.

           ( ∂²LL/∂μ²     ∂²LL/∂μ∂Σ )
  HO(θ) =  (                        )        (2.61)
           ( ∂²LL/∂Σ∂μ    ∂²LL/∂Σ²  )

The second derivative equations below are the building blocks for the observed infor-
mation matrix, and analogous expressions for the expected information are available
in the literature (Savalei, 2010; Savalei & Bentler, 2009; Yuan & Hayashi, 2006) and in
Chapter 3.
  ∂²LL/∂μ² = −N Σ⁻¹        (2.62)

  ∂²LL/∂Σ² = ∑ᵢ₌₁ᴺ −DV′ [ Σ⁻¹ ⊗ { Σ⁻¹ (Yi − μ)(Yi − μ)′ Σ⁻¹ − .5Σ⁻¹ } ] DV

  ∂²LL/∂μ∂Σ = ∑ᵢ₌₁ᴺ −[ Σ⁻¹ ⊗ (Yi − μ)′ Σ⁻¹ ] DV
The ⊗ symbol is a Kronecker product that multiplies one matrix by each element of
another matrix, and DV is the so-­called “duplication matrix” (Magnus & Neudecker,
1999). Each covariance term appears twice in the first derivative matrix from Equation
2.58 but only once in the Hessian (and similarly, only once in the parameter covariance
matrix). The duplication matrix combines these redundant terms into a single value.
Substituting the maximum likelihood estimates into the derivative expressions, multi-
plying HO(θ̂) by –1, then taking its inverse gives the variance–­covariance matrix of the
estimates.
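To see how curvature maps onto standard errors, the sketch below (an illustration of the logic, not the book's companion program) approximates the second derivative of a univariate normal log-likelihood with a central finite difference, multiplies by −1, and inverts. Holding the variance at its ML estimate, the result should match the familiar σ̂/√N.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(loc=10.0, scale=3.0, size=500)
N = len(y)
mu_hat, var_hat = y.mean(), y.var()  # ML estimates (N in the denominator)

def loglik_mu(mu):
    # Normal log-likelihood as a function of mu, variance fixed at its ML estimate
    return -N / 2 * np.log(2 * np.pi * var_hat) - ((y - mu) ** 2).sum() / (2 * var_hat)

# Central finite difference approximates the curvature at the peak
h = 1e-4
d2 = (loglik_mu(mu_hat + h) - 2 * loglik_mu(mu_hat) + loglik_mu(mu_hat - h)) / h ** 2

# -1 times the inverse curvature is the sampling variance; SE is its square root
se_numeric = np.sqrt(-1 / d2)
se_analytic = np.sqrt(var_hat / N)
print(np.isclose(se_numeric, se_analytic, rtol=1e-4))  # True
```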

Analysis Example
Returning to the empty regression models in Equation 2.52, I use work satisfaction,
employee empowerment, and leader–­member exchange scales to illustrate maximum
likelihood estimation. Analysis scripts are available on the companion website, includ-
ing a custom R program for readers interested in coding Newton’s algorithm by hand.
Table 2.6 gives the maximum likelihood estimates of the means, standard deviations,
variances and covariances, and correlations (in bold typeface above the diagonal). I
computed the standard deviations and correlations by transforming the maximum like-
lihood estimates of the variances and covariances (e.g., a correlation is a covariance
divided by square root of the product of two variances). As a comparison, Table 2.6 also
gives results from the usual unbiased estimator of the variance–­covariance matrix. The
TABLE 2.6. Maximum Likelihood Descriptive Statistics


Variable 1 2 3
Maximum likelihood
1. WORKSAT 1.58   .29 .42
2. EMPOWER 1.64 20.38 .39
3. LMX 1.61 5.37 9.10
Means 3.99 28.61 9.59
SD 1.26 4.52 3.02

Unbiased sample estimates


1. WORKSAT 1.59   .29 .42
2. EMPOWER 1.64 20.42 .39
3. LMX 1.61 5.37 9.11
Means 3.99 28.61 9.59
SD 1.26 4.52 3.02

Note. Bold typeface denotes correlations.

maximum likelihood estimates of these parameters are consistently lower (albeit by a


trivial amount), because the estimator from Equation 2.60 has N rather than N – 1 in
the denominator.

2.13 CATEGORICAL OUTCOMES:


LOGISTIC AND PROBIT REGRESSION

Looking ahead to missing data analyses, we now have flexible estimators that accom-
modate mixtures of categorical and continuous incomplete variables. To set the stage for
later examples, I illustrate complete-­data maximum likelihood estimation for a binary
outcome variable. Continuing with the employee data set, I use a dichotomous measure
of turnover intention that equals 0 if an employee has no plan to leave his or her position
and 1 if the employee has the intention of quitting. The bar graph in Figure 2.16 shows
the distribution of the discrete responses.

Latent Response Variable Formulation


Logit and probit regression envision binary scores originating from an underlying latent
response variable that represents one’s underlying proclivity or propensity to endorse
the highest category (Agresti, 2012; Johnson & Albert, 1999). Applied to the turnover
intention measure, this latent variable represents an unobserved, continuous dimension
of quitting intentions. To illustrate, Figure 2.17 shows the latent variable distribution
for the bar graph in Figure 2.16. The vertical line represents the precise cutoff point or
threshold in the latent distribution where discrete scores switch from 0 to 1 (or more
generally, from the lowest code to the highest code). The areas under the curve above and
below this threshold correspond to the category proportions in the bar chart: 69% of the
area under the curve falls below the threshold, and 31% falls above in the shaded region.

FIGURE 2.16. Bar graph of the dichotomous measure of turnover intention. TURNOVER = 0
if an employee has no plan to leave his or her position, and TURNOVER = 1 if the employee has
intentions of quitting.

Using generic notation, the link between the latent scores and categorical responses is

        { 0   if Yi* ≤ τ
  Yi =  {                        (2.63)
        { 1   if Yi* > τ

where Yi is the binary outcome for individual i, Yi* is the corresponding latent response
score, and τ is the threshold parameter (the vertical line in Figure 2.17). Fixing the latent
response variable’s mean or its threshold parameter to 0 provides a necessary identifica-
tion constraint, and I always adopt the latter strategy.
Adding an explanatory variable to the latent response model is a relatively small
step forward. To illustrate, consider a simple regression with leader–member exchange
(employee–supervisor relationship quality) predicting turnover intention, the latent
variable model for which is as follows:

TURNOVERi* = β0 + β1(LMXi) + ϵi (2.64)

The key difference between logistic and probit regression is the distribution of the resid-
ual term—the probit model defines ϵi as a standard normal variable, whereas logistic
regression defines the residual as a standard logistic variable.

FIGURE 2.17. Latent response distribution for a binary variable. The vertical line at 0 is a
threshold parameter τ that divides the latent distribution into two regions. Employees with no
quitting intentions have latent scores below the threshold, and employees who intend to quit
have scores above the threshold. The area under the shaded region of the curve is the probability
of quitting (the proportion of 1’s in the data).

To illustrate a probit
regression model, Figure 2.18 shows the latent variable distributions at three values of
the explanatory variable, with the area above the threshold parameter (the predicted
probabilities) shaded in gray. The black dots represent predicted values, and the contour
rings convey the perspective of a drone hovering over the peak of a bivariate normal dis-
tribution, with smaller contours denoting higher elevation (and vice versa). The graph
for a logistic regression is similar, but standard logistic distributions have thicker tails
than the normal curves in the figure.
Going forward, I use the following notation for probit regression models to empha-
size the normally distributed latent response variable, which later functions as an
incomplete variable to be imputed:

  Yi* = β0 + β1Xi + εi        (2.65)

  εi ~ N1(0, 1)

The second term in the normal distribution function indicates that the latent response
variable’s variance is fixed at 1 to provide a metric. I write the logistic model in its more
usual format as
  ln[ Pr(Yi = 1) / (1 − Pr(Yi = 1)) ] = β0 + β1Xi        (2.66)
 i 1) 
where the term on the left side of the equation is the log odds or logit. The logistic model
also has a fixed variance, which I omit from the expression.
Both modeling frameworks provide a conversion to the probability metric, albeit
using different functions. The predicted probability of endorsing the highest category
(e.g., the probability of quitting) from the probit model is

  Pr(Yi = 1 | β, data) = 1 − Φ( (τ − Xiβ) / √σ²ε ) = 1 − Φ(−Xiβ) = Φ(Xiβ) = πi        (2.67)
where Xi is the predictor vector for individual i (including a column of 1’s for the inter-
cept), β contains the coefficients, Xiβ is the predicted latent response, and Φ(·) is the
cumulative distribution function of the standard normal curve. The subtraction inside
the parentheses expresses the threshold as a z-score (recall that τ = 0 and σ²ε = 1), and
the function returns the area below this value in a standard normal curve. Subtracting
that result from 1 gives the area above the threshold (e.g., the shaded regions of the nor-
mal curves in Figure 2.18).

FIGURE 2.18. Latent response distributions at three values of leader–member exchange. The
areas above the threshold (the predicted probabilities) are shaded in gray, the black dots repre-
sent predicted values, and the contour rings convey the perspective of a drone hovering over the
peak of a bivariate normal distribution, with smaller contours denoting higher elevation (and
vice versa).

Similarly, the logit link function translates predicted latent response scores to the prob-
ability metric as follows:
  Pr(Yi = 1 | β, data) = exp(Xiβ) / (1 + exp(Xiβ)) = πi        (2.68)
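Both conversions are one-liners in scipy/numpy. The linear predictor value below is hypothetical; the snippet simply confirms the identity 1 − Φ(−Xiβ) = Φ(Xiβ) from Equation 2.67 and evaluates the logistic transform from Equation 2.68.

```python
import numpy as np
from scipy.stats import norm

eta = 0.08  # hypothetical predicted latent response, X_i * beta

# Probit conversion (Equation 2.67): Pr(Y = 1) = 1 - Phi(-eta) = Phi(eta)
p_probit = 1 - norm.cdf(-eta)
print(np.isclose(p_probit, norm.cdf(eta)))  # True

# Logit conversion (Equation 2.68): Pr(Y = 1) = exp(eta) / (1 + exp(eta))
p_logit = np.exp(eta) / (1 + np.exp(eta))
print(round(p_probit, 3), round(p_logit, 3))  # 0.532 0.52
```

The two links give slightly different probabilities for the same linear predictor, which is one reflection of the roughly 1.7 scaling difference between logit and probit coefficients.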

Probability Distribution and Log‑Likelihood


Probit regression is appealing, because it leverages a normal distribution for the under-
lying response variable. In later chapters, I adopt a likelihood expression that features
the latent response scores in the normal curve expression from Equation 2.3, but for
now, I use an alternative equation that represents an individual’s likelihood contribution
as a predicted probability (area under the standard normal distribution).

  Li(β | data) = [1 − Φ(−Xiβ)]^Yi × [Φ(−Xiβ)]^(1−Yi) = πi^Yi (1 − πi)^(1−Yi)        (2.69)

In the context of the employee turnover example, the likelihood features the product of
the predicted probability of quitting (left term) and not quitting (right term). The scores
in the exponents act like on–off switches that activate the left term (the predicted prob-
ability that Y = 1) if Y = 1 and trigger the right term (the predicted probability that Y =
0) if Y = 0. Taking the natural logarithm and summing across the N cases gives the fol-
lowing sample log-­likelihood expression:
  LL(β | data) = ∑ᵢ₌₁ᴺ [ Yi × ln(1 − Φ(−Xiβ)) + (1 − Yi) × ln(Φ(−Xiβ)) ]        (2.70)

Numerically, the log-­likelihood is a large negative number that equals the sum of log-
arithmically transformed probability values. Conceptually, this value represents the
data’s support for a particular combination of population regression coefficients in β.
The log-­likelihood for logistic regression has the same form as Equation 2.70 but
uses the Bernoulli probability distribution from Equation 2.1. Reversing
the probability distribution’s arguments by taking data values as given and varying the
parameters gives the likelihood expression for a single observation.
  Li(β | data) = [ exp(Xiβ) / (1 + exp(Xiβ)) ]^Yi × [ 1 − exp(Xiβ) / (1 + exp(Xiβ)) ]^(1−Yi) = πi^Yi (1 − πi)^(1−Yi)        (2.71)
Consistent with Equation 2.70, the likelihood features the product of the predicted
probability of quitting (left term) and not quitting (right term), and the scores in the
exponent activate the probability that corresponds to one’s binary response. Taking
the natural logarithm and summing across the N cases gives the following sample log-­
likelihood expression, which again represents the data’s support for a particular combi-
nation of regression parameters:
  LL(β | data) = ∑ᵢ₌₁ᴺ [ Yi × ln( exp(Xiβ) / (1 + exp(Xiβ)) ) + (1 − Yi) × ln( 1 / (1 + exp(Xiβ)) ) ]        (2.72)
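Both sample log-likelihoods are easy to code. The functions below are a sketch of Equations 2.70 and 2.72 (the names and the toy data are mine); either could be handed to a numerical optimizer such as scipy.optimize.minimize to carry out the estimation that Newton's algorithm performs.

```python
import numpy as np
from scipy.stats import norm

def probit_loglik(beta, X, y):
    """Equation 2.70: Bernoulli log-likelihood with a normal-CDF link."""
    eta = X @ beta
    pi = norm.cdf(eta)  # Pr(Y = 1), equal to 1 - Phi(-eta)
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def logit_loglik(beta, X, y):
    """Equation 2.72: Bernoulli log-likelihood with a logistic link."""
    eta = X @ beta
    pi = np.exp(eta) / (1 + np.exp(eta))
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# Tiny illustrative data set: a column of 1's for the intercept plus one predictor
X = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 9.0], [1.0, 13.0]])
y = np.array([0, 0, 1, 1])
beta = np.array([-1.5, 0.2])
print(probit_loglik(beta, X, y) < 0, logit_loglik(beta, X, y) < 0)  # True True
```

Both values are sums of logged probabilities, so they are always negative; larger (less negative) values indicate coefficients with more support from the data.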
Unlike the other models in this chapter, there is no analytic solution for the probit
and logistic regression coefficients, and iterative optimizers such as Newton’s algorithm
are a must. Iterative optimization works the same as it did with normally distributed
data, so I point readers to the literature for additional technical details (Agresti, 2012;
Greene, 2017). Putting aside the technicalities, the process of computing standard errors
follows the same procedure described earlier in the chapter; manipulating the matrix of
second derivatives that quantifies the curvature of the log-­likelihood function gives the
variance–­covariance matrix of the estimates, the diagonal of which contains squared
standard errors. Similarly, the significance testing options described in Section 2.11 are
no different with categorical variable models.

Analysis Example
Expanding on the employee turnover example, I used maximum likelihood estimation
to fit probit and logistic regression models that use leader–­member exchange, employee
empowerment, and a male dummy code (0 = female, 1 = male) to predict a binary mea-
sure of turnover intention (TURNOVER = 0 if an employee has no plan to leave his or her
position, and TURNOVER = 1 if the employee has intentions of quitting).

  TURNOVERi* = β0 + β1(LMXi) + β2(EMPOWERi) + β3(MALEi) + εi        (2.73)

  ln[ Pr(TURNOVERi = 1) / (1 − Pr(TURNOVERi = 1)) ] = β0 + β1(LMXi) + β2(EMPOWERi) + β3(MALEi)
The probit model’s residual variance is fixed at 1 for identification, and the model addi-
tionally incorporates a fixed threshold parameter that divides the latent response vari-
able distribution into two segments. The logistic regression can also be viewed as a latent
response model, but it is typical to write the equation without a residual. Note that I
use β’s to represent focal model parameters, but the estimated coefficients will not be
the same (logit coefficients are approximately 1.7 times larger than probit coefficients;
Birnbaum, 1968). As always, analysis scripts are available on the companion website.
Table 2.7 shows the maximum likelihood analysis results for both models. Start-
ing with the probit regression results, the Wald test of the full model was statistically
significant, TW(3) = 20.00, p < .001, meaning that the estimates are at odds with the null
hypothesis that all three population slopes equal zero. Each slope coefficient reflects the
expected z-score change in the latent response variable for a one unit increase in the pre-
dictor, controlling for other regressors. For example, the leader–­member exchange coef-
ficient indicates that a one-unit increase in relationship quality is expected to decrease
the latent proclivity to quit by 0.06 z-score units (β̂1 = –0.06, SE = .02), holding other
predictors constant.
Turning to the logistic regression results, the Wald test of the full model was again
statistically significant, and the test statistic’s numerical value was comparable to that
TABLE 2.7. Probit and Logistic Regression Estimates


Parameter Est. RSE z p OR
Probit regression
β0 0.80 0.35 2.25 .03 —
β1 (LMX) –0.06 0.02 –2.99 .00 —
β2 (EMPOWER) –0.03 0.01 –1.83 .07 —
β3 (MALE) –0.03 0.11 –0.30 .77 —
R2   .06 .03 2.36 .02 —

Logistic regression
β0 1.37 0.60 2.30 .02 —
β1 (LMX) –0.10 0.04 –2.96 .00 0.90
β2 (EMPOWER) –0.04 0.02 –1.81 .07 0.96
β3 (MALE) –0.06 0.18 –0.31 .75 0.95
R2   .05 .02 2.30 .02 —

Note. RSE, robust standard error; OR, odds ratio.

of the probit model, TW(3) = 19.35, p < .001. Each slope coefficient now reflects the
expected change in the log odds of quitting for a one-unit increase in the predictor,
holding all other covariates constant. For example, the leader–­member exchange slope
indicates that a one-unit increase in relationship quality decreases the log odds of quit-
ting by .10 (β̂1 = –0.10, SE = .04), controlling for employee empowerment and gender.
Notice that the logistic coefficients are approximately 1.7 times larger than the probit
slopes, as expected (Birnbaum, 1968). Exponentiating each slope gives an odds ratio
that reflects the multiplicative change in the odds (the probability ratio on the left side
of Equation 2.66) for a one-unit increase in a predictor (e.g., a one-point increase on the
leader–­member exchange scale multiplies the odds of quitting by 0.90).
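The odds ratio arithmetic is worth checking by hand. Using the two leader–member exchange slopes from Table 2.7:

```python
import math

logit_slope = -0.10   # logistic LMX coefficient from Table 2.7
probit_slope = -0.06  # probit LMX coefficient from Table 2.7

# Exponentiating a logit slope gives the odds ratio
odds_ratio = math.exp(logit_slope)
print(round(odds_ratio, 2))  # 0.9

# Logit coefficients are roughly 1.7 times the probit coefficients
print(round(probit_slope * 1.7, 2))  # -0.1
```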
The analysis results highlight that probit and logistic models are effectively equiva-
lent and almost always lead to the same conclusions. Some researchers favor the logistic
framework, because it yields odds ratios, but there is otherwise little reason to prefer
one approach to the other. As you will see, probit regression plays a more central role
with Bayesian estimation and multiple imputation.

2.14 SUMMARY AND RECOMMENDED READINGS

Maximum likelihood is the go-to estimator for many common statistical models, and it
is one of the three major pillars of this book. As its name implies, the estimator identi-
fies the population parameters that are most likely responsible for a particular sample of
data. Much of this chapter has unpacked this definition in the context of linear regres-
sion models and multivariate analyses based on the normal distribution, and the last
section has outlined logistic and probit models for categorical outcomes. Having established all the major details behind estimation and inference, Chapter 3 applies maxi-
mum likelihood to missing data problems. As you will see, everything from this chapter
carries over to missing data applications, where the goal remains to identify parameter
values that maximize fit to the data—the only difference is that some participants have
more of it than others. Finally, I recommend the following articles for readers who want
additional details on topics from this chapter:

Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note.
American Statistician, 36, 153–157.

Eliason, S. R. (1993). Maximum likelihood estimation: Logic and practice. Newbury Park, CA:
Sage.

Greene, W. H. (2017). Econometric analysis (8th ed.). Boston: Prentice Hall.

Savalei, V. (2014). Understanding robust corrections in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 21, 149–160.
3

Maximum Likelihood Estimation with Missing Data

3.1 CHAPTER OVERVIEW

The origins of maximum likelihood missing data handling are quite old and date back
to the 1950s (Anderson, 1957; Edgett, 1956; Hartley, 1958; Lord, 1955). These early solu-
tions were limited in scope and had specialized applications (e.g., bivariate normal data
with a single, incomplete variable). Many important breakthroughs came in the 1970s
when methodologists developed the theoretical underpinnings of modern missing data-­
handling techniques, as well as computational methods to implement them (Beale &
Little, 1975; Dempster et al., 1977; Finkbeiner, 1979; Hartley & Hocking, 1971; Orchard
& Woodbury, 1972). For researchers in the social and behavioral sciences, maximum
likelihood missing data handling became a practical reality in the 1990s, when struc-
tural equation modeling software packages began implementing estimators for raw data
(Arbuckle, 1996; Jamshidian & Bentler, 1999; Muthén et al., 1987; Wothke, 2000). As
an aside, researchers often refer to maximum likelihood missing data handling as full-­
information maximum likelihood (FIML) estimation. Although the FIML acronym is
often synonymous with missing data applications (Arbuckle, 1996), the name conveys
that estimates are derived from the raw data, just as they were in Chapter 2.
Virtually everything from Chapter 2 carries over to missing data applications,
where the goal remains to identify parameter values that maximize fit to the data—the
only difference is that some participants have more of it than others. Missing data analy-
ses generally require iterative optimization routines, but the nuts and bolts of estima-
tion and inference mirror Chapter 2. As you will see, the missing data-­handling aspect
of maximum likelihood all happens behind the scenes; a researcher simply needs to
dial up a capable software package and specify a model. The estimator does not discard
incomplete data records, nor does it impute them. Rather, it identifies the parameter
values with maximum support from whatever data are available. The first part of this
chapter digs under the hood to illustrate how procedures from the previous chapter

accommodate missing data. These changes are intuitive, so readers who aren’t as inter-
ested in the finer details can still get the gist.
Maximum likelihood analyses have evolved considerably in recent years. The
estimators that were widely available when I was writing the first edition of this book
were generally limited to multivariate normal data. This is still a common assump-
tion for missing data analyses, but flexible estimation routines that accommodate mix-
tures of categorical and continuous variables are now widely available (Ibrahim, 1990;
Ibrahim, Chen, Lipsitz, & Herring, 2005; Lüdtke, Robitzsch, & West, 2020a; Muthén,
Muthén, & Asparouhov, 2016; Pritikin, Brick, & Neale, 2018). As you will see, these
approaches generally don’t work from a multivariate distribution, but rather disassem-
ble a model into multiple parts that leverage different probability distributions. This
strategy—­factorizing a multivariate distribution into easier component distributions or
submodels—­also paves the way for estimating interactions and nonlinear effects with
missing data (Lüdtke et al., 2020a; Robitzsch & Lüdtke, 2021). This is an important
innovation, as classic methods based on multivariate normality are known to introduce
bias (Cham, Reshetnyak, Rosenfeld, & Breitbart, 2017; Enders, Baraldi, & Cham, 2014;
Lüdtke et al., 2020a; Seaman, Bartlett, & White, 2012; Zhang & Wang, 2017). I use sev-
eral data analysis examples throughout the chapter to illustrate these newer methods
and their classic counterparts.

3.2 THE MULTIVARIATE NORMAL DISTRIBUTION REVISITED

Revisiting maximum likelihood estimation for multivariate normal data is a good start-
ing point that sets the stage for much of this chapter. As you will see, the missing data
methods in this book uniformly require distributional assumptions for incomplete vari-
ables. From a practical perspective, this means that univariate models such as multiple
regression must be specified as multivariate analyses to do any type of missing data
handling. The multivariate normal distribution is often a reasonable way to assign a
distribution to variables that wouldn’t otherwise need one, and it can work surprisingly
well with non-­normal variables. The multivariate normal distribution is also founda-
tional to the structural equation modeling approach that I discuss later in the chapter.
The structural modeling framework is an important toolkit for implementing maximum
likelihood estimation, as it accommodates a wide range of models with missing values
on outcomes or predictors.
I use the employee data from the companion website to provide a substantive con-
text. The data set includes several workplace-­related variables (e.g., work satisfaction,
turnover intention, employee–­supervisor relationship quality) for a sample of N = 630
employees (see Appendix). The illustration uses a 7-point work satisfaction rating (1 =
extremely dissatisfied to 7 = extremely satisfied) and two composite scores that measure
employee empowerment and a construct known as leader–­member exchange scale (the
quality of an employee’s relationship with his or her supervisor). I treat work satisfac-
tion as a normally distributed variable, because it has a sufficient number of response
options and a symmetrical distribution (Rhemtulla et al., 2012). The empty regression
models for the multivariate analysis are as follows:

$$\mathbf{Y}_i = \begin{pmatrix} \mathrm{WORKSAT}_i \\ \mathrm{EMPOWER}_i \\ \mathrm{LMX}_i \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} + \begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \varepsilon_{3i} \end{pmatrix} = \boldsymbol{\mu} + \boldsymbol{\varepsilon} \qquad (3.1)$$

$$\mathbf{Y}_i \sim N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

The bottom row of the expression says that the variables follow a three-­dimensional
normal distribution with parameters μ and Σ. I used this same example in Section 2.12,
but now there are missing values; work satisfaction ratings have a 4.8% missing data
rate, the employee empowerment variable has 16.2% of its scores missing, and 4.1% of
the leader–­member exchange values are incomplete.

Complete‑Data Log‑Likelihood
A quick recap of concepts from Chapter 2 sets the stage for missing data handling. After
collecting data, recall that we “reverse” the probability distribution’s arguments to get
the likelihood of different combinations of population parameters given the observed
data. Taking the natural logarithm of the multivariate normal distribution function
gives the log-­likelihood contribution for a single observation.

$$LL_i\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{i(\mathrm{com})}\right) = -\frac{V}{2}\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}) \qquad (3.2)$$

To refresh notation, V is the number of variables, Yi is a column vector of scores for participant i, and μ and Σ denote the population mean vector and variance–covariance matrix, respectively. The “com” subscript on the left side of the expression emphasizes that each person’s data are complete.
Summing across N participants gives the log-­likelihood function for a sample of
data.

$$LL\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{(\mathrm{com})}\right) = -N\frac{V}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i=1}^{N}(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}) \qquad (3.3)$$

Going forward, I refer to this equation as the complete-data log-likelihood to differentiate the expression from its missing data sibling. As you know, the log-likelihood
is a large negative value that summarizes the data’s evidence for a specific combina-
tion of parameter values, with higher numbers reflecting better fit (and vice versa). As
always, the goal of estimation is to identify the parameter values that maximize fit to
the observed data (or equivalently, minimize the sum of the squared z-scores in the
rightmost term).

Observed‑Data Log‑Likelihood
Returning to the ideas from Chapter 1 (Little & Rubin, 2020; Rubin, 1976), missing
data theory imagines a hypothetically complete data set that partitions into observed
and missing components. Symbolically, this idea is expressed as Y(com) = (Y(obs), Y(mis)).

The values in Y(mis) are essentially latent variable scores that we are unable to collect. To
illustrate, the three variables from the employee data illustration exhibit the five missing
data patterns in Table 3.1: (1) cases with complete data on all three variables, (2) par-
ticipants with missing data on just one of the three variables, and (3) persons missing
both work satisfaction and employee empowerment scores. The contents of Y(obs) and Y(mis)
thus vary across patterns, with Y(obs) containing between one and three scores and Y(mis)
containing one or two unseen values.
The unseen values in Y(mis) cannot function as known constants in the log-­likelihood
expression, so the estimator removes the missing parts of the data from the multivariate
normal distribution and identifies the parameter values that maximize fit to the remain-
ing observed data. As a result, each participant with one or more observations contrib-
utes to the analysis, and nothing is wasted. You often see the probability distribution of
the observed data written as follows:

$$f\left(\mathbf{Y}_{(\mathrm{obs})} \mid \boldsymbol{\theta}\right) = \int f\left(\mathbf{Y}_{(\mathrm{com})} \mid \boldsymbol{\theta}\right)\, d\mathbf{Y}_{(\mathrm{mis})} \qquad (3.4)$$

The integration operator says that the observed-­data distribution on the left side of
the equation is obtained by averaging or marginalizing over the missing values. Mar-
ginalizing is akin to replacing each latent score in Y(mis) with a weighted sum over all
possible values of the missing variable, with higher weights assigned to more plausible
scores and vice versa.
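The marginalization in Equation 3.4 can be checked numerically for the bivariate normal case. The Python sketch below (an illustration with arbitrary parameter values, using SciPy) integrates a bivariate normal density over one variable and confirms that the result equals the univariate normal density of the remaining variable.

```python
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal, norm

# Arbitrary bivariate normal parameters (means, variances, covariance)
mu = np.array([9.6, 28.6])
cov = np.array([[9.1, 5.7],
                [5.7, 19.5]])
joint = multivariate_normal(mean=mu, cov=cov)

x = 11.0  # evaluate the marginal density at this value of the first variable

# Average (integrate) the joint density f(x, y) over all possible y values
marginal_by_integration, _ = integrate.quad(
    lambda y: joint.pdf([x, y]), -np.inf, np.inf)

# Marginalizing a bivariate normal yields a univariate normal density
marginal_closed_form = norm.pdf(x, loc=mu[0], scale=np.sqrt(cov[0, 0]))

print(marginal_by_integration, marginal_closed_form)
```

The two numbers agree to within the integrator's tolerance, which is exactly the point of Equation 3.8 later in this section: averaging a bivariate normal over its missing variable leaves a univariate normal for the observed variable.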
Next, let’s see how Equation 3.4 translates into a log-­likelihood function. When a
participant has missing values, the observed data for that individual no longer contain
information about every model parameter. The log-­likelihood equation accommodates
this feature by eliminating the elements in the data and parameter arrays that corre-
spond to the missing variables. A single individual’s contribution to the observed-­data
log-­likelihood function is

$$LL_i\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{i(\mathrm{obs})}\right) = -\frac{V_i}{2}\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{Y}_i - \boldsymbol{\mu}_i)'\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}_i) \qquad (3.5)$$

where Yi contains the participant’s observed data, Vi is the number of scores in the data
vector, and μi and Σi contain the subset of parameters in μ and Σ that correspond to the
observed variables in Yi. The equation says that all participants share the same model
parameters, but the fit of the observed data is restricted to those parameters for which an

TABLE 3.1. Missing Data Patterns from the Multivariate Analysis


Pattern % sample Work satisfaction Empowerment LMX
1 78.9 O O O
2 4.1 O O M
3 12.2 O M O
4 0.8 M O O
5 4.0 M M O

Note. LMX, leader–member exchange.



individual has scores. Summing across N participants gives the log-­likelihood function
for a sample of incomplete data.

$$LL\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{(\mathrm{obs})}\right) = -\sum_{i=1}^{N}\frac{V_i}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{N}\ln|\boldsymbol{\Sigma}_i| - \frac{1}{2}\sum_{i=1}^{N}(\mathbf{Y}_i - \boldsymbol{\mu}_i)'\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}_i) \qquad (3.6)$$

As mentioned previously, the maximum likelihood estimates no longer have analytic solutions, so optimizing the function requires an iterative algorithm.
A bivariate analysis with a single incomplete variable provides a closer look at the
log-­likelihood function. To keep the notation simple, I generically refer to the complete
and incomplete variables as X and Y, respectively. The bivariate analysis has two miss-
ing data patterns, each with a different fit function. The log-­likelihood contribution for a
participant with complete data features the full complement of terms from the bivariate
normal distribution.

$$\begin{aligned} LL_i\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{i(\mathrm{obs})} = (Y, X)\right) = &-\frac{2}{2}\ln(2\pi) - \frac{1}{2}\ln\begin{vmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{vmatrix} \\ &- \frac{1}{2}\begin{pmatrix} X_i - \mu_X \\ Y_i - \mu_Y \end{pmatrix}'\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}^{-1}\begin{pmatrix} X_i - \mu_X \\ Y_i - \mu_Y \end{pmatrix} \end{aligned} \qquad (3.7)$$

In contrast, participants with missing Y scores provide no information about μY, σY2,
or σYX. Dropping these elements from the parameter arrays leaves μi = μX and Σi = σX2,
and the data’s support for these remaining parameters derives from a univariate normal
distribution.

$$LL_i\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{i(\mathrm{obs})} = X\right) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma_X^2 - \frac{1}{2}\frac{(X_i - \mu_X)^2}{\sigma_X^2} \qquad (3.8)$$

Equation 3.8 is a concrete example of the integration operation from Equation 3.4, as
marginalizing a bivariate normal distribution over one of its variables yields a univariate
normal log-­likelihood.
Summing across the N observations gives the observed-­data log-­likelihood function
for the sample

$$\begin{aligned} LL\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{(\mathrm{obs})}\right) = &-n_C\ln(2\pi) - \frac{n_C}{2}\ln\begin{vmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{vmatrix} \\ &- \frac{1}{2}\sum_{i=1}^{n_C}\begin{pmatrix} X_i - \mu_X \\ Y_i - \mu_Y \end{pmatrix}'\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}^{-1}\begin{pmatrix} X_i - \mu_X \\ Y_i - \mu_Y \end{pmatrix} \\ &- \frac{n_M}{2}\ln(2\pi) - \frac{n_M}{2}\ln\sigma_X^2 - \frac{1}{2}\sum_{i=1}^{n_M}\frac{(X_i - \mu_X)^2}{\sigma_X^2} \end{aligned} \qquad (3.9)$$

where nC and nM are the number of cases with complete data and missing values, respectively. As always, the log-likelihood summarizes the data’s evidence for a specific combination of parameter values. The only new wrinkle is that some participants contribute
more information than others. Importantly, the estimator does not discard incomplete
data records, nor does it impute them. The overall goal remains the same, which is to
identify the parameters that maximize fit to the data.
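To make Equation 3.9 concrete, the Python sketch below (the data and parameter values are fabricated for illustration) computes the observed-data log-likelihood for a small bivariate sample by summing bivariate normal contributions for the complete cases and univariate normal contributions for the cases missing Y, then checks the hand-coded quadratic forms against SciPy's log densities.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)

# Candidate parameter values: mu = (mu_X, mu_Y) and the 2 x 2 covariance matrix
mu = np.array([9.6, 4.0])
cov = np.array([[9.1, 1.6],
                [1.6, 1.6]])

# Made-up data: complete (X, Y) pairs plus X values whose Y score is missing
complete = rng.multivariate_normal(mu, cov, size=8)       # n_C = 8 pairs
x_only = rng.normal(mu[0], np.sqrt(cov[0, 0]), size=4)    # n_M = 4 singletons

def observed_data_ll(mu, cov, complete, x_only):
    """Equation 3.9: complete cases use the bivariate normal distribution,
    and cases missing Y use the univariate normal distribution for X."""
    cov_inv = np.linalg.inv(cov)
    n_C, n_M = len(complete), len(x_only)
    ll = -n_C * np.log(2 * np.pi) - (n_C / 2) * np.log(np.linalg.det(cov))
    for row in complete:
        d = row - mu
        ll -= 0.5 * d @ cov_inv @ d  # squared Mahalanobis distance
    ll -= (n_M / 2) * np.log(2 * np.pi) + (n_M / 2) * np.log(cov[0, 0])
    ll -= 0.5 * np.sum((x_only - mu[0]) ** 2 / cov[0, 0])
    return ll

ll_by_hand = observed_data_ll(mu, cov, complete, x_only)

# The same quantity assembled from SciPy's log densities
ll_scipy = (multivariate_normal(mu, cov).logpdf(complete).sum()
            + norm(mu[0], np.sqrt(cov[0, 0])).logpdf(x_only).sum())

print(ll_by_hand, ll_scipy)
```

Both computations return the same number, which underscores the main point of the section: the observed-data log-likelihood is just a sum of per-pattern contributions, with each participant evaluated against the distribution of whatever variables he or she actually has.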

Analysis Example
I use the trivariate analysis model from Equation 3.1 to illustrate maximum likelihood
estimation for a multivariate analysis. Table 3.1 shows the five missing data patterns.
Applying previous ideas, estimation uses all available data, with each missing data pat-
tern contributing different information. For example, nearly 80% of the sample members
have complete data records, and the log-­likelihood contributions for these individuals
reflect fit to a trivariate normal distribution with the full collection of parameters (i.e.,
μi contains all three means and Σi is a 3 × 3 matrix). Three patterns comprise individu-
als who contribute two data points, and their fits are gauged relative to the parameters
of a bivariate normal distribution (i.e., μi contains two of three means, and Σi is a 2 ×
2 matrix containing the two relevant variances and a covariance). The final pattern
comprises participants with a single observation. These log-­likelihood contributions
reflect fit to a univariate normal distribution, where μi and Σi are both scalar values as
in Equation 3.8.
As explained previously, iterative optimizers such as Newton’s algorithm or the
expectation maximization algorithm (discussed later) are necessary for finding the esti-
mates that maximize fit to the observed data. Analysis scripts are available on the com-
panion website, including a custom R program for readers interested in coding Newton’s
algorithm by hand. Table 3.2 gives the maximum likelihood estimates of the means,
standard deviations, variances and covariances, and correlations. The standard devia-
tions and correlations are not estimated parameters but are instead deterministic func-
tions of the variances and covariances (e.g., a correlation is a covariance divided by
the square root of the product of two variances), and their delta method standard errors are
similarly functions of the component standard errors (Raykov & Marcoulides, 2004).
Following ideas established in Chapter 2, maximum likelihood estimates of variances
and covariances are negatively biased in small samples, because they do not subtract
the degrees of freedom spent estimating the means; these biases should be trivial, with
a sample size of N = 630, even with substantial missing data. As an aside, researchers
often ask what sample size they should report for a missing data analysis. While no
single N drives the precision of the estimates, the sample size is the number of cases
with at least one observation. For this analysis, that’s all N = 630 employees.
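The chapter's companion scripts implement this analysis in R; as a rough language-agnostic analogue, the Python sketch below (simulated data, not the employee data set) maximizes the pattern-based observed-data log-likelihood for a bivariate normal model with a general-purpose optimizer, parameterizing the covariance matrix through its Cholesky factor so every candidate stays positive definite.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(42)

# Simulate bivariate normal data, then impose a conditionally MAR process:
# Y is deleted for the half of the sample with the lowest X scores
true_mu = np.array([10.0, 30.0])
true_cov = np.array([[9.0, 5.5],
                     [5.5, 20.0]])
data = rng.multivariate_normal(true_mu, true_cov, size=2000)
drop_y = data[:, 0] < np.median(data[:, 0])
complete = data[~drop_y]     # both X and Y observed
x_only = data[drop_y, 0]     # Y is missing; only X contributes

def neg_loglik(theta):
    """Negative observed-data log-likelihood in the spirit of Equation 3.9:
    bivariate normal terms for complete cases, univariate normal terms for
    cases missing Y. Sigma is built from a Cholesky factor with positive
    diagonal so the optimizer cannot leave the valid parameter space."""
    mu = theta[:2]
    chol = np.array([[np.exp(theta[2]), 0.0],
                     [theta[3], np.exp(theta[4])]])
    cov = chol @ chol.T
    ll = multivariate_normal(mu, cov).logpdf(complete).sum()
    ll += norm(mu[0], np.sqrt(cov[0, 0])).logpdf(x_only).sum()
    return -ll

# Start from complete-case estimates and iterate to the maximum
cc_mean = complete.mean(axis=0)
cc_chol = np.linalg.cholesky(np.cov(complete.T))
start = np.array([cc_mean[0], cc_mean[1],
                  np.log(cc_chol[0, 0]), cc_chol[1, 0], np.log(cc_chol[1, 1])])
fit = minimize(neg_loglik, start, method="Nelder-Mead",
               options={"maxiter": 10000, "maxfev": 10000,
                        "xatol": 1e-7, "fatol": 1e-9})
mu_hat = fit.x[:2]

print("complete-case means:", cc_mean)      # biased upward by the MAR process
print("maximum likelihood means:", mu_hat)  # close to the true (10, 30)
```

The complete-case means are noticeably too high, whereas the maximized observed-data log-likelihood recovers the population means, previewing the MAR behavior illustrated in the next section.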

3.3 HOW DO INCOMPLETE DATA RECORDS HELP?

Unlike Bayesian estimation and multiple imputation, maximum likelihood does not
explicitly impute the missing data. Rather, the estimator identifies the optimal parameter
values using whatever data it has at its disposal. While the observed-­data log-­likelihood

TABLE 3.2. Maximum Likelihood Estimates of Descriptive


Statistics and Bivariate Associations
Variables Est. SE z p
Means
Work Satisfaction 3.98 0.05 77.39 < .01
Empowerment 28.59 0.19 149.27 < .01
LMX 9.62 0.12 78.76 < .01

Standard deviations
Work Satisfaction 1.27 0.04 34.66 < .01
Empowerment 4.42 0.14 31.95 < .01
LMX 3.02 0.09 34.89 < .01

Variances and covariances


Work Satisfaction 1.60 0.09 17.33 < .01
Empowerment 19.49 1.22 15.98 < .01
LMX 9.11 0.52 17.45 < .01
Work Satisfaction ↔ Empowerment 1.71 0.26   6.68 < .01
Work Satisfaction ↔ LMX 1.61 0.17   9.46 < .01
Empowerment ↔ LMX 5.65 0.65   8.71 < .01

Correlations
Work Satisfaction ↔ Empowerment 0.31 0.04   7.65 < .01
Work Satisfaction ↔ LMX 0.42 0.03 12.35 < .01
Empowerment ↔ LMX 0.42 0.04 11.21 < .01

Note. LMX, leader–member exchange.

equation clearly shows that some observations contribute more information than others,
it doesn’t convey how the partial data records help achieve a more accurate answer—­
statistical theory and computer simulations like those in Chapter 1 tell us that they
can help, even with very large amounts of missing data. To provide some insight into
how estimation works, I created an artificial data set of employee empowerment and
leader–­member exchange scores with estimates like those in Table 3.2. I deleted 50% of
the leader–­member exchange values to mimic a conditionally MAR process where par-
ticipants with low empowerment are less likely to report their supervisor relationship
quality. This is a scenario in which maximum likelihood is known to produce accurate
estimates. Figure 3.1 shows the scatterplot of the hypothetical data, with gray circles
representing complete cases and black crosshairs denoting partial data records with
missing leader–­member exchange scores. The gray contour rings convey the perspective
of a drone hovering over the peak of the bivariate normal population data, with smaller
contours denoting higher elevation (and vice versa).

[Figure 3.1 appears here: scatterplot with Leader–Member Exchange on the horizontal axis (0 to 20) and Empowerment on the vertical axis (10 to 50).]

FIGURE 3.1. Scatterplot of an artificial data set of leader–member exchange and employee
empowerment scores. Fifty percent of the leader–member exchange scores follow an MAR pro-
cess where participants with low empowerment are more likely to have missing values. Gray
circles represent complete cases, and black crosshairs denote partial data records.

To begin, consider what happens if we discard the incomplete data records and
base the analysis on the 50% of participants with complete data. Figure 3.2 shows the
scatterplot after removing the observations with missing data. Importantly, the black
dot at the means is too high along both axes. This bias makes sense considering Figure
3.1, where the MAR process systematically culls data points from the lower tails of the
distributions in the lower-left quadrant of the plot.
Figure 3.3 adds the partial data records with observed empowerment scores along
the vertical axis (horizontal jitter is added to enhance their visibility). The black dot
now represents the maximum likelihood estimates of the means based on all observed
data. The additional employee empowerment scores from the incomplete data records
exert two forces that steer the inaccurate black dot in Figure 3.2 toward the accurate
black dot in Figure 3.3. First, adding empowerment scores to the low end of the dis-
tribution increases the now-complete variable’s variance and decreases its mean. Visu-
ally, the black dot at the center of the complete-case distribution in Figure 3.2 moves
down to the vertical location of the maximum likelihood estimate in Figure 3.3. The
second, less obvious adjustment comes from the normal curve itself. In a normal distri-
bution, the additional scores at the low end of the employee empowerment distribution

(the crosshair symbols) are only plausible if they are paired with correspondingly low
leader–member exchange scores in the lower-left quadrant of the plot. Although the
matching relationship quality scores are unobserved, the estimator infers their location
and adjusts the parameters to account for the presence of latent or unobserved data in
the lower tail of the distribution. The inaccurate black dot from Figure 3.2, which had
already moved to the correct vertical coordinate, now moves left to the horizontal coor-
dinate of the maximum likelihood estimates (the black dot) in Figure 3.3.
I’ve repeatedly emphasized that maximum likelihood estimation does not impute
missing values, but the illustration suggests that something similar is happening under
the hood. In fact, the normal curve itself functions as an imputation machine in the sense
that the estimator can infer the horizontal location of an unseen data point from its
observed vertical location (and vice versa). Thus, while the estimator doesn’t literally
create a filled-in data set, it does use the normal distribution to deduce plausible values
for the missing data. Widaman (2006) describes this process as implicit imputation, and
I sometimes use his phrase to describe maximum likelihood.
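Widaman's implicit imputation idea can be made concrete with the conditional mean of a bivariate normal distribution: given an observed empowerment score, the most plausible location for the unseen leader–member exchange score is the regression of one variable on the other. The Python sketch below (parameter values loosely mimicking Table 3.2; an illustration, not the book's code) computes that conditional mean two ways, from the closed-form regression expression and by numerically integrating x·f(x | y).

```python
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal, norm

# Parameter values resembling Table 3.2 (X = LMX, Y = empowerment)
mu_x, mu_y = 9.62, 28.59
var_x, var_y, cov_xy = 9.11, 19.49, 5.65

y_obs = 22.0  # an observed empowerment score whose LMX value is missing

# Closed form: E[X | Y = y] = mu_x + (cov_xy / var_y)(y - mu_y),
# the regression of X on Y evaluated at the observed score
cond_mean = mu_x + (cov_xy / var_y) * (y_obs - mu_y)

# Same quantity from first principles: integrate x * f(x | y), where the
# conditional density is the joint density divided by the marginal of Y
joint = multivariate_normal([mu_x, mu_y], [[var_x, cov_xy],
                                           [cov_xy, var_y]])
f_y = norm.pdf(y_obs, loc=mu_y, scale=np.sqrt(var_y))
numerical_mean, _ = integrate.quad(
    lambda x: x * joint.pdf([x, y_obs]) / f_y, -np.inf, np.inf)

print(cond_mean, numerical_mean)
```

Both calculations place the unseen relationship quality score well below the LMX mean of 9.62, which is exactly the left-and-down adjustment depicted in Figure 3.3: a low observed empowerment score is only plausible when paired with a correspondingly low leader–member exchange score.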
[Figure 3.2 appears here: scatterplot with Leader–Member Exchange on the horizontal axis (0 to 20) and Empowerment on the vertical axis (10 to 50).]

FIGURE 3.2. Complete-case scatterplot after removing the observations with missing
leader–member exchange scores. The contour rings convey the perspective of a drone hovering
over the bivariate normal population distribution, with smaller contours denoting higher eleva-
tion (and vice versa). The black dot denotes the complete-case means, which are too high along
both axes.

[Figure 3.3 appears here: scatterplot with Leader–Member Exchange on the horizontal axis (0 to 20) and Empowerment on the vertical axis (10 to 50).]

FIGURE 3.3. Scatterplot that adds the partial data records with observed empowerment
scores along the vertical axis (horizontal jitter is added to enhance their visibility). The black dot
represents the maximum likelihood estimates of the means.

3.4 STANDARD ERRORS WITH INCOMPLETE DATA

Recall from Chapter 2 that the curvature or steepness of the log-likelihood function
near its peak determines the precision of the estimates. A steep function implies high
precision (and a small standard error), because the data’s support for different candi-
date parameter values decreases rapidly as the parameter moves away from its optimal
value in either direction. Measuring curvature and computing standard errors required
second derivatives of the log-likelihood function. Mathematically, these derivative equa-
tions quantify curvature by measuring the rate at which tangent lines change near the
function’s peak (e.g., lines tangent to a steep function change rapidly near its peak,
whereas lines tangent to a flatter function change very little).
In fact, all the concepts from Chapter 2 work the same with missing data. The
only new detail is that participants with missing data contribute less information to
the derivative expressions. When participants have missing values, their data no lon-
ger contain information about every model parameter. The observed-data log-likelihood
function from Equation 3.5 accommodates this by ignoring parameters that depend on
the missing scores. The incomplete data records similarly contain no information about

the precision of certain estimates, so it makes sense that they wouldn’t contribute to
the corresponding derivative equations (and thus the standard errors). From a practical
perspective, the missing information flattens the log-­likelihood function and increases
standard errors, as you might expect. However, the reduction in power may not be com-
mensurate with the missing data rate (e.g., a 20% missing data rate does not imply a 20%
reduction in precision or power). This section outlines changes to the second derivative
matrix that gives rise to standard errors. The steps for converting this matrix into stan-
dard errors—­multiplying the Hessian by –1 and computing its inverse—­are the same
as in Chapter 2. Although most of the equations are not intuitive, I include them as a
resource for interested readers. Equations aside, the take-home message is straightfor-
ward: When a score is missing, a participant contributes a 0 to any element of the second
derivative matrix that depends on that variable.
A bivariate analysis with a single incomplete variable like that depicted in Figure
3.1 is sufficient for illustrating how the derivative computations change with missing
data. To keep notation simple, I generically refer to the complete and incomplete vari-
ables as Y and X, respectively. Recall from Section 2.12 that second derivatives are stored
in a symmetrical matrix known as the Hessian. The diagonal elements of this matrix
capture the curvature or steepness of the log-­likelihood function near its peak, and
the off-­diagonal elements measure the degree to which changes to one parameter cor-
respond with changes in another. The Hessian is a symmetrical matrix that comprises
three unique blocks.

$$\mathbf{H}_O(\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{\partial^2 LL}{\partial\boldsymbol{\mu}^2} & \dfrac{\partial^2 LL}{\partial\boldsymbol{\mu}\,\partial\boldsymbol{\Sigma}} \\ \dfrac{\partial^2 LL}{\partial\boldsymbol{\Sigma}\,\partial\boldsymbol{\mu}} & \dfrac{\partial^2 LL}{\partial\boldsymbol{\Sigma}^2} \end{pmatrix} \qquad (3.10)$$

The upper-left block is a V × V matrix containing derivatives taken with respect to the
means, the lower-­r ight block is a symmetrical matrix with V × (V + 1) ÷ 2 rows and
columns, one for each unique element of the variance–­covariance matrix, and the lower
diagonal (or upper diagonal) is a matrix of cross-­product derivatives taken first with
respect to a mean and then with respect to a variance or covariance (or vice versa). The
Hessian for this example is a symmetrical matrix with P = 5 rows and columns, one for
each unique element in μ and Σ.
With complete data, the upper-­left block (derivatives taken with respect to the
means) is computed by summing the inverse of the covariance matrix across the N
observations as follows:
$$\frac{\partial^2 LL}{\partial\boldsymbol{\mu}^2} = -\sum_{i=1}^{N}\boldsymbol{\Sigma}^{-1} = -\sum_{i=1}^{N}\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}^{-1} \qquad (3.11)$$

When a participant has missing values, the observed data for that individual no longer
contain information about the precision of certain estimates, so the derivative expression replaces the appropriate elements of Σ with zeros. In this example, cases with missing X values contribute zeros in place of σYX and σX2, as follows:
$$\frac{\partial^2 LL}{\partial\boldsymbol{\mu}^2} = -\sum_{i=1}^{n_C}\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}^{-1} - \sum_{i=1}^{n_M}\begin{pmatrix} 0 & 0 \\ 0 & \sigma_Y^2 \end{pmatrix}^{-1} \qquad (3.12)$$

The same is true for other parts of the Hessian. For the two off-­diagonal blocks, the
incomplete cases contribute information only to the element crossing μY and σY2:
 
nC 0 nM 0 
∂ 2 LL  −1 
∂μ ∂S
=

−1


−  S ⊗ ( Yi − μ ) S  DV −

0
 ∑ 0 
 (3.13)

( Yi − μY ) ( σY2 )
=i 1 =i 1
 −1 

0
 
and their contribution to the lower-­r ight block of the Hessian is similarly restricted to
the second derivative with respect to σY2.

$$\frac{\partial^2 LL}{\partial\boldsymbol{\Sigma}^2} = -\sum_{i=1}^{n_C}\mathbf{D}_V'\left[\boldsymbol{\Sigma}^{-1} \otimes \left(\boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu})(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1} - .5\boldsymbol{\Sigma}^{-1}\right)\right]\mathbf{D}_V - \sum_{i=1}^{n_M}\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & (Y_i - \mu_Y)^2\left(\sigma_Y^2\right)^{-3} - .5\left(\sigma_Y^2\right)^{-2} \end{pmatrix} \qquad (3.14)$$
 
The derivative expressions for the complete cases are the same as those in Section 2.12,
and the text adjacent to Equation 2.62 describes the meaning of the previous expres-
sion’s components. Although the equations are complicated, they offer a clear take-
home message: Participants contribute information only about parameters for which
they have data. The expressions also highlight that inserting zeros makes the resulting
sums smaller than they would have been with complete data. As a result, multiplying
the Hessian by –1 and computing its inverse (the matrix analog of a reciprocal) inflates
elements of the parameter covariance matrix, the diagonal of which contains squared
standard errors.
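The recipe in the last sentence, multiply the Hessian by –1, invert it, and read squared standard errors off the diagonal, is easy to demonstrate numerically. The Python sketch below (a complete-data univariate normal example, chosen because the answer has a known closed form; not the book's code) builds the Hessian of the log-likelihood by central finite differences at the maximum likelihood estimates and recovers the textbook standard error of the mean, σ̂/√N.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(loc=5.0, scale=2.0, size=500)
n = len(y)

def log_lik(theta):
    """Univariate normal log-likelihood with theta = (mu, variance)."""
    mu, var = theta
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(var)
            - 0.5 * np.sum((y - mu) ** 2) / var)

# Maximum likelihood estimates: sample mean and variance with divisor N
mu_hat, var_hat = y.mean(), y.var()  # numpy's ddof=0 default is the ML variance
theta_hat = np.array([mu_hat, var_hat])

def hessian(f, x, h=1e-4):
    """Central finite-difference approximation to the Hessian of f at x."""
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.zeros(k), np.zeros(k)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

H = hessian(log_lik, theta_hat)
param_cov = np.linalg.inv(-H)       # multiply by -1, then invert
se = np.sqrt(np.diag(param_cov))    # squared standard errors on the diagonal

# Closed-form comparisons: SE(mu) = sigma_hat / sqrt(N), SE(var) = var * sqrt(2/N)
print(se, np.sqrt(var_hat / n), var_hat * np.sqrt(2 / n))
```

The numerical standard errors match the analytic formulas, and the same three-step recipe applies verbatim to the missing data Hessians in Equations 3.12 through 3.15.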
For completeness, the rest of this section provides second derivative equations for
multivariate analyses with more than two variables (e.g., the analysis model in Equa-
tion 3.1). Readers who aren’t interested in these expressions can skip to the next sec-
tion without losing important information. To generalize the previous notation beyond
the bivariate example, we need to introduce a Vi × V matrix τi for each participant that
starts as an identity matrix (a matrix with ones on the diagonal and zeros elsewhere)
and removes the rows corresponding to the missing variables. Incorporating this new
matrix into the complete-­data derivative equations from Section 2.12 gives the expres-
sions that follow (Savalei, 2010; Savalei & Bentler, 2009).

$$\begin{aligned} \frac{\partial^2 LL}{\partial\boldsymbol{\mu}^2} &= -\sum_{i=1}^{N}\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i \\ \frac{\partial^2 LL}{\partial\boldsymbol{\Sigma}^2} &= -\sum_{i=1}^{N}\mathbf{D}_V'\left[\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i \otimes \left\{\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}_i)(\mathbf{Y}_i - \boldsymbol{\mu}_i)'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i - .5\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i\right\}\right]\mathbf{D}_V \\ \frac{\partial^2 LL}{\partial\boldsymbol{\mu}\,\partial\boldsymbol{\Sigma}} &= -\sum_{i=1}^{N}\left[\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i \otimes (\mathbf{Y}_i - \boldsymbol{\mu}_i)'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i\right]\mathbf{D}_V \end{aligned} \qquad (3.15)$$

Substituting the maximum likelihood estimates into the derivative expressions, multi-
plying HO(θ̂) by –1, then taking its inverse gives the variance–­covariance matrix of the
estimates, the diagonal of which contains squared standard errors.
The parameter arrays in Equation 3.15 align with the observed-­data log-­likelihood
expression from Equation 3.5, where μi and Σi contain the subset of parameters in μ
and Σ that corresponds to the observed variables in Yi. Pre- and postmultiplying these
arrays by τi fills in the missing elements of those matrices with zeros, just like in Equa-
tions 3.12 through 3.14. To verify, reconsider the previous bivariate example, where the
complete cases have τi = I2 and Σi = Σ and the incomplete cases have τi = (0 1) and Σi =
σY2. Substituting these quantities into the top formula of Equation 3.15 gives the same
result as Equation 3.12.
$$\begin{aligned} \frac{\partial^2 LL}{\partial\boldsymbol{\mu}^2} &= -\sum_{i=1}^{N}\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i = -\sum_{i=1}^{n_C}\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i - \sum_{i=1}^{n_M}\boldsymbol{\tau}_i'\,\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\tau}_i \\ &= -\sum_{i=1}^{n_C}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}'\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}^{-1}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \sum_{i=1}^{n_M}\begin{pmatrix} 0 & 1 \end{pmatrix}'\left(\sigma_Y^2\right)^{-1}\begin{pmatrix} 0 & 1 \end{pmatrix} \\ &= -\sum_{i=1}^{n_C}\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{pmatrix}^{-1} - \sum_{i=1}^{n_M}\begin{pmatrix} 0 & 0 \\ 0 & \sigma_Y^2 \end{pmatrix}^{-1} \end{aligned} \qquad (3.16)$$
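The padding role of τi is easy to verify with a few lines of matrix arithmetic. The Python sketch below (arbitrary parameter values; an illustration, not the book's code) shows that pre- and postmultiplying Σi⁻¹ = (σY²)⁻¹ by τi = (0 1) reproduces a 2 × 2 matrix with (σY²)⁻¹ in the lower-right cell and zeros elsewhere, exactly the incomplete-case contribution in Equations 3.12 and 3.16.

```python
import numpy as np

var_x, var_y, cov_xy = 9.1, 19.5, 5.7  # arbitrary bivariate parameters
Sigma = np.array([[var_x, cov_xy],
                  [cov_xy, var_y]])

# Complete cases: tau is a 2 x 2 identity matrix, so nothing is removed
tau_complete = np.eye(2)
block_complete = tau_complete.T @ np.linalg.inv(Sigma) @ tau_complete

# Incomplete cases observe only Y: tau starts as the identity and deletes
# the row for the missing variable X, leaving a 1 x 2 matrix
tau_incomplete = np.array([[0.0, 1.0]])
Sigma_i = np.array([[var_y]])  # the scalar sigma_Y^2
block_incomplete = tau_incomplete.T @ np.linalg.inv(Sigma_i) @ tau_incomplete

# The product pads zeros everywhere except the (Y, Y) cell
expected = np.array([[0.0, 0.0],
                     [0.0, 1.0 / var_y]])
print(block_complete)
print(block_incomplete)
```

Summing these per-case blocks (with a leading minus sign) reproduces the upper-left block of the Hessian in Equation 3.16.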

Observed versus Expected Information


The previous expressions produce standard errors based on the observed information.
This name derives from the fact that the derivative expressions include deviation scores
based on the observed data values. In contrast, expected information involves a com-
putational shortcut that replaces the deviation scores in those expressions with their
expectations or long-run averages. For example, this operation replaces each (Yi – μi)
term in Equation 3.15 with a vector of zeros, which are the expected averages of the
deviation scores. Substituting expectations eliminates dependence on the raw data and
simplifies the off-­diagonal and lower-­right blocks of the Hessian as follows (Savalei,
2010; Yuan & Bentler, 2000):

$$\frac{\partial^2 LL}{\partial \Sigma^2} = -\frac{1}{2} \sum_{i=1}^{N} \mathbf{D}_V' \left\{ \tau_i' \Sigma_i^{-1} \tau_i \otimes \tau_i' \Sigma_i^{-1} \tau_i \right\} \mathbf{D}_V \tag{3.17}$$

$$\frac{\partial^2 LL}{\partial \mu\, \partial \Sigma} = \mathbf{0}$$
Maximum Likelihood Estimation with Missing Data 111

The derivatives with respect to the means are unchanged, because the precision of these
parameters depends only on Σ.
With complete data, standard errors based on the observed and expected informa-
tion are often indistinguishable. With missing data, however, Kenward and ­Molenberghs
(1998) showed that standard errors based on the expected information require an unsys-
tematic MCAR mechanism, whereas standard errors based on the observed information are
valid with conditionally MAR processes (technically, a missing always at random pro-
cess). The differential treatment of the deviation scores in Equation 3.15 is the issue, as
assigning zeros to the observed deviation scores works fine if missing values are equally
distributed above and below the estimated means, as would be the case with purely
random missingness. However, this substitution is not optimal for a conditionally MAR
process that culls values from one tail of the distribution, leaving more observations
above the estimated mean than below it (or vice versa). Such is the case in Figure 3.3,
where most of the observed leader–­member exchange scores (i.e., the gray circles) are
above the maximum likelihood mean estimate (i.e., to the right of the black dot). Simu-
lation studies show that standard errors based on observed information are preferable
in this case, as the expected information tends to attenuate standard errors and inflate
Type I error rates (Kenward & Molenberghs, 1998; Savalei, 2010).
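The asymmetry described above is easy to see in a small simulation (mine, not from the book): when missingness culls cases from one tail, the deviations of the observed scores around the mean no longer average to zero, which is exactly the quantity the expected information replaces with zeros. The deletion rule and parameter values below are arbitrary.

```python
# Illustrative simulation: under a conditionally MAR process that deletes
# X whenever Y falls in the lower tail, the observed deviation scores are
# not centered at zero, so the expected information's substitution of
# zero expectations is inaccurate.
import random

random.seed(1)
n = 20000
xs, ys = [], []
for _ in range(n):
    y = random.gauss(0.0, 1.0)
    x = 0.6 * y + random.gauss(0.0, 0.8)   # X and Y are correlated
    xs.append(x)
    ys.append(y)

observed_x = [x for x, y in zip(xs, ys) if y > -0.5]  # MAR deletion rule

full_mean = sum(xs) / len(xs)
observed_mean = sum(observed_x) / len(observed_x)
print(round(full_mean, 3), round(observed_mean, 3))
```

The observed-case mean sits noticeably above the full-data mean, so deviations around the maximum likelihood estimate lean positive rather than averaging out, matching the leader–member exchange pattern in Figure 3.3.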
To illustrate the difference between observed and expected information, I applied
both approaches to the artificial data from Figure 3.1. Table 3.3 shows the variance–­
covariance matrices of the estimates, the diagonals of which contain squared standard
errors. Notice that substituting expectations fills the off-­diagonal blocks with zeros,
whereas using the observed data produces some nonzero values. Other elements also
differ, as do the standard errors of parameters impacted by missing data. For example,
the standard error of μ̂X (e.g., the leader–­member exchange average) based on observed
information is SE = √0.066 = 0.27, whereas using the expected information shrinks this

TABLE 3.3. Variance–Covariance Matrix of Estimates Computed Using Observed and Expected Information

Parameter μX μY σX2 σXY σY2
Observed information
μX 0.066
μY 0.023 0.065
σX2 –0.094 0 1.208
σXY –0.132 0 0.937 1.480
σY2 0 0 0.340 0.950 2.654

Expected information
μX 0.050
μY 0.023 0.065
σX2 0 0 1.065
σXY 0 0 0.737 1.200
σY2 0 0 0.340 0.950 2.654

value to SE = √0.050 = 0.22. The differences in the table agree with published simulation
studies showing that expected information often attenuates standard errors and inflates
Type I error rates when scores are conditionally MAR (Kenward & ­Molenberghs, 1998;
Savalei, 2010).

3.5 THE EXPECTATION MAXIMIZATION ALGORITHM

Even tidy estimation problems like those in Chapter 2 no longer have analytic solutions
with missing data, making iterative optimization algorithms a necessity. Newton’s algo-
rithm works with derivatives of the observed-­data log-­likelihood from Equation 3.5, but
the procedure is otherwise the same as that in Section 2.9. The EM algorithm (Dempster
et al., 1977; Rubin, 1991) takes the very different tack of filling in rather than removing
the missing parts of the complete-­data log-­likelihood from Equation 3.3. Conceptually,
EM is a tool for solving the chicken or the egg dilemma in which knowing the missing
values would lead to solutions for the estimates and having the estimates would provide
the necessary information for predicting the missing values. The algorithm leverages
this interdependence by “imputing” the missing data given the current parameter values
and then updating the parameters given the filled-­in data. The idea is that each succes-
sive iteration gives better predictions about the missing values, which in turn improve
the estimates, which in turn sharpen the missing values, and so on.
I use the word “imputing” in air quotes, because EM doesn’t literally fill in the data.
Rather, the algorithm uses the parameter estimates to predict the missing parts of the
complete-­data log-­likelihood function. In most cases, these imputed terms are functions
of the missing values rather than the missing values themselves (e.g., expected values).
This is an important practical point, because published research articles often describe
EM as an imputation method. Software packages that use EM-­generated parameter esti-
mates to implement flawed regression imputation schemes no doubt contribute to this
confusion (e.g., the Missing Values Analysis module in SPSS; von Hippel, 2004).
EM’s origins trace to the early 70s (Baum, Petrie, Soules, & Weiss, 1970; Beale &
Little, 1975; Orchard & Woodbury, 1972), and Dempster et al. (1977) formalized the
algorithm and gave it a name. EM has since evolved into a very general optimization
tool and has enjoyed widespread use with latent variable models that treat unmeasured
latent scores as missing data. Such applications include factor analysis, structural equa-
tion models, multilevel models, finite mixture models, and item response theory models,
to name a few (Bock & Aitkin, 1981; Cai, 2008; Jamshidian & Bentler, 1999; Liang &
Bentler, 2004; McLachlan & Krishnan, 2007; Muthén & Shedden, 1999; Raudenbush
& Bryk, 2002). It is important to highlight that EM’s two-step logic—­impute the miss-
ing data, then update the parameters—­appears throughout the book, as Markov chain
Monte Carlo (MCMC) algorithms for Bayesian estimation and multiple imputation apply
the same recipe. These procedures estimate parameter values and missing data by draw-
ing the unknown quantities from a probability distribution (i.e., they use computer simu-
lation to generate artificial values), but they are essentially EM algorithms with random
noise added to account for missing data uncertainty. Rubin (1991) provides an excellent
tutorial on the EM algorithm that describes some of these linkages and extensions.

A bivariate analysis with a single incomplete variable (e.g., Figure 3.1) provides
a closer look at the EM algorithm. To keep notation simple, I generically refer to the
incomplete and complete variables as X and Y, respectively. The EM algorithm works
with the hypothetical complete-­data log-­likelihood that would have resulted had there
been no missing values (see Equation 3.3). We know from Section 2.12 that the maxi-
mum likelihood estimates for this scenario have the following solutions:
$$\hat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad \hat{\mu}_Y = \frac{1}{N}\sum_{i=1}^{N} Y_i \tag{3.18}$$

$$\hat{\sigma}_X^2 = \frac{1}{N}\left[\sum_{i=1}^{N} X_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} X_i\right)^{\!2}\right] \qquad \hat{\sigma}_Y^2 = \frac{1}{N}\left[\sum_{i=1}^{N} Y_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} Y_i\right)^{\!2}\right]$$

$$\hat{\sigma}_{XY} = \frac{1}{N}\left[\sum_{i=1}^{N} X_i Y_i - \frac{1}{N}\left(\sum_{i=1}^{N} X_i\right)\left(\sum_{i=1}^{N} Y_i\right)\right]$$
Missing values introduce holes in the sums and sums of cross-­products terms that
define the sufficient statistics for computing μ and Σ. I previously characterized EM as
a tool for solving the chicken or the egg dilemma: Knowing the missing values on the
right side of the equation would lead to solutions for the estimates and having the esti-
mates on the left side of the equation would provide the information necessary to predict
the missing values. The algorithm tackles the dilemma by iterating between two steps:
the expectation step (E-step) addresses the missing values, and the maximization step
(M-step) updates the estimates.
The E-step treats the observed data and current parameter values at iteration t as
known constants and imputes the missing parts of the sums and sums of cross-­products
terms with expectations or averages. The bivariate example requires “imputations” for
the missing X, X2, and XY values. With multivariate normal data, a linear regression
model generates predictions for the missing terms.

$$E\left(X \mid Y, \mu^{(t)}, \Sigma^{(t)}\right) = \gamma_0^{(t)} + \gamma_1^{(t)} Y_i \tag{3.19}$$

$$E\left(X^2 \mid Y, \mu^{(t)}, \Sigma^{(t)}\right) = \left(\gamma_0^{(t)} + \gamma_1^{(t)} Y_i\right)^2 + \sigma_{X|Y}^{2(t)}$$

$$E\left(XY \mid Y, \mu^{(t)}, \Sigma^{(t)}\right) = Y_i \times E\left(X \mid Y\right) = Y_i\left(\gamma_0^{(t)} + \gamma_1^{(t)} Y_i\right)$$

As you can see, the expectation of X is a predicted value from the regression, and the expectation
of X2 is a squared predicted score plus a residual variance term that captures the
expected spread of the missing data. Finally, the regression parameters are straightfor-
ward functions of the estimates in μ(t) and Σ(t).

γ1( ) =σ(XY) / σY( )


t t 2t
(3.20)
(t ) (t ) (t ) (t )
γ 0 = μ X − γ1 μ Y
σ X(|Y) = σ X( ) − γ1 ( ) σY( )
2t 2t 2t 2t

The M-step identifies updated parameter values for the next iteration by substitut-
ing the observed data and the expectations into the complete-­data formulae in Equation
3.18. The updated estimates for the bivariate example are as follows:

$$\mu_X^{(t+1)} = \frac{1}{N}\left[\sum_{i=1}^{n_C} X_i + \sum_{i=1}^{n_M} E\left(X_i \mid Y_i\right)\right] \tag{3.21}$$

$$\sigma_X^{2(t+1)} = \frac{1}{N}\left[\sum_{i=1}^{n_C} X_i^2 + \sum_{i=1}^{n_M} E\left(X_i^2 \mid Y_i\right) - \frac{1}{N}\left(\sum_{i=1}^{n_C} X_i + \sum_{i=1}^{n_M} E\left(X_i \mid Y_i\right)\right)^{\!2}\right]$$

$$\sigma_{XY}^{(t+1)} = \frac{1}{N}\left[\sum_{i=1}^{n_C} X_i Y_i + \sum_{i=1}^{n_M} Y_i\, E\left(X_i \mid Y_i\right) - \frac{1}{N}\sum_{i=1}^{N} Y_i \left(\sum_{i=1}^{n_C} X_i + \sum_{i=1}^{n_M} E\left(X_i \mid Y_i\right)\right)\right]$$
These equations clarify that EM is not filling in the missing values themselves, but
rather the missing parts of the sums and sums of cross-­products terms needed to com-
pute the complete-­data log-­likelihood function.
The updated estimates at the M-step carry forward to the next E-step, where the
algorithm uses them to generate new and improved estimates of the missing data, after
which it again updates the parameters. This two-step sequence repeats until the esti-
mates from consecutive M-steps no longer differ. Dempster et al. (1977) and others
(Little & Rubin, 2020; Schafer, 1997) show that the updated estimates from the M-step
always improve on those from the previous iteration in the sense that they increase the
observed-­data log-­likelihood. As such, EM is achieving the same goal as Newton’s algo-
rithm, albeit with a different approach that simplifies the iterative computations.
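The two-step recipe can be coded in a few dozen lines. The sketch below is mine (it uses simulated data, not the book's artificial data set or its companion R program) and implements Equations 3.19 through 3.21 for a bivariate normal with missing X values.

```python
# Hand-coded EM for a bivariate normal with X missing (Equations 3.18-3.21).
# Each record is (x_or_None, y); Y is complete, X may be None.
import random

def em_bivariate(data, tol=1e-6, max_iter=1000):
    n = len(data)
    ys = [y for _, y in data]
    mu_y = sum(ys) / n                              # Y is complete, so its
    var_y = sum((y - mu_y) ** 2 for y in ys) / n    # estimates are closed form
    mu_x, var_x, cov = 0.0, 1.0, 0.0                # deliberately poor starts
    for _ in range(max_iter):
        # E-step (Equations 3.19-3.20): regression of X on Y
        g1 = cov / var_y
        g0 = mu_x - g1 * mu_y
        res_var = var_x - g1 ** 2 * var_y
        sum_x = sum_x2 = sum_xy = 0.0
        for x, y in data:
            if x is None:                           # fill in the sufficient statistics
                ex = g0 + g1 * y
                sum_x += ex
                sum_x2 += ex ** 2 + res_var
                sum_xy += y * ex
            else:
                sum_x += x
                sum_x2 += x ** 2
                sum_xy += x * y
        # M-step (Equation 3.21): complete-data formulas with expectations filled in
        new_mu_x = sum_x / n
        new_var_x = sum_x2 / n - new_mu_x ** 2
        new_cov = sum_xy / n - new_mu_x * mu_y
        change = max(abs(new_mu_x - mu_x), abs(new_var_x - var_x), abs(new_cov - cov))
        mu_x, var_x, cov = new_mu_x, new_var_x, new_cov
        if change < tol:
            break
    return mu_x, mu_y, var_x, var_y, cov

random.seed(7)
data = []
for _ in range(500):
    y = random.gauss(10.0, 2.0)
    x = 2.0 + 0.8 * y + random.gauss(0.0, 1.0)
    data.append((x if y > 8.0 else None, y))        # conditionally MAR on Y

print([round(v, 3) for v in em_bivariate(data)])
```

With complete data, the first M-step already returns the closed-form estimates in Equation 3.18, so the loop exits almost immediately; with missing values it typically takes a few dozen iterations, echoing Table 3.4, which likewise tracks only the X-side parameters.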

Analysis Example
To illustrate iterative optimization with missing data, I used the EM algorithm to estimate
the mean vector and covariance matrix of the artificial data from Figure 3.1. A custom R
program is available on the companion website for readers interested in coding EM by
hand, as is a program that implements Newton’s algorithm with missing data. To highlight
the resilience of the algorithm to (really) poor starting values, I fixed the initial means to
0 and set the variance–­covariance matrix to an identity matrix (a matrix with ones on
the diagonal and zeros elsewhere). Finally, I terminated the algorithm when the estimates
from consecutive iterations differed by less than .000001, as changes of this magnitude
effectively signal that the algorithm has identified the maximum likelihood estimates.
Table 3.4 shows the iterative updates to the estimates and the observed-­data log-­
likelihood (the complete-­data parameters converge immediately, so I omit these values
from the table). Consistent with Newton’s algorithm from Chapter 2, EM makes large
adjustments to the parameters at first and tiny alterations as it approaches the maximum
likelihood estimates. By the final cycle, the estimates are changing only in the fifth
decimal, so there is no reason to continue iterating. An attractive feature of EM is that
it doesn’t directly manipulate the observed-­data log-­likelihood function, the composi-
tion of which changes across missing data patterns. As such, the values in the rightmost

TABLE 3.4. Iterative Updates from the EM Algorithm


Iteration μX σX2 σYX Log-likelihood
0 0.000000 1.000000 0.000000 –88,619.29320174140
1 5.442322 15.718141 32.372663 –1,399.53083691875
2 7.106733 16.343679 30.271614 –1,363.67778317532
3 7.867007 14.589867 24.200638 –1,345.63020984186
4 8.336837 12.840014 19.337030 –1,332.93459374866
5 8.667199 11.448826 16.000193 –1,324.59906381095
6 8.909155 10.395820 13.803467 –1,319.57991726612
7 9.088411 9.608737 12.373312 –1,316.74317423439
8 9.221634 9.022384 11.441578 –1,315.19543358861
9 9.320728 8.585959 10.830359 –1,314.36178649766
10 9.394453 8.261205 10.425111 –1,313.91266145061
... ... ... ... ...
37 9.608704 7.317397 9.520094 –1,313.37471009347
38 9.608723 7.317315 9.520035 –1,313.37471009347

column of the table are not a natural by-­product of EM, but I include them to highlight
the important conclusion that each iteration is guaranteed to improve on the previous
one (Dempster et al., 1977). Conceptually, these values demonstrate that the algorithm
is “hiking” to a higher elevation on the log-­likelihood surface every time it completes an
M-step, just like Newton’s algorithm (albeit with simpler math). Although EM doesn’t
naturally produce standard errors, methodologists have developed procedures for com-
puting these quantities (Cai, 2008; Little & Rubin, 2020; Louis, 1982; McLachlan &
Krishnan, 2007; Meng & Rubin, 1991), and marrying EM and the bootstrap is also an
option.

3.6 LINEAR REGRESSION

Having established its core ideas, we can extend maximum likelihood missing data
handling to regression models with directed pathways. Estimation is simple if missing
values are relegated to the outcome variable, in which case deleting incomplete data
records gives the optimal maximum likelihood estimates (Little, 1992; von ­Hippel,
2007). The situation is more complex and nuanced when explanatory variables are
incomplete, especially when one or more of the predictors are categorical. Rewinding
back to Section 2.10, explanatory variables functioned as known constants in the log-­
likelihood expression, and the normal curve assumption applied only to the outcome
(or its residuals, more precisely). The situation changes when predictors are incomplete,
because the covariates require their own probability distribution. This section describes
structural equation (Arbuckle, 1996; Muthén et al., 1987; Wothke, 2000) and factored
regression modeling (Ibrahim et al., 2002; Lipsitz & Ibrahim, 1996; Lüdtke et al., 2020a)
approaches to this problem. As you will see, the former generally foists a normal distribution
on the predictors, whereas the latter offers a more flexible specification that
accommodates mixed response types.
Switching gears to a different substantive context, I use the smoking data from the
companion website to illustrate a multiple regression analysis. The data set includes
several sociodemographic correlates of smoking intensity from a survey of N = 2,000
young adults (e.g., age, whether a parent smoked, gender, income). Piggybacking on
the data analysis example from Section 2.10, the model uses a parental smoking indica-
tor (0 = parents did not smoke, 1 = parent smoked), age, and income to predict smoking
intensity, defined as the number of cigarettes smoked per day. The model and its generic
counterpart are as follows:

$$\text{INTENSITY}_i = \beta_0 + \beta_1(\text{PARSMOKE}_i) + \beta_2(\text{INCOME}_i) + \beta_3(\text{AGE}_i) + \varepsilon_i \tag{3.22}$$

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i \qquad \varepsilon_i \sim N_1\!\left(0, \sigma_\varepsilon^2\right)$$
The smoking intensity variable has 21.2% missing data, 3.6% of the parental smoking
indicator scores are missing, and 11.4% of the income values are unknown. Note that I
list the incomplete regressors first, because this will facilitate the factorization process
described below.

Joint Model versus Factored Regression Specification


The structural equation modeling and factored regression frameworks both introduce
probability distributions for the incomplete predictors, but they do so in different ways.
The “classic” structural equation modeling estimator specifies a multivariate normal
distribution for the analysis variables (Arbuckle, 1996). Using generic notation, this
joint distribution is f(Y, X1, X2, X3). In contrast, the factored regression approach uses
the probability chain rule to factorize the multivariate distribution into a product of
univariate distributions, each of which corresponds to a regression model.

$$f(Y, X_1, X_2, X_3) = f(Y \mid X_1, X_2, X_3) \times f(X_1 \mid X_2, X_3) \times f(X_2 \mid X_3) \times f(X_3) \tag{3.23}$$

The first term to the right of the equals sign is the distribution induced by the analysis
model (e.g., Equation 3.22), and the remaining terms are regressions that define predic-
tor distributions. This idea resurfaces later in the context of Bayesian estimation and
multiple imputation, where the multivariate distribution on the left is consistent with
the joint model imputation framework (Asparouhov & Muthén, 2010c; Carpenter &
Kenward, 2013; Schafer, 1997), and the expression on the right is called a sequential
specification (Erler, Rizopoulos, Jaddoe, Franco, & Lesaffre, 2019; Erler et al., 2016;
Ibrahim et al., 2002; Ibrahim et al., 2005; Lipsitz & Ibrahim, 1996; Lüdtke, Robitzsch,
& West, 2020b).
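A quick way to internalize the chain-rule factorization is to note that, on the log scale, it turns the joint density into a sum of regression log-densities. The sketch below uses made-up normal models for every term; none of the coefficients come from the book, and the continuous-only setup is purely for illustration.

```python
# Sketch of Equation 3.23 with three continuous variables: the joint
# log-density is the sum of a focal-model term and the covariate-model
# terms. All coefficients are hypothetical.
import math

def normal_logpdf(value, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)

def joint_loglik(y, x1, x2):
    ll = normal_logpdf(y, 1.0 + 0.5 * x1 + 0.3 * x2, 1.0)  # f(Y | X1, X2)
    ll += normal_logpdf(x1, 0.2 + 0.4 * x2, 1.5)           # f(X1 | X2)
    ll += normal_logpdf(x2, 0.0, 2.0)                      # f(X2)
    return ll

print(round(joint_loglik(1.2, 0.5, -0.3), 4))
```

Because each term is a separate regression, swapping f(X1 | X2) for a logistic or probit model only changes one line, which is precisely the flexibility the factored specification buys.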
Structural equation and factored regression models will give the same answer
when the data are normal, because the multivariate normal distribution always spawns
an equivalent set of linear regression models (Arnold, Castillo, & Sarabia, 2001; Liu,
Maximum Likelihood Estimation with Missing Data 117

Gelman, Hill, Su, & Kropko, 2014). The methods diverge with categorical predictors,
because no common joint distribution describes the co-­occurrence of discrete and con-
tinuous scores. Therefore, the classic structural equation model effectively foists a nor-
mal distribution on the predictors regardless of their metrics. Limited computer simu-
lations suggest this misspecification might not be detrimental with binary covariates
(Muthén et al., 2016), but it is nonsensical for multicategorical nominal predictors. In
contrast, the factorization on the right side of Equation 3.23 makes no assumption about
the multivariate distribution on the left; f(X1|X2, X3) could be a logistic or probit regres-
sion, f(X2|X3) could be a linear regression, and so on. In fact, the distinction between the
two modeling frameworks is not as clean as my description suggests, as some structural
equation modeling programs accommodate mixtures of categorical and continuous out-
comes in a way that is effectively equivalent to a factored regression model (Muthén et
al., 2016; Pritikin et al., 2018). With this advance organizer in mind, let’s dig into the two
modeling frameworks, paying particular attention to the parental smoking indicator.

Structural Equation Modeling Framework


A simple regression is sufficient for describing the main features of the structural equa-
tion model specification. As mentioned previously, the multivariate normal distribution
is virtually standard in this framework. From a practical perspective, this means that
each variable in the analysis functions as the outcome variable in a regression. The two
regression equations for a simple regression analysis are

X i =γ 0 + ri (3.24)

Yi = β0 + β1 X i + ε i

(
ri ~ N1 0, σ2r ) (
ε i ~ N1 0, σ2ε )
where γ0 and β0 denote the grand mean and regression intercept, respectively, β1 is the
slope coefficient, and r i and εi are normally distributed residuals with variances σr2 and
σε2. I generally use γ’s to denote nuisance parameters that we wouldn’t have estimated
had the data been complete, and here these coefficients represent features of the predic-
tor distribution.
An important feature of a structural equation model is that its parameters combine
to produce predictions about the population means and variance–­covariance matrix. I
refer to these model-­predicted or model-­implied moments as μ(θ) and Σ(θ) to differenti-
ate them from the μ and Σ arrays in previous sections. The regression model parameters
from Equation 3.24 make the following predictions about the population mean vector
and covariance matrix:
$$\mu(\theta) = \begin{pmatrix} \mu_X(\theta) \\ \mu_Y(\theta) \end{pmatrix} = \begin{pmatrix} \gamma_0 \\ \beta_0 + \beta_1 \gamma_0 \end{pmatrix} \tag{3.25}$$

$$\Sigma(\theta) = \begin{pmatrix} \sigma_X^2(\theta) & \sigma_{XY}(\theta) \\ \sigma_{YX}(\theta) & \sigma_Y^2(\theta) \end{pmatrix} = \begin{pmatrix} \sigma_r^2 & \beta_1 \sigma_r^2 \\ \beta_1 \sigma_r^2 & \beta_1^2 \sigma_r^2 + \sigma_\varepsilon^2 \end{pmatrix}$$

These expressions have intuitive meaning. For example, the mean of Y is the value
that results from substituting the mean of X into the regression equation, and Y’s vari-
ance has an explained component due to the predictor and leftover residual part. Linear
regression models like this one will always perfectly predict the sample moments (i.e.,
μ̂(θ) = μ̂ and Σ̂(θ) = Σ̂), but this won't generally be true for more complex models (e.g., a
confirmatory factor analysis model, a path model with omitted arrows).
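Equation 3.25 is easy to verify numerically. The snippet below (with hypothetical parameter values, not estimates from any data set in this chapter) builds the model-implied mean vector and covariance matrix from the five structural parameters.

```python
# Model-implied moments for the simple regression model (Equation 3.25).
# Parameter values are hypothetical.

def implied_moments(gamma0, beta0, beta1, var_r, var_e):
    """Map the structural parameters to mu(theta) and Sigma(theta)."""
    mu = [gamma0, beta0 + beta1 * gamma0]
    sigma = [[var_r, beta1 * var_r],
             [beta1 * var_r, beta1 ** 2 * var_r + var_e]]
    return mu, sigma

mu, sigma = implied_moments(gamma0=10.0, beta0=2.0, beta1=0.8, var_r=4.0, var_e=1.0)
print(mu)     # mean of Y = beta0 + beta1 * (mean of X)
print(sigma)  # var(Y) = explained part (beta1^2 * var_r) plus residual var_e
```

The lower-right element illustrates the variance decomposition noted above: 0.8² × 4 of explained variance plus 1.0 of residual variance.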
Because it assumes multivariate normality, maximum likelihood estimation for
structural equation models borrows heavily from concepts we’ve already covered. For
example, the observed-­data log-­likelihood replaces the population mean vector and cova-
riance matrix in Equation 3.6 with their model-­implied counterparts (Arbuckle, 1996).

$$LL\left(\mu, \Sigma \mid Y_{(\text{obs})}\right) = \text{constant} - \frac{1}{2}\sum_{i=1}^{N} \ln \left|\Sigma_i(\theta)\right| - \frac{1}{2}\sum_{i=1}^{N} \left(Y_i - \mu_i(\theta)\right)' \Sigma_i^{-1}(\theta)\left(Y_i - \mu_i(\theta)\right) \tag{3.26}$$

Returning to the bivariate log-­likelihood expressions in Equations 3.7 and 3.8, the com-
posite or model-­implied parameters on the right side of Equation 3.25 replace their
normal distribution counterparts (e.g., individual deviation scores reflect distances
between X and γ0 and Y and β0 + β1γ0), but the expressions are otherwise the same.
Importantly, every variable in the analysis appears in the Y vector regardless of its role
in the model. This feature is vital, because missing data handling requires a distribution
for the outcome and the predictors. As before, the equation says that the data’s evidence
about the model parameters is restricted to those parameters for which an individual
has scores. Some data records provide more information than others, but the overall
goal remains the same—­identify the regression model parameters that maximize fit (or
minimize differences) between the observed data and model-­implied mean vector and
covariance matrix.
The standard errors for the structural equation model parameters also borrow heav-
ily from previous ideas. As you know, the matrix of second derivatives (the Hessian)
provides the building blocks for standard errors. Replacing μi and Σi in Equation 3.15
with μi(θ) and Σi(θ) gives the second derivative matrix for the model-­implied mean vec-
tor and covariance matrix, and multiplying the Hessian by –1 gives the observed infor-
mation matrix.
$$I_O\left(\mu(\theta), \Sigma(\theta)\right) = -H_O\left(\mu(\theta), \Sigma(\theta)\right) = -\begin{pmatrix} \dfrac{\partial^2 LL}{\partial \mu(\theta)^2} & \dfrac{\partial^2 LL}{\partial \mu(\theta)\, \partial \Sigma(\theta)} \\[2ex] \dfrac{\partial^2 LL}{\partial \Sigma(\theta)\, \partial \mu(\theta)} & \dfrac{\partial^2 LL}{\partial \Sigma(\theta)^2} \end{pmatrix} \tag{3.27}$$

This matrix isn’t exactly what we need, however, because it reflects the data’s informa-
tion about the composite parameters in μ(θ) and Σ(θ). Introducing an additional matrix
that distributes the model-­implied information to the appropriate structural model
parameters provides the standard errors.
Returning to Equation 3.25, each mean, variance, and covariance in μ(θ) and Σ(θ) is
a weighted combination of the regression model parameters. The new array in question
summarizes these linkages in a matrix containing weights or coefficients that capture
the amount by which the model-­implied moments in the rows change as a function of
the regression parameters in the columns. Table 3.5 shows the coefficient matrix for
the simple regression model. The lone 1 in the first row indicates that γ0 is the sole
determinant of μX(θ) and no other structural model parameters contribute to this mean.
Similarly, the three nonzero coefficients in the second row reflect the amount by which
μY(θ) changes as a function of γ0, β0, and β1 (e.g., the β1 in the first column weights γ0’s
contribution to the mean, the γ0 in the third column reflects β1’s influence). To keep
notation simple, I denote the coefficient matrix in the table as Δ (technically, the weights
are derivatives of the model-­implied moments with respect to the regression model
parameters). Finally, substituting the maximum likelihood estimates, pre- and post-­
multiplying the information matrix by Δ, then taking the inverse (the matrix analogue
of a reciprocal) gives the variance–­covariance matrix of the regression model parameters.

$$\hat{\Sigma}\left(\hat{\theta}\right) = \left( \hat{\Delta}'\, I_O\left(\mu(\hat{\theta}), \Sigma(\hat{\theta})\right) \hat{\Delta} \right)^{-1} \tag{3.28}$$

As always, the diagonal elements of the parameter covariance matrix contain squared
standard errors. The expression is somewhat complicated, but all it’s doing is reappor-
tioning the data’s information about the model-­implied mean vector and covariance
matrix to the appropriate structural model parameters. Interested readers can consult
Savalei and Rosseel (2021) for a more detailed account of estimation for structural equa-
tion models.
Returning to the multiple regression from Equation 3.22, the structural model
represents the analysis as four linear regression equations with normally distributed
residuals.

$$\text{INTENSITY}_i = \beta_0 + \beta_1(\text{PARSMOKE}_i) + \beta_2(\text{INCOME}_i) + \beta_3(\text{AGE}_i) + \varepsilon_i \tag{3.29}$$

$$\text{PARSMOKE}_i = \gamma_{01} + r_{1i} \qquad \text{INCOME}_i = \gamma_{02} + r_{2i} \qquad \text{AGE}_i = \gamma_{03} + r_{3i}$$

Figure 3.4 depicts the regressions as a path diagram, with rectangles denoting manifest
(measured) variables, circles representing latent variables or residuals, straight arrows

TABLE 3.5. Matrix of Derivatives Linking the Model-Implied Moments to the Focal Model Parameters

                        Structural model parameters
Moments      γ0      β0      β1         σr2     σε2
μX(θ)        1       0       0          0       0
μY(θ)        β1      1       γ0         0       0
σX2(θ)       0       0       0          1       0
σXY(θ)       0       0       σr2        β1      0
σY2(θ)       0       0       2β1σr2     β12     1
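The weights in Table 3.5 can be checked numerically: because each entry is the derivative of a model-implied moment with respect to a structural parameter, finite differences of the moment function should reproduce them. The sketch below is illustrative (hypothetical parameter values, not book estimates).

```python
# Numerical check of Table 3.5: finite differences of the model-implied
# moments with respect to the structural parameters recover the analytic
# weights. Parameter values are hypothetical.

def moments(theta):
    gamma0, beta0, beta1, var_r, var_e = theta
    return [gamma0,                      # mu_X(theta)
            beta0 + beta1 * gamma0,      # mu_Y(theta)
            var_r,                       # var_X(theta)
            beta1 * var_r,               # cov_XY(theta)
            beta1 ** 2 * var_r + var_e]  # var_Y(theta)

def delta_matrix(theta, h=1e-6):
    """Rows index the five moments; columns index the five parameters."""
    base = moments(theta)
    cols = []
    for j in range(len(theta)):
        bumped = list(theta)
        bumped[j] += h
        cols.append([(m1 - m0) / h for m1, m0 in zip(moments(bumped), base)])
    return [[cols[j][i] for j in range(len(theta))] for i in range(len(base))]

theta = [10.0, 2.0, 0.8, 4.0, 1.0]  # gamma0, beta0, beta1, var_r, var_e
delta = delta_matrix(theta)
print([round(v, 4) for v in delta[1]])  # row for mu_Y: [beta1, 1, gamma0, 0, 0]
```

The last row similarly returns 2β1σr2 in the β1 column, β12 in the σr2 column, and 1 in the σε2 column, matching the table (with β1 = 0.8 and σr2 = 4, those weights are 6.4, 0.64, and 1).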


FIGURE 3.4. Path diagram of a three-­predictor regression model treating predictors as ran-
dom variables. Incomplete predictors are linked via correlations (curved arrows).

depicting regression coefficients, and double-headed curved arrows symbolizing variances
and covariances. The residuals attached to the predictors signify that the variables
have a distribution, and the double-headed curved arrows their correlations. As
mentioned previously, the model incorrectly assumes that parental smoking is normally
distributed. Limited computer simulation evidence suggests that treating binary predic-
tors as normal may be fine in some situations (Muthén et al., 2016), but the specification
is nevertheless awkward, because the estimator implicitly “imputes” continuous scores
as it iterates to a solution.
While the classic structural equation modeling estimator (Arbuckle, 1996) doesn’t
accommodate categorical predictors, some structural equation modeling frameworks do
accommodate mixtures of categorical and continuous outcomes (Muthén et al., 2016;
Pritikin et al., 2018). This raises the possibility of treating a covariate as discrete by
regressing it on other predictors; that is, instead of treating respondent age and the
parental smoking indicator as exogenous variables that link to one another via a cor-
related residual, you specify a model where age predicts the parental smoking indicator
and parental smoking in turn predicts smoking intensity. The regression of parental
smoking on age could be a logistic or probit model (Agresti, 2012; Johnson & Albert,
1999). This specification is not proposing a presumed causal or theoretical ordering for
the predictor variables. Rather, replacing a curved arrow with a straight arrow is simply
a mathematical device for linking the predictors in a way that honors the binary vari-
able’s metric. In fact, this specification is effectively equivalent to the factored regression
model that I describe next.

Factored Regression Specification


Factored regression models use the probability chain rule to convert the multivariate
distribution of the variables into the product of two or more univariate conditional dis-
tributions (Ibrahim et al., 2005; Lipsitz & Ibrahim, 1996; Lüdtke et al., 2020a). For a
regression model with K predictors, this factorization is

$$f(Y, X_1, \ldots, X_K) = f(Y \mid X_1, \ldots, X_K) \times f(X_K \mid X_1, \ldots, X_{K-1}) \times \cdots \times f(X_2 \mid X_1) \times f(X_1) \tag{3.30}$$

where each “f of something” represents a univariate probability distribution. The equation
decomposes the multivariate distribution on the left side of the expression into the
product of univariate distributions, each of which can be viewed as a regression model; the
f(Y|X1, . . ., XK) term corresponds to the focal analysis model, and the remaining regres-
sions link predictors to each other.
An important feature of the factored regression approach is that it makes no
assumption about the multivariate distribution on the left side of the expression; the
joint distribution could be a normal curve or it could be something very complex that
doesn’t have an established form. Breaking the distribution into a sequence of simpler
univariate models provides two important advantages over the classic structural equa-
tion modeling framework. First, the model naturally accommodates mixtures of cat-
egorical variables. For example, f(XK|X1, . . ., XK–1) could be a linear regression, f(X2|X1)
could be a logistic or probit regression, and so on. Second, this specification allows for
interactive or nonlinear effects, both in the focal model and in the covariate models.
This feature offers a major advantage over a joint modeling framework, which is prone
to bias when one of the interacting variables is incomplete (Enders et al., 2014; Lüdtke
et al., 2020a; Zhang & Wang, 2017).
When specifying a factored regression model, listing the incomplete predictors
prior to the complete regressors facilitates estimation, because the latter terms can be
ignored (i.e., treated as known constants). Applying this strategy to the smoking inten-
sity analysis gives the following factorization:

$$f(\text{INTENSITY} \mid \text{PARSMOKE}, \text{INCOME}, \text{AGE}) \times f(\text{PARSMOKE} \mid \text{INCOME}, \text{AGE}) \times f(\text{INCOME} \mid \text{AGE}) \times f(\text{AGE}) \tag{3.31}$$

I ultimately drop the rightmost term, because age is complete and does not require a dis-
tribution. The parental smoking distribution, f(PARSMOKE|INCOME, AGE), could be a
logistic or probit regression. I use logistic regression, but the choice is arbitrary, because
this model is not the substantive focus. The generic functions above translate into the
following regression models:

$$\text{INTENSITY}_i = \beta_0 + \beta_1(\text{PARSMOKE}_i) + \beta_2(\text{INCOME}_i) + \beta_3(\text{AGE}_i) + \varepsilon_i \tag{3.32}$$

$$\ln\left[\frac{\Pr(\text{PARSMOKE}_i = 1)}{1 - \Pr(\text{PARSMOKE}_i = 1)}\right] = \gamma_{01} + \gamma_{11}(\text{INCOME}_i) + \gamma_{21}(\text{AGE}_i)$$

$$\text{INCOME}_i = \gamma_{02} + \gamma_{12}(\text{AGE}_i) + r_{2i}$$

As a reminder, I use γ’s throughout the book to differentiate supporting model param-
eters from the focal model’s coefficients.
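To see what the logistic covariate model in Equation 3.32 does mechanically, the snippet below converts a linear predictor for the parental smoking indicator into a probability via the inverse logit. The coefficient values are invented for illustration, not estimates from the smoking data.

```python
# Sketch of the logistic model for the parental smoking indicator in
# Equation 3.32. The gamma values are hypothetical, not fitted estimates.
import math

def pr_parsmoke(income, age, g01=-1.0, g11=-0.05, g21=0.04):
    logit = g01 + g11 * income + g21 * age  # log-odds that a parent smoked
    return 1.0 / (1.0 + math.exp(-logit))   # inverse logit -> probability

print(round(pr_parsmoke(income=5.0, age=25.0), 3))
```

Because this covariate model supports missing data handling rather than the substantive questions, the choice between a logistic and probit link here is arbitrary, as the text notes.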
Figure 3.5 shows a path diagram of the models in Equation 3.32. As explained in
Section 2.13, categorical variable models envision binary responses originating from
an underlying latent response variable that represents one’s underlying proclivity or
propensity to endorse the highest category (Agresti, 2012; Johnson & Albert, 1999). Fol-
lowing diagramming conventions from Edwards, Wirth, Houts, and Xi (2012), I use an


FIGURE 3.5. Path diagram of a three-­predictor regression model that links the predictors
via regressions rather than correlations. The broken arrow denotes a link function that maps
the underlying latent variable to the discrete scores. Parental smoking is a latent variable in its
regression on age but a binary predictor of smoking intensity.

oval and rectangle to differentiate the latent response variable and its binary indicator,
respectively, and the broken arrow connecting the two is the link function that maps
the unobserved continuum to the discrete responses (e.g., the broken arrow reflects the
idea that latent scores above and below the threshold convert to 1’s and 0’s, respectively).
The latent variable has a residual term, but its variance is a fixed constant (π2 ÷ 3, the
variance of the standard logistic distribution). The figure highlights that the parental
smoking indicator simultaneously exists in two forms: The latent response variable (the
logit) appears as the outcome in the categorical regression, and the binary indicator is
a predictor in the focal model. The path diagram is inconsistent with the classic struc-
tural equation model formulation that assigns a multivariate normal distribution to the
variables, but newer modeling frameworks accommodate mixtures of categorical and
continuous outcomes in a manner equivalent to factored regressions (Muthén et al.,
2016; Pritikin et al., 2018).
The factored regression model is conceptually straightforward, as it simply casts a
set of multivariate associations as a sequence of univariate regression models. However,
estimating the models is not so straightforward, because incomplete variables can appear
in multiple equations, as they do here (e.g., the parental smoking indicator is the depen-
dent variable in one equation and an explanatory variable in another). This means that
maximum likelihood’s “implicit imputation” procedure must simultaneously account
for a variable’s role in two or more distributions. Ibrahim (1990) and Lipsitz and Ibrahim
(1996) use a variant of the EM algorithm called “EM by the method of weights” to obtain
the maximum likelihood estimates. The algorithm uses a procedure known as numeri-
cal integration (Rabe-­Hesketh, Skrondal, & Pickles, 2004; Wirth & Edwards, 2007) to
replace each missing value with a weighted set of pseudo-­imputations. I sketch the main
ideas behind the procedure and point interested readers to Lüdtke et al. (2020a) for a
worked example.

As described earlier, EM’s E-step treats the observed data and parameter values at
iteration t as known constants and fills in the missing parts of the data with expectations
or averages. Ibrahim’s (1990) method of weights achieves this in an imputation-­esque
fashion by filling in each individual’s missing values with more than one replacement
score. However, unlike multiple imputation, all participants share a common, fixed grid
of replacement values or “nodes” that span the incomplete variable’s entire range. For
example, the smoking intensity scores range from 2 to 29, so the procedure could use a
fixed grid of integer pseudo-­imputations ranging from 0 to 30. Similarly, missing paren-
tal smoking indicator scores are imputed twice with support nodes of 0 and 1. The data
are stacked, such that participants with missing values have multiple rows, one per each
combination of pseudo-­imputations. The primary goal of the E-step is to weight each
row according to the likelihood of its data given the current parameter values. These
weights are derived by substituting observed scores and pseudo-­imputations into the
distribution functions depicted in Equation 3.31 and performing the multiplication pre-
scribed by the factorization (e.g., the weights for this example involve the product of a
Bernoulli likelihood for the binary variable and two normal likelihoods).
The M-step updates the parameters for the next iteration by finding the estimates
that maximize a weighted complete-­data log-­likelihood function that accounts for the
fact that some participants have multiple data records with different pseudo-­imputations
(e.g., using Newton’s algorithm; Ibrahim, 1990; Lipsitz & Ibrahim, 1996). For this exam-
ple, the M-step estimates two linear regressions (the focal model and the income model)
and a logistic regression (the parental smoking model) from a stacked data set contain-
ing the newly updated grid of weighted replacement scores. As described earlier, each
successive iteration of the EM algorithm gives better predictions about the missing val-
ues (which in this case are encoded as the weighted pseudo-­imputations), which in turn
improve the estimates, which in turn sharpen the missing values, and so on.
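A minimal sketch of the E-step weighting scheme may help fix ideas. It assumes a participant whose smoking intensity score is missing; because intensity appears only as the focal outcome, the other factors in the likelihood are constant across nodes and cancel after normalization. The predicted mean and residual variance are hypothetical values for some iteration t, not estimates from the analysis.

```python
import math

def estep_weights(nodes, mu, var):
    """E-step weights for one participant with a missing INTENSITY score.
    Each pseudo-imputation (node) is weighted by the likelihood of the
    row's data at the current parameter values. Illustrative sketch only,
    not the book's software."""
    dens = [math.exp(normal_term) for normal_term in
            (-0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)
             for y in nodes)]
    total = sum(dens)
    return [d / total for d in dens]

# Fixed grid of pseudo-imputations spanning the variable's range
nodes = list(range(0, 31))
# Hypothetical predicted mean and residual variance at iteration t
weights = estep_weights(nodes, mu=10.4, var=11.2)
# The weighted average over the grid is the node-based analogue of the
# expected value that EM substitutes for the missing score
implicit_imputation = sum(w * y for w, y in zip(weights, nodes))
```

Stacking one weighted row per node (or per node combination, when two or more variables are missing) reproduces the data layout that the M-step's weighted log-likelihood operates on.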

Analysis Example
To illustrate maximum likelihood missing data handling for multiple regression, I used
the structural equation model and factored regression approaches to estimate the smok-
ing intensity model from Equation 3.22. Figures 3.4 and 3.5 show the path diagrams.
I centered the income and age variables at their grand means to maintain
the intercept’s interpretation as the expected smoking intensity score for a respondent
whose parents did not smoke. Importantly, the smoking intensity distribution has sub-
stantial positive skewness and kurtosis, so I used robust (sandwich estimator) standard
errors and test statistics to compensate. Bootstrap resampling is an alternative approach
to addressing this problem. Analysis scripts are available on the companion website.
Table 3.6 shows the maximum likelihood estimates of the focal model parameters.
In the interest of space, I omit the regressor model estimates, because they are not the
substantive focus. The two approaches produced estimates and standard errors that
were equivalent to the third decimal. The performance of the classic structural equation
model might be surprising given that it implicitly imputes the binary predictor with
continuous normal scores, but computer simulation evidence suggests that this may be
fine in some situations (Muthén et al., 2016). This example is probably optimal, because

TABLE 3.6. SEM and Factored Regression Analysis Results

                         SEM                  Factored regression
Parameter                Est.      RSE        Est.      RSE
β0                       8.782     0.105      8.784     0.105
β1 (PARSMOKE)            2.657     0.177      2.655     0.177
β2 (INCOME)             –0.129     0.027     –0.129     0.027
β3 (AGE)                 0.584     0.038      0.584     0.038
σε²                     11.225     0.743     11.225     0.742

Note. RSE, robust standard errors; SEM, structural equation modeling.

the binary variable has similar category proportions, and larger differences could result
if group sizes are highly unbalanced. Importantly, the estimates from both models have
the same meaning. For example, the intercept (β̂0 = 8.78, SE = 0.11) is the expected num-
ber of cigarettes smoked per day for a respondent whose parents didn’t smoke, and the
parental smoking indicator slope (β̂1 = 2.66, SE = 0.18) is the mean difference, control-
ling for age and income.

3.7 SIGNIFICANCE TESTING

Maximum likelihood estimation offers three significance testing options: the Wald test
(Wald, 1943), likelihood ratio statistic (Wilks, 1938), and the score test or modifica-
tion index (Rao, 1948; Saris et al., 1987; Sörbom, 1989). All three procedures are appli-
cable to missing data, and their details are largely the same as those in Section 2.12.
For example, the Wald test requires parameter estimates and their variance–­covariance
matrix. We’ve already seen that the computation of the covariance matrix changes with
missing data (e.g., some cases contribute more information about the parameters than
others), but the composition and interpretation of the test statistic is otherwise identi-
cal to Equation 2.45. Similarly, corrective procedures for non-­normality have long been
available for missing data (Arminger & Sobel, 1990; Enders, 2002; Savalei, 2010; Savalei
& Yuan, 2009; Yuan, 2009b; Yuan & Bentler, 2000, 2010; Yuan & Zhang, 2012), and
there is a good deal of literature supporting their use (Enders, 2001; Savalei & Bentler,
2005; Savalei & Falk, 2014; Yuan, Tong, & Zhang, 2014; Yuan et al., 2012). Several
analysis examples in Chapter 10 illustrate significance tests and corrective procedures
for missing data.
Returning to the multiple regression model from Equation 3.22, I use the Wald test
and likelihood ratio statistic to evaluate the null hypothesis that R2 = 0. Both tests func-
tion like the omnibus F test from ordinary least squares in this context. To begin, the
Wald test standardizes discrepancies between the estimates and null values against the
covariance matrix of the parameter estimates, the diagonal of which contains squared
standard errors. The full covariance matrix is a 5 × 5 matrix, but the test uses only the
elements related to the slope coefficients. Equation 2.49 shows the composition of the
test statistic. The test statistic with Q = 3 degrees of freedom was statistically significant, TW = 519.43 (p < .001), from which we can conclude that at least one slope is nonzero. The normal-theory Wald test is not optimal for this example, because the smoking
intensity variable is substantially skewed and kurtotic. The test statistic based on the
sandwich estimator covariance matrix was markedly lower at TW = 449.32 (p < .001) but
gave the same conclusion.
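The computation can be sketched as follows. The slope estimates and the 3 × 3 covariance block below are hypothetical stand-ins rather than values from the analysis.

```python
import numpy as np

def wald_test(est, null, cov):
    """Omnibus Wald statistic: standardize the discrepancy between the
    estimates and their null values against the parameter covariance
    matrix. Returns the statistic and its degrees of freedom; compare the
    result to a chi-square critical value."""
    diff = np.asarray(est, dtype=float) - np.asarray(null, dtype=float)
    tw = float(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff))
    return tw, len(diff)

# Slope estimates and the 3 x 3 block of their covariance matrix
# (hypothetical numbers; the diagonal holds squared standard errors)
est = [2.66, -0.13, 0.58]
cov = [[0.0313, 0.0002, 0.0004],
       [0.0002, 0.0007, 0.0001],
       [0.0004, 0.0001, 0.0014]]
tw, q = wald_test(est, [0.0, 0.0, 0.0], cov)
significant = tw > 7.815  # chi-square critical value for q = 3, alpha = .05
```

Substituting the sandwich estimator covariance matrix for the normal-theory matrix yields the robust version of the test.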
The likelihood ratio statistic evaluates the same hypothesis but requires a nested
or restricted model that aligns with the null. This secondary model is an empty regres-
sion that fixes the three slope coefficients to zero. With complete data, you can get the
restricted model log-­likelihood by constraining the slope coefficients to zero during
estimation or by excluding the explanatory variables from the analysis. The latter option
generally isn’t valid with missing data, because the observed data are not constant in the
two models (i.e., incomplete explanatory variables have a distribution and thus contrib-
ute to the log-­likelihood). Rather, keeping the explanatory variables in the model and
constraining their coefficients to zero during estimation always gives the correct test
statistic. The appropriate nested model is as follows:

INTENSITYi = β0 + (0)(PARSMOKEi) + (0)(AGEi − μ2) + (0)(INCOMEi − μ3) + εi  (3.33)

Fitting the two models and applying Equation 2.46 produced a statistically significant
test statistic with Q = 3 degrees of freedom, TLR = 449.38 (p < .001). The validity of this
test is questionable given the non-­normal data, so I applied the Satorra–­Bentler rescaled
test statistic as a comparison (Satorra & Bentler, 1988, 1994). The rescaled test statistic
from Equation 2.47 was markedly lower at TSB = 325.34 (cLR = 1.38) but gave the same
conclusion.
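The arithmetic behind both statistics is simple: twice the difference in maximized log-likelihoods, divided by a scaling factor for the rescaled version. The log-likelihood values and scaling factor below are hypothetical illustrations, not results from the analysis.

```python
def likelihood_ratio(ll_full, ll_restricted, scale=1.0):
    """Likelihood ratio statistic: twice the difference in maximized
    log-likelihoods, optionally divided by a Satorra-Bentler-style scaling
    factor. With missing data, both fits must keep the incomplete
    predictors in the likelihood; the restricted fit simply fixes their
    slopes to zero (Equation 3.33)."""
    return 2.0 * (ll_full - ll_restricted) / scale

# Hypothetical maximized log-likelihoods for the full and restricted models
t_lr = likelihood_ratio(ll_full=-10234.2, ll_restricted=-10458.9)
t_sb = likelihood_ratio(-10234.2, -10458.9, scale=1.38)  # rescaled version
```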

3.8 INTERACTION EFFECTS

The emergence of maximum likelihood missing data-handling methods for interactive and nonlinear effects is an important recent development (Lüdtke et al., 2020a; Robitzsch & Lüdtke, 2021), as the classic estimator based on multivariate normality is known to introduce bias (Enders et al., 2014; Lüdtke et al., 2020a; Seaman et al., 2012; Zhang & Wang, 2017). Moderated regression models are ubiquitous analytic tools, particularly in the social and behavioral sciences (Aiken & West, 1991; Cohen, Cohen, West, & Aiken, 2002). A prototypical model features a focal predictor X, a moderator variable M, and the product of the two.
Zhang & Wang, 2017). Moderated regression models are ubiquitous analytic tools, par-
ticularly in the social and behavioral sciences (Aiken & West, 1991; Cohen, Cohen,
West, & Aiken, 2002). A prototypical model features a focal predictor X, a moderator
variable M, and the product of the two.

Yi = β0 + β1Xi + β2Mi + β3XiMi + εi  (3.34)

εi ~ N(0, σε²)
In this model, β1 is a conditional effect that reflects the influence of X when M equals
zero, and β2 is the corresponding conditional effect of M when X equals zero. The β3
coefficient is usually of particular interest, since it captures the change in the β1 slope
for a one-unit increase in M (i.e., the amount by which X’s influence on Y is moderated
by M).
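These conditional effects can be computed directly from the coefficients; the sketch below uses hypothetical values, not estimates from any analysis in this chapter.

```python
def simple_slope(b1, b3, m):
    """Conditional effect of the focal predictor X at moderator value m:
    Equation 3.34 implies dY/dX = beta1 + beta3 * m."""
    return b1 + b3 * m

# Hypothetical coefficients: X's effect shrinks by 0.25 per one-unit
# increase in M and reverses sign by M = 2
effects = {m: simple_slope(0.40, -0.25, m) for m in (0, 1, 2)}
```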

Switching gears to a different substantive context, I use the chronic pain data to illus-
trate a moderated regression analysis with an interaction effect. The data set includes psy-
chological correlates of pain severity (e.g., depression, pain interference with daily life, per-
ceived control) for a sample of N = 275 individuals with chronic pain (see Appendix). The
motivating question is whether gender moderates the influence of depression on psychoso-
cial disability, a construct capturing pain’s impact on emotional behaviors (e.g., psychologi-
cal autonomy and communication; emotional stability). The moderated regression model is

DISABILITYi = β0 + β1(DEPRESSi − μ1) + β2(MALEi)  (3.35)
+ β3(DEPRESSi − μ1)(MALEi) + β4(PAINi) + εi

where DISABILITY and DEPRESS are scale scores measuring psychosocial disability and
depression, MALE is a gender dummy code (0 = female, 1 = male), and PAIN is a binary
severe pain indicator (0 = no, little, or moderate pain, 1 = severe pain). I centered depres-
sion scores at their grand mean to facilitate interpretation. The disability and depression
scores have 9.1 and 13.5% of their scores missing, respectively, and approximately 7.3%
of the binary pain ratings are missing. By extension, 13.5% of the sample is also missing
the product term.

Just‑Another‑Variable Approach
The so-­called “just-­another-­variable” approach to estimating interactive and curvilinear
effects warrants a brief discussion, because it is easy to implement in the structural
equation modeling framework. The method should be a last resort, because it requires
an MCAR mechanism and is prone to substantial biases otherwise (Enders et al., 2014;
Lüdtke et al., 2020a; Zhang & Wang, 2017). The factored regression model is generally
a better choice, because it requires an MAR process.
As its name implies, the just-­another-­variable strategy treats the product term like
any other normally distributed predictor (von Hippel, 2009). To apply this method to the
moderated regression analysis in Equation 3.35, you would first compute a new variable
PRODUCT = DEPRESS × MALE, treating the result as missing when either of the com-
ponents is missing. The product would then function like any other variable in analysis.
The structural equation model for the chronic pain example represents the analysis with
four linear regression equations, each with normally distributed residual terms.

DISABILITYi = β0 + β1(DEPRESSi) + β2(MALEi) + β3(PRODUCTi) + β4(PAINi) + εi  (3.36)

DEPRESSi = γ01 + r1i
MALEi = γ02 + r2i
PRODUCTi = γ03 + r3i
PAINi = γ04 + r4i

The path diagram for this analysis is like the one in Figure 3.4.
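In code, the missing-propagation rule for the product term is automatic when missing values are coded as NaN; the scores below are hypothetical.

```python
import numpy as np

# Just-another-variable setup: compute the product up front, treating it as
# missing whenever either component is missing. With NaN codes, the
# propagation happens automatically because NaN * anything = NaN.
depress = np.array([1.2, np.nan, -0.5, 0.8])   # centered depression scores
male = np.array([0.0, 1.0, np.nan, 1.0])       # gender dummy code
product = depress * male                        # missing if either is missing
```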
The normality assumption is especially problematic here, because the product of
two random variables isn't normal (Craig, 1936; Lomnicki, 1967; Springer & Thompson, 1966), and the mean and variance of a product are deterministic functions of the
component variables (Aiken & West, 1991; Bohrnstedt & Goldberger, 1969). Seaman
et al. (2012) present analytic arguments showing that the just-­another-­variable strategy
is approximately unbiased when missingness is completely at random (i.e., does not
depend on the data), and they further show that the procedure is biased with a condi-
tionally MAR process. Several simulation studies support this conclusion (Enders et al.,
2014; Kim, Belin, & Sugar, 2018; Kim, Sugar, & Belin, 2015; Lüdtke et al., 2020a; Sea-
man et al., 2012; von Hippel, 2009; Zhang & Wang, 2017).

Factored Regression Specification


As mentioned earlier, an important advantage of the factored regression approach is that
it readily accommodates interactive or nonlinear effects, both in the focal analysis and
in the covariate models. Although the statistical theory is quite mature (Ibrahim, 1990;
Lipsitz & Ibrahim, 1996), software tools for estimating these models are relatively new
to the scene (Lüdtke et al., 2020a; Robitzsch & Lüdtke, 2021). In fact, we've already laid
the groundwork for specifying a factored regression with an interaction effect—­the focal
analysis model changes, but nearly everything else remains the same. Using generic
notation, the joint distribution of the four analysis variables factors into the following
product of univariate distributions:

f(Y, X, M, Z) = f(Y | X, M, X × M, Z) × f(X | M, Z) × f(M | Z) × f(Z)  (3.37)

The first term to the right of the equals sign is now the distribution induced by the mod-
erated regression, and the regressor models follow the same pattern as before. Impor-
tantly, the product term is not a variable with its own distribution but operates more like
a deterministic function of X or M, either of which could be missing.
Assigning the complete predictor to the rightmost term gives the following factor-
ization for the psychosocial disability analysis:

f(DISABILITY, DEPRESS, MALE, PAIN) =
f(DISABILITY | DEPRESS, MALE, DEPRESS × MALE, PAIN) ×  (3.38)
f(DEPRESS | PAIN, MALE) × f(PAIN* | MALE) × f(MALE*)
The first term to the right of the equals sign corresponds to the analysis model from
Equation 3.35, and the regressor models in the latter three terms translate into a linear
regression for depression, a logistic (or probit) model for the pain severity indicator,
and an empty logistic (or probit) model for the marginal distribution of gender (which
I ignore, because this variable is complete and does not require a distribution). An
asterisk superscript on the variable name denotes a latent response variable (logit or
probit).

DEPRESSi = γ01 + γ11(PAINi) + γ21(MALEi) + r1i  (3.39)

ln[Pr(PAINi = 1) / (1 − Pr(PAINi = 1))] = γ02 + γ12(MALEi)

Analysis Example
Continuing with the chronic pain example, I used the factored regression approach
to estimate the models in Equations 3.35 and 3.39. The procedure requires a grid of pseudo-imputations or nodes for numerical integration. The observed psychosocial disability
scores range from 6 to 32, so I specified a somewhat wider grid consisting of integers
between 0 and 40. The observed depression scores similarly range from 7 to 28, so I
used integer values from 0 to 35 for the pseudo-­imputations. Finally, the severe pain
indicator requires just two nodes, 0 and 1. The number of grid points is consistent with
recommendations from the literature (Skrondal & Rabe-­Hesketh, 2004), and using more
nodes had no impact on the analysis results. As explained previously, the EM algorithm
derives person-­specific weights for the pseudo-­imputations that quantify a node’s fit to
the observed data, and missing values are essentially replaced by a weighted average
over the entire grid of plausible values. Finally, I centered the depression scores at the
grand mean to enhance interpretability (Aiken & West, 1991; Cohen et al., 2002), and
doing so required a preliminary maximum likelihood analysis to estimate the mean
vector and covariance matrix of the analysis variables, as the mean of the available data
could be biased. Analysis scripts are available on the companion website.
Table 3.7 gives the parameter estimates from the analysis, and Figure 3.6 plots
the simple slopes for males and females. In the interest of space, Table 3.7 omits the
regressor model estimates, as they are not the substantive focus. Recall that lower-
order terms are conditional effects that depend on scaling; β̂1 = 0.38 (SE = 0.06) is the
effect of depression on psychosocial disability for female participants (the solid line
in Figure 3.6), and β̂2 = –0.79 (SE = 0.55) is the gender difference at the depression
mean (the vertical distance between lines at a value of zero on the horizontal axis).
The interaction effect captures the slope difference for males. The negative coefficient
(β̂3 = –0.24, SE = 0.09) indicates that the male depression slope (the dashed line) was
approximately 0.24 points lower than the female slope (i.e., the male slope is β̂1 + β̂3 =
0.38 – 0.24 = 0.14).
Although the just-­another-­variable approach is not advisable, you can use the struc-
tural equation framework to recast the moderated regression as a multiple-­group model
that features separate regressions for males and females. This approach isn’t always an

TABLE 3.7. Maximum Likelihood Estimates from the Moderated Regression

Parameter                Est.      SE       z         p
β0                       21.65     0.37     58.44     < .001
β1 (DEPRESS)              0.38     0.06      6.31     < .001
β2 (MALE)                –0.79     0.55     –1.44     .15
β3 (DEPRESS)(MALE)       –0.24     0.09     –2.69     .01
β4 (PAIN)                 1.92     0.61      3.17     < .001
σε²                      16.53     —        —         —
R²                         .22     —        —         —

FIGURE 3.6. Simple slopes (conditional effects for males and females) from the moderated
regression analysis example. The zero value on the horizontal axis corresponds to the mean.

option, but it is here, because the moderator variable is categorical and complete. The
multiple-­group specification features three regression equations per group.

DEPRESSi = γ01(F) + r1i(F)  (3.40)
PAINi = γ02(F) + r2i(F)
DISABILITYi = β0(F) + β1(F)(DEPRESSi) + β2(F)(PAINi) + εi(F)

DEPRESSi = γ01(M) + r1i(M)
PAINi = γ02(M) + r2i(M)
DISABILITYi = β0(M) + β1(M)(DEPRESSi) + β2(M)(PAINi) + εi(M)

The alphanumeric superscripts on the coefficients and residual terms indicate that
every parameter in the model can potentially differ by gender. The moderated regres-
sion is somewhat more restrictive, because it includes only group-­specific intercepts
and slopes. To align the multiple-­group model with the factored regression analysis, I
imposed between-­group equality constraints on all other parameters, but this specifica-
tion is optional.
Table 3.8 gives the multiple-­group model parameter estimates. Although it packages
the results differently, the multiple-­group estimates were quite like those of the factored
regression model. To highlight this linkage, I formed two contrasts that compared the

TABLE 3.8. Maximum Likelihood Multiple-Group Regression Estimates

Parameter                Est.      SE       z         p
Females
β0(F)                    21.48     0.37     58.32     < .001
β1(F) (DEPRESS)           0.38     0.06      6.34     < .001
β2(F) (PAIN)              1.92     0.60      3.20     < .001
σε²                      16.45     1.51     10.92     < .001

Males
β0(M)                    20.95     0.49     42.74     < .001
β1(M) (DEPRESS)           0.15     0.07      2.15     .03
β2(M) (PAIN)              1.92     0.60      3.20     < .001
σε²                      16.45     1.51     10.92     < .001

Group contrasts
β0(M) – β0(F)            –0.54     0.55     –0.98     .33
β1(M) – β1(F)            –0.24     0.09     –2.67     .01

groups’ intercept and slope coefficients. The two bottom rows of Table 3.8 show that the
intercept difference is similar to the β2 coefficient from the factored regression and the
slope difference is equal to the β3 interaction effect. Furthermore, the difference between
the male and female slopes was statistically significant, as it was in the factored regres-
sion model.

3.9 CURVILINEAR EFFECTS

The factored regression approach readily accommodates other types of nonlinear terms.
Curvilinear regression models with polynomial terms and incomplete predictors are an
important example. To illustrate, consider a prototypical polynomial regression model
that features a squared or quadratic term for X (i.e., the interaction of X with itself).

Yi = β0 + β1Xi + β2Xi² + εi  (3.41)

εi ~ N(0, σε²)
Like a moderated regression analysis, β1 is a conditional effect that captures the influ-
ence of X when X itself equals 0 (Aiken & West, 1991; Cohen et al., 2002). The β2 coef-
ficient is of particular interest, because it captures acceleration or deceleration (i.e., cur-
vature) in the trend line. For example, if β1 and β2 are both positive, the influence of X
on Y becomes more positive as X increases, whereas a positive β1 and a negative β2 imply
that X’s influence diminishes as X increases.
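The slope's dependence on X follows from the derivative of Equation 3.41, which can be computed directly; the coefficients below are hypothetical.

```python
def instantaneous_slope(b1, b2, x):
    """Instantaneous rate of change implied by Equation 3.41:
    dY/dX = beta1 + 2 * beta2 * X, so the slope itself changes with X."""
    return b1 + 2.0 * b2 * x

# Hypothetical coefficients: positive b1 with negative b2 means X's
# positive influence weakens and eventually reverses as X increases
slopes = [instantaneous_slope(0.5, -0.02, x) for x in (0.0, 10.0, 20.0)]
```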

To provide a substantive context, I use the math achievement data set from the
companion website that includes pretest and posttest math scores and several academic-­
related variables (e.g., math self-­efficacy, anxiety, standardized test scores, sociodemo-
graphic variables) for a sample of N = 250 students. The literature suggests that anxi-
ety could have a curvilinear relation with math performance, such that the negative
influence of anxiety on achievement worsens as anxiety increases. The model below
accommodates this nonlinearity while controlling for a binary indicator that measures
whether a student is eligible for free or reduced-­priced lunch (0 = no assistance, 1 = eli-
gible for free or reduced-­price lunch), math pretest scores, and a gender dummy code (0 =
female, 1 = male):

MATHPOSTi = β0 + β1(ANXIETYi − μ1) + β2(ANXIETYi − μ1)²  (3.42)
+ β3(FRLUNCHi) + β4(MATHPREi) + β5(MALEi) + εi

Anxiety scores are centered at their grand mean to facilitate interpretation. Approxi-
mately 16.8% of the posttest math scores and 8.8% of the math anxiety ratings are miss-
ing, as are 5.2% of the lunch assistance indicator codes.

Factored Regression Specification


The factored regression approach to curvilinear regression mimics the specification for
a moderated regression model. The joint distribution of the five analysis variables fac-
tors into the following product of univariate distributions:

f(MATHPOST, ANXIETY, FRLUNCH, MATHPRE, MALE) =
f(MATHPOST | ANXIETY, ANXIETY², FRLUNCH, MATHPRE, MALE) ×  (3.43)
f(ANXIETY | FRLUNCH, MATHPRE, MALE) ×
f(FRLUNCH* | MATHPRE, MALE) × f(MATHPRE | MALE) × f(MALE*)
The first term to the right of the equals sign is the normal distribution induced by
the curvilinear regression in Equation 3.42, and the regressor models translate into a
linear regression for anxiety and a logistic (or probit) model for the lunch assistance
indicator.

ANXIETYi = γ01 + γ11(FRLUNCHi) + γ21(MATHPREi) + γ31(MALEi) + r1i  (3.44)

ln[Pr(FRLUNCHi = 1) / (1 − Pr(FRLUNCHi = 1))] = γ02 + γ12(MATHPREi) + γ22(MALEi)

Following earlier examples, I ignore the last two terms in Equation 3.43, because the
pretest scores and gender dummy code are complete and do not require a distribution.
Finally, note that the squared term is not a variable with its own distribution and instead
operates more like a deterministic function of the incomplete anxiety scores.

TABLE 3.9. Maximum Likelihood Curvilinear Regression Estimates

Parameter                Est.      SE        z         p
β0                       41.94     3.44      12.19     < .001
β1 (ANXIETY)             –0.26     0.08      –3.10     < .001
β2 (ANXIETY²)            –0.01     0.005     –2.28     .02
β3 (FRLUNCH)             –3.65     1.04      –3.50     < .001
β4 (MATHPRE)              0.38     0.06       5.79     < .001
β5 (MALE)                –4.00     1.01      –3.95     < .001
σε²                      50.57     —         —         —
R²                         .41     —         —         —

Analysis Example
Continuing with the math achievement example, I used the factored regression approach
to estimate the curvilinear regression from Equation 3.42 and the supporting predictor
models. To enhance the interpretability of the estimates, I centered math anxiety at its
grand mean after first performing a maximum likelihood analysis to estimate the mean
vector and covariance matrix of the analysis variables. Recall that numerical integra-
tion requires a grid of pseudo-­imputations or nodes for any variable with a distribution.
The observed math scores range from 35 to 85, so I specified a somewhat wider grid
consisting of integers between 25 and 95. The observed anxiety scores range from 0 to
56, so I used integer values between –10 and 65 for the pseudo-­imputations. Again, the
number of grid points is consistent with recommendations from the literature (Skrondal
& Rabe-­Hesketh, 2004), and using more nodes had no impact on the analysis results.
Analysis scripts are available on the companion website.
Table 3.9 gives the parameter estimates from the analysis, and Figure 3.7 plots the
regression line (marginalizing over the covariates). In the interest of space, Table 3.9
omits the regressor model estimates, as they are not the substantive focus. Because of
centering, the lower-order anxiety slope (β̂1 = –0.26, SE = 0.08) reflects the influence of
this variable on math achievement at the anxiety mean (i.e., instantaneous rate of change
in the outcome when the predictor equals zero). The negative curvature coefficient (β̂2 =
–0.01, SE = 0.005) indicates that the anxiety slope became more negative as anxiety
increased. This interpretation is clear from the figure, where the regression function is
concave down.

3.10 AUXILIARY VARIABLES

A conditionally MAR mechanism is usually the default assumption for a maximum like-
lihood analysis. This process stipulates that whether a participant has missing values
depends strictly on observed data, and the unseen scores themselves are unrelated to

FIGURE 3.7. Estimated regression line from the curvilinear regression analysis. The predic-
tor variable, math anxiety, is centered at its grand mean. The zero value on the horizontal axis
corresponds to the mean.

missingness. In practical terms, the definition implies that the focal analysis model
should include all important correlates of missingness, as omitting such a variable could
result in a bias-­inducing MNAR-by-­omission process if the semipartial correlations are
strong enough. Section 1.5 described an inclusive analysis strategy that fine-tunes a
missing data analysis by introducing extraneous auxiliary variables into the model
(Collins et al., 2001; Schafer & Graham, 2002). Adopting such a strategy can reduce
nonresponse bias, improve precision, or both. The analysis example in Section 1.6 illus-
trated a method for selecting auxiliary variables, and this section describes strategies for
incorporating them into a maximum likelihood analysis.
I describe four broad strategies for introducing auxiliary variables, three of which
leverage the flexibility of the structural equation modeling framework. Graham (2003)
outlined two model specification strategies—­the saturated correlates and extra depen-
dent variable models—­that use a particular configuration of residual correlations and
regression slopes to connect the auxiliary variables to the focal analysis model. Two-
stage estimation is an alternative approach (Savalei & Bentler, 2009; Yuan & Bentler,
2000) that tackles the missing data in two steps. The first stage estimates the mean
vector and variance–­covariance matrix of a superset that includes analysis variables and
auxiliary variables, and the second stage uses a subset of these summary statistics as
input data for a complete-data analysis. Two-stage estimation is analogous to multiple imputation in the sense that it uses a preliminary analysis to fill in the missing data,
after which it estimates the focal model. In this context, the filled-­in data are the sum-
mary statistics needed to estimate the complete-­data model. The factored regression
specification is a final option that is well suited for analyses with interactions or nonlin-
ear effects or mixtures of categorical and continuous variables.
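The second stage's "input data" step amounts to extracting a sub-block of the stage-1 summary statistics, which can be sketched with simple matrix indexing; the stage-1 estimates below are hypothetical placeholders.

```python
import numpy as np

# Two-stage sketch: stage 1 estimates the saturated mean vector and
# covariance matrix for the focal variables plus auxiliary variables;
# stage 2 fits the complete-data model to the focal variables' sub-block.
names = ["Y", "X1", "X2", "AUX1", "AUX2"]
mu = np.array([10.0, 0.0, 0.0, 5.0, 3.0])
sigma = np.array([[4.0, 1.2, 0.8, 1.5, 0.9],
                  [1.2, 1.0, 0.3, 0.4, 0.2],
                  [0.8, 0.3, 1.0, 0.3, 0.1],
                  [1.5, 0.4, 0.3, 2.0, 0.6],
                  [0.9, 0.2, 0.1, 0.6, 1.5]])

# Stage 2 input: summary statistics for the analysis variables alone
keep = [names.index(v) for v in ("Y", "X1", "X2")]
mu_focal = mu[keep]
sigma_focal = sigma[np.ix_(keep, keep)]
```

The auxiliary variables influence the final estimates only through the stage-1 step, where they sharpen the filled-in summary statistics.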

Saturated Correlates Model


The saturated correlates model uses a series of correlations and residual correlations to
connect auxiliary variables to the focal variables and each other. Graham’s (2003) rules
for implementing the strategy are as follows: (1) correlate each auxiliary variable with all
explanatory variables, (2) correlate each auxiliary variable with all other auxiliary vari-
ables, (3) correlate each auxiliary variable with the residual terms of all manifest out-
come variables, and (4) do not correlate auxiliary variables with latent variables or their
residual terms. The term saturated correlates owes to the fact that the model includes
every possible correlation between an auxiliary variable and other manifest variables.
Figure 3.8 shows a path diagram of a multiple regression model with two auxiliary
variables, and Figure 3.9 shows the specification for a latent variable model. Notice that
both models use curved arrows (covariances or residual covariances) to connect the
auxiliary variable to the other variables. Importantly, these models don’t alter the inter-
pretation of the focal parameters. For example, in Figure 3.8, the arrow connecting X1 to
Y is simply a partial regression slope that controls for the other two regressors. While an
auxiliary variable doesn’t change the story about the data, it could impact the numerical
values of the estimates and standard errors (usually for the better, which is the point of

[Diagram: X1, X2, X3, A1, A2 joined by curved arrows]

FIGURE 3.8. Path diagram of a saturated correlates regression model with two auxiliary vari-
ables. The model uses curved arrows (covariances or residual covariances) to connect the auxil-
iary variables to each other and the residuals of all other variables.
Maximum Likelihood Estimation with Missing Data 135

[Diagram: latent factor with indicators Y1, Y2, Y3; auxiliary variables A1, A2]

FIGURE 3.9. Path diagram of a latent variable saturated correlates model with two auxiliary
variables. The model uses curved arrows (covariances or residual covariances) to connect the
auxiliary variables to the residuals of all other manifest variables.

including them). For path or structural equation models, it is also worth noting that the
saturated correlates model doesn’t affect fit, because it “spends” the degrees of freedom
from the additional variables.
The saturated correlates model is prone to convergence failures, especially with
more than a few auxiliary variables (Graham, Cumsille, & Shevock, 2013; Howard et
al., 2015). Among other reasons, estimation problems can occur, because the model
imposes an awkward pattern on the residual covariance matrix that induces implau-
sible variances and covariances (Savalei & Bentler, 2009). My own experience suggests that estimation usually tolerates a relatively small number of auxiliary variables (e.g., three to five), but that convergence failures become increasingly likely as the number of variables grows.
This isn’t a major practical limitation, because it is often difficult to identify more than
one or two auxiliary variables that explain a meaningful amount of unique variation in
the analysis variables. In situations where it is necessary or desirable to leverage many
extra variables, Howard et al. (2015) describe a strategy that uses principal components
analysis to reduce a large set of auxiliary variables into a smaller number of linear com-
posites. Computer simulation results suggest that a single principal component or linear
combination can effectively replace an entire set of auxiliary variables. If the auxiliary
variables themselves are incomplete, a single imputation method (e.g., stochastic regres-
sion imputation, a single data set from multiple imputation) can fill in the missing val-
ues prior to data reduction.
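Howard et al.'s (2015) data-reduction strategy is straightforward to prototype. The numpy sketch below (the simulated data and function name are my own, not taken from the article) extracts the first principal component from a standardized set of auxiliary variables via the singular value decomposition:

```python
import numpy as np

def first_principal_component(aux):
    """Return first principal component scores for a complete
    (or singly imputed) set of auxiliary variables."""
    z = (aux - aux.mean(axis=0)) / aux.std(axis=0)  # standardize columns
    # The first right-singular vector of the standardized data holds
    # the component weights
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[0]

# Simulate four auxiliary variables that share a common source of variance
rng = np.random.default_rng(1)
common = rng.normal(size=200)
aux = np.column_stack(
    [common + rng.normal(scale=0.5, size=200) for _ in range(4)]
)
composite = first_principal_component(aux)
# A single composite tracks the shared variance in the full set
print(abs(np.corrcoef(composite, common)[0, 1]) > 0.8)  # True
```

Consistent with the simulation results described above, the single composite carries nearly all of the information in the four correlated auxiliary variables.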

Extra Dependent Variable Model and Extensions


Graham (2003) also outlined an extra dependent variable model that introduces aux-
iliary variables as additional outcomes. The rules for constructing the model are some-
what different than they were for the saturated correlates model: (1) Each auxiliary
variable is regressed on all explanatory variables (manifest or latent); (2) each auxil-
iary variable connects to each outcome (manifest or latent) via a correlated residual;
and (3) each pair of auxiliary variables connects via a correlated residual. Figure 3.10
shows a path diagram of a multiple regression model with two auxiliary variables as
extra outcomes, and Figure 3.11 shows a similar model with a latent dependent variable.
Like the saturated correlates specification, these models do not alter the meaning of
the complete-­data analysis, because there are no additional straight arrows (regression
slopes) pointing to the outcome. However, the latent variable model in Figure 3.11 does
contribute additional degrees of freedom, because it uses an indirect link via the latent
factor to connect the auxiliary variable to Y1 and Y2 (i.e., the model places restrictions on
the associations between the auxiliary variable and the manifest indicators).

Two‑Stage Estimation
Two-stage estimation is an alternative approach to introducing auxiliary variables into
a structural equation model (Cai & Lee, 2009; Savalei & Bentler, 2009; Savalei & Falk,
2014; Savalei & Rhemtulla, 2017; Yuan & Bentler, 2000; Yuan et al., 2014). As the name
implies, the procedure requires two steps. The first stage treats the missing data by
estimating the mean vector and variance–­covariance matrix of a superset of variables
that includes auxiliary variables and the focal analysis variables. The top panel of Figure
3.12 shows the first-stage missing data model for a multiple regression analysis with two
auxiliary variables. The advantage of this preliminary step is that the saturated model

[Diagram: X1, X2, X3 with directed arrows to Y; A1 and A2 as extra outcomes]

FIGURE 3.10. Path diagram of a regression model that incorporates two auxiliary variables
as extra dependent variables. The model uses a combination of directed and curved arrows to
connect the auxiliary variables to other variables.

[Diagram: X predicting A1, A2, and a latent factor with indicators Y1, Y2, Y3]

FIGURE 3.11. Path diagram of a latent variable regression model that incorporates two auxil-
iary variables as extra dependent variables. The model uses a combination of directed and curved
arrows to connect the auxiliary variables to other variables.

should be easy to estimate, even with many extra variables. We already know how to
estimate μ and Σ with missing data, so there is nothing new about the initial stage. The
second step ignores the auxiliary variables and uses a subset of the estimates in μ̂ and Σ̂
as input data for the focal analysis shown in the bottom panel of Figure 3.12.
The second estimation stage incorrectly assumes that the summary statistics came
from a data set with N complete cases. This has no bearing on the estimates, but the
standard errors are too small, because they fail to account for the missing information.
Yuan and Bentler (2000) outlined a sandwich estimator covariance matrix that fixes
this problem, and Savalei and Bentler (2009) extend this solution to an MAR process.
For completeness, the rest of this section describes the correction procedure for the
standard errors. Readers who aren’t interested in these finer points can skip to the next
section without losing important information.
A simple regression model is sufficient for describing the standard error adjust-
ment. Equation 3.24 gives the structural equations for the second-­stage regression anal-
ysis, and Equation 3.25 shows the model-­predicted or model-­implied moments, μ(θ) and
Σ(θ). Maximum likelihood estimation from summary data finds the regression model
parameters in θ that minimize the difference between the first-stage estimates in μ̂ and
Σ̂ and the model-­implied moments in μ(θ) and Σ(θ). The classic maximum likelihood
discrepancy function below is found in most structural equation modeling texts (e.g.,
Bollen, 1989; Kaplan, 2009).

$$f\left(\theta \mid \hat{\mu}, \hat{\Sigma}\right) = \left(\hat{\mu} - \mu(\theta)\right)'\,\Sigma^{-1}(\theta)\left(\hat{\mu} - \mu(\theta)\right) + \left\{-\ln\left|\hat{\Sigma}\Sigma^{-1}(\theta)\right| + \mathrm{tr}\left(\hat{\Sigma}\Sigma^{-1}(\theta)\right) - V\right\} \tag{3.45}$$

The expression features an offset term in curly braces that makes the function return a value of zero when the model-­implied moments perfectly match the sample estimates (as they do in this example), but the optimization process is fundamentally the same as maximizing a log-­likelihood function.

[Diagram. Stage 1 (saturated model): Y, X1, X2, A1, A2. Stage 2 (focal model): Y regressed on X1 and X2.]

FIGURE 3.12. Path diagrams for two-stage estimation. The first-stage diagram that includes the auxiliary variables is a saturated model consisting of means, variances, and covariances. The second-­stage (focal) model is a multiple regression.
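The discrepancy function is simple to express in code. The numpy sketch below is a direct translation of Equation 3.45 (the input values are made up); evaluating the function at the first-stage estimates themselves shows the offset term driving it to zero:

```python
import numpy as np

def ml_discrepancy(mu_hat, sigma_hat, mu_model, sigma_model):
    """Summary-data maximum likelihood discrepancy (Equation 3.45)."""
    sigma_inv = np.linalg.inv(sigma_model)
    diff = mu_hat - mu_model
    mean_part = diff @ sigma_inv @ diff
    # Offset term in curly braces: equals zero when the model-implied
    # moments reproduce the first-stage estimates exactly
    offset = (
        -np.log(np.linalg.det(sigma_hat @ sigma_inv))
        + np.trace(sigma_hat @ sigma_inv)
        - len(mu_hat)
    )
    return mean_part + offset

mu_hat = np.array([4.3, 2.1])
sigma_hat = np.array([[1.9, 0.4], [0.4, 1.2]])

# A saturated model reproduces the estimates perfectly, so f is zero
perfect = ml_discrepancy(mu_hat, sigma_hat, mu_hat, sigma_hat)
print(abs(perfect) < 1e-10)  # True

# Any misfit in the model-implied moments makes the function positive
print(ml_discrepancy(mu_hat, sigma_hat, mu_hat + 0.5, sigma_hat) > 0)  # True
```

In a real second-stage analysis, an optimizer would search over the structural parameters in θ to minimize this function.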
Although they aren’t the parameters of substantive interest, the variance–­covariance
matrix of the model-­implied mean vector and covariance matrix is a good place to start,
because the precision of these estimates is key to understanding how the sandwich esti-
mator works. Recall that second derivatives are stored in a symmetrical matrix known
as the Hessian. Multiplying this derivative matrix by –1 gives the information matrix
shown below (it is expected information here, because the second stage does not use the
raw data):
$$I_E(\mu(\theta), \Sigma(\theta)) = -H_E(\mu(\theta), \Sigma(\theta)) = -\begin{bmatrix} \dfrac{\partial^2 LL}{\partial \mu^2(\theta)} & 0 \\ 0 & \dfrac{\partial^2 LL}{\partial \Sigma^2(\theta)} \end{bmatrix} \tag{3.46}$$
The diagonal blocks of the second derivative matrix are as follows:

$$\frac{\partial^2 LL}{\partial \mu^2(\theta)} = -N\,\Sigma^{-1}(\theta) \tag{3.47}$$

$$\frac{\partial^2 LL}{\partial \Sigma^2(\theta)} = -\frac{N}{2}\,\mathbf{D}_V'\left(\Sigma^{-1}(\theta) \otimes \Sigma^{-1}(\theta)\right)\mathbf{D}_V \tag{3.48}$$

The important thing about the derivative equations is that every participant contributes
the same amount of information regardless of missing data pattern; that is, instead of
summing N individual contributions to the derivative expressions, some of which con-
tain zeros (e.g., Equations 3.12 to 3.14), the equations invoke a constant contribution for
all N participants. As a result, some of the elements in the information matrix will be
too large, and taking the inverse gives sampling variances and standard errors that are
too small.
Revisiting ideas from Section 3.6, pre- and postmultiplying the information matrix
by a coefficient matrix Δ and then taking the inverse (the matrix analogue of a recipro-
cal) gives the variance–­covariance matrix of the regression model parameters.

$$\hat{\Sigma}_{\hat{\theta}} = \left(\hat{\Delta}'\, I_E\!\left(\mu(\hat{\theta}), \Sigma(\hat{\theta})\right)\hat{\Delta}\right)^{-1} \tag{3.49}$$

Recall that the matrix Δ contains weights or coefficients that capture the amount by
which the model-­implied moments change as a function of the regression model param-
eters (see Table 3.5), and pre- and postmultiplying by Δ reapportions the data’s infor-
mation about the model-­implied mean vector and covariance matrix to the appropri-
ate structural model parameters. Again, the elements in Σ̂_θ̂ are too small, because they incorrectly assume the data are complete, so taking the square root of the diagonal elements gives standard errors that overstate precision.
The two-stage standard error correction uses a sandwich estimator formulation like the one for non-­normal data back in Section 2.8. In this application, the biased covariance matrix from Equation 3.49 forms the outer pieces of “bread,” and the “meat” of the sandwich inflates the outer terms to compensate for missing information. The two-stage variance–­covariance matrix of the estimates from Savalei and Bentler (2009) is

$$\hat{\Sigma}_{\hat{\theta}(TS)} = \hat{\Sigma}_{\hat{\theta}} \times \left\{\hat{\Delta}'\, I_E\!\left(\mu(\hat{\theta}), \Sigma(\hat{\theta})\right)\left(I_O(\hat{\mu}, \hat{\Sigma})\right)^{-1} I_E\!\left(\mu(\hat{\theta}), \Sigma(\hat{\theta})\right)\hat{\Delta}\right\} \times \hat{\Sigma}_{\hat{\theta}} \tag{3.50}$$

where IO(μ̂, Σ̂) is the observed information matrix from the first stage. The equation
is really complicated, but we can deconstruct the basic ideas. Focusing on the “meat”
inside the curly braces, IE(μ(θ̂), Σ(θ̂)) and IO(μ̂, Σ̂) should be identical with complete
data, because they estimate the same information matrix, albeit in different ways. In this
situation, their product sets off a cascade of identity matrices that simplifies the expression to Equation 3.49. With missing data, premultiplying the inverse of IO(μ̂, Σ̂)
by IE(μ(θ), Σ(θ)) returns a matrix with diagonal values that represent the proportional
increase in information going from incomplete to complete data (e.g., a value of 1.20
means that the complete-­data information for a given parameter is about 20% larger
than that of the observed data). This matrix, which effectively captures how far off the
elements in Σ̂_θ̂ are in proportional terms, is the crux of the adjustment, and the other
multiplications just distribute the correction terms to the structural model’s standard
errors.
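The mechanics of Equations 3.49 and 3.50 can be checked with small toy matrices. The numpy sketch below (hypothetical numbers, diagonal information matrices for simplicity) verifies the two claims above: with complete data the sandwich collapses to the naive covariance matrix, and with missing data the corrected sampling variances are inflated.

```python
import numpy as np

def two_stage_cov(naive_cov, delta, info_expected, info_observed):
    """Two-stage sandwich covariance matrix (Equation 3.50)."""
    meat = (delta.T @ info_expected @ np.linalg.inv(info_observed)
            @ info_expected @ delta)
    return naive_cov @ meat @ naive_cov

# Toy inputs: 3 model-implied moments, 2 structural parameters
delta = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 0.3]])
info_e = np.diag([4.0, 2.0, 5.0])  # expected (complete-data) information
naive = np.linalg.inv(delta.T @ info_e @ delta)  # Equation 3.49

# Complete data: observed and expected information agree, and the
# "cascade of identity matrices" collapses the sandwich to Equation 3.49
complete = two_stage_cov(naive, delta, info_e, info_e)
print(np.allclose(complete, naive))  # True

# Missing data: the observed information is smaller for one moment,
# so the corrected standard errors are larger than the naive ones
info_o = np.diag([4.0, 1.5, 5.0])
corrected = two_stage_cov(naive, delta, info_e, info_o)
print(np.all(np.diag(corrected) >= np.diag(naive)))
```

The diagonal of the ratio matrix I_E I_O⁻¹ in this toy example is (1, 1.33, 1), mirroring the proportional increase in information described in the text.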

Factored Regression
The factored regression framework provides a final method for introducing auxil-
iary variables. As you know, this method factorizes a multivariate distribution into a
sequence of univariate distributions, each of which corresponds to a regression model.
To maintain the desired interpretation of the focal model parameters, it is important to
specify a sequence where the analysis variables predict the auxiliary variables and not vice versa.

[Diagram: X1, X2, X3, Y, A1, A2 joined by directed arrows]

FIGURE 3.13. Path diagram of a factored regression model with two auxiliary variables.

To illustrate, the factorization for a multiple regression model with a pair of
auxiliary variables is as follows:

$$f(A_1 \mid A_2, Y, X_1, X_2, X_3) \times f(A_2 \mid Y, X_1, X_2, X_3) \times f(Y \mid X_1, X_2, X_3) \times f(X_1 \mid X_2, X_3) \times f(X_2 \mid X_3) \times f(X_3) \tag{3.51}$$

The first two terms are the auxiliary variable regression models, the third term is the
focal model, and the final three terms are the regressor models. The path diagram in
Figure 3.13 shows that the factored regression is similar in spirit to the extra dependent
variable model but replaces curved arrows with straight arrows. This specification is
especially useful for leveraging categorical auxiliary variables, because it isn’t straight-
forward to marry logistic or probit models with the correlated residuals in Graham’s
structural equation models. The strategy is also ideally suited for analyses with inter-
actions or nonlinear effects with incomplete data, as conventional structural equation
models (e.g., the just-­another-­variable model) generally do a poor job preserving these
effects.
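The logic behind the factorization is easy to verify numerically. For a bivariate normal distribution with made-up moments, the product f(Y | X) × f(X), with the conditional taken from the usual linear regression of Y on X, reproduces the joint density exactly:

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Univariate normal density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Bivariate normal with made-up moments
mu_x, mu_y = 1.0, 2.0
var_x, var_y, cov_xy = 2.0, 3.0, 1.2
x, y = 0.7, 2.5  # an arbitrary data point

# Factored form: f(y | x) * f(x), where the conditional distribution
# comes from the linear regression of Y on X
slope = cov_xy / var_x
cond_mean = mu_y + slope * (x - mu_x)
cond_var = var_y - cov_xy ** 2 / var_x
factored = normal_pdf(y, cond_mean, cond_var) * normal_pdf(x, mu_x, var_x)

# Joint bivariate normal density evaluated directly
sigma = np.array([[var_x, cov_xy], [cov_xy, var_y]])
dev = np.array([x - mu_x, y - mu_y])
joint = np.exp(-0.5 * dev @ np.linalg.inv(sigma) @ dev) / (
    2 * np.pi * np.sqrt(np.linalg.det(sigma)))

print(np.isclose(factored, joint))  # True: the two routes agree
```

The same identity extends to longer sequences like Equation 3.51, which is why the chain of univariate regression models carries the same information as the multivariate distribution it replaces.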

Analysis Example
This example uses the psychiatric trial data on the companion website to illustrate max-
imum likelihood missing data handling with auxiliary variables. The data, which were
collected as part of the National Institute of Mental Health Schizophrenia Collaborative
Study, comprise four illness severity ratings, measured in half-point increments ranging
from 1 (normal, not at all ill) to 7 (among the most extremely ill). In the original study, the
437 participants were assigned to one of four experimental conditions (a placebo condi-
tion and three drug regimens), but the data collapse these categories into a dichotomous
treatment indicator (DRUG = 0 for the placebo group, and DRUG = 1 for the combined
medication group). The researchers collected a baseline measure of illness severity prior
to randomizing participants to conditions, and they obtained follow-­up measurements
1 week, 3 weeks, and 6 weeks later. The overall missing data rates for the repeated mea-
surements were 1, 3, 14, and 23%.
The focal regression model predicts illness severity ratings at the 6-week follow-­up
assessment from baseline severity ratings, gender, and the treatment indicator.

$$\mathrm{SEVERITY6}_i = \beta_0 + \beta_1(\mathrm{DRUG}_i) + \beta_2(\mathrm{SEVERITY0}_i - \mu_2) + \beta_3(\mathrm{MALE}_i - \mu_3) + \varepsilon_i \tag{3.52}$$

$$\varepsilon_i \sim N_1\!\left(0, \sigma^2_\varepsilon\right)$$
Centering the baseline scores and male dummy code at their grand means facilitates
interpretation, as this defines β0 and β1 as the placebo group average and group mean
difference, respectively, marginalizing over the covariates.
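This interpretation argument is easy to check by simulation. In the numpy sketch below (hypothetical population values loosely patterned on the example), grand-mean centering the covariates makes the fitted intercept equal the covariate-adjusted placebo mean and leaves the DRUG slope as the adjusted group difference:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
drug = rng.integers(0, 2, size=n).astype(float)
severity0 = rng.normal(5.3, 1.0, size=n)
male = rng.integers(0, 2, size=n).astype(float)
# Generate the outcome from known (hypothetical) coefficients
y = 4.4 - 1.5 * drug + 0.3 * severity0 + 0.2 * male + rng.normal(0, 1.4, size=n)

# Center the covariates (but not the treatment dummy) at their grand means
X = np.column_stack([np.ones(n), drug,
                     severity0 - severity0.mean(), male - male.mean()])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# beta[0] recovers the covariate-adjusted placebo mean;
# beta[1] recovers the adjusted group mean difference (about -1.5)
placebo_mean, group_diff = beta[0], beta[1]
print(round(placebo_mean, 2), round(group_diff, 2))
```

Without centering, the intercept would instead be the expected score for a placebo participant with baseline severity and gender codes of zero, a point outside the data range.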
This small data set offers limited choices for auxiliary variables, but the illness
severity ratings at the 1-week and 3-week follow-­up assessments are excellent candi-
dates, because they have strong semipartial correlations with the dependent variable
(r = .40 and .61, respectively) and uniquely predict its missingness. I used the saturated
correlates, extra dependent variable, and factored regression approaches to incorporate
these variables into the analysis. The saturated correlates model was identical to the
path diagram in Figure 3.8, and the extra dependent variable mimicked Figure 3.10.
The factored regression strategy again features a sequence of univariate distributions, each corresponding to a regression model, ordered so that the analysis variables predict the auxiliary variables rather than vice versa. The factorization for this analysis is as follows:

$$\begin{aligned}
&f(\mathrm{SEVERITY3} \mid \mathrm{SEVERITY1}, \mathrm{SEVERITY6}, \mathrm{SEVERITY0}, \mathrm{DRUG}, \mathrm{MALE}) \times \\
&f(\mathrm{SEVERITY1} \mid \mathrm{SEVERITY6}, \mathrm{SEVERITY0}, \mathrm{DRUG}, \mathrm{MALE}) \times \\
&f(\mathrm{SEVERITY6} \mid \mathrm{SEVERITY0}, \mathrm{DRUG}, \mathrm{MALE}) \times f(\mathrm{SEVERITY0} \mid \mathrm{DRUG}, \mathrm{MALE}) \times \\
&f(\mathrm{DRUG}^* \mid \mathrm{MALE}) \times f(\mathrm{MALE}^*)
\end{aligned} \tag{3.53}$$
The first two terms are auxiliary variable distributions that derive from linear regression
models, the third term corresponds to the focal linear regression, the fourth term is a
linear regression model for the incomplete baseline scores, and the final two terms are
regressions for the complete predictors (which I ignored, because these variables do not
require distributions). The full complement of regression models is shown below, and
analysis scripts are available on the companion website.

$$\begin{aligned}
\mathrm{SEVERITY3}_i &= \gamma_{02} + \gamma_{12}(\mathrm{SEVERITY1}_i) + \gamma_{22}(\mathrm{SEVERITY6}_i) + \gamma_{32}(\mathrm{DRUG}_i) \\
&\quad + \gamma_{42}(\mathrm{SEVERITY0}_i) + \gamma_{52}(\mathrm{MALE}_i) + r_{2i} \\
\mathrm{SEVERITY1}_i &= \gamma_{01} + \gamma_{11}(\mathrm{SEVERITY6}_i) + \gamma_{21}(\mathrm{SEVERITY0}_i) + \gamma_{31}(\mathrm{DRUG}_i) + \gamma_{41}(\mathrm{MALE}_i) + r_{1i} \\
\mathrm{SEVERITY6}_i &= \beta_0 + \beta_1(\mathrm{DRUG}_i) + \beta_2(\mathrm{SEVERITY0}_i - \mu_2) + \beta_3(\mathrm{MALE}_i - \mu_3) + \varepsilon_i \\
\mathrm{SEVERITY0}_i &= \gamma_{03} + \gamma_{13}(\mathrm{DRUG}_i) + \gamma_{23}(\mathrm{MALE}_i) + r_{3i}
\end{aligned} \tag{3.54}$$

TABLE 3.10. Maximum Likelihood Regression Estimates with and without Auxiliary Variables

                       No AVs           AVs
Effect              Est.    SE      Est.    SE
β0                  4.29    0.16    4.41    0.16
β1 (DRUG)          –1.24    0.19   –1.46    0.18
β2 (SEVERITY0)      0.31    0.09    0.28    0.09
β3 (MALE)           0.21    0.15    0.23    0.15
σ²ε                 1.89    0.11    2.00    0.12
R²                   .16     .04     .19     .04

Table 3.10 gives maximum likelihood estimates and standard errors with and without auxiliary variables. In the interest of space, I omit the auxiliary variable and covariate model parameters, because they are not the substantive focus. As you might expect,
the three auxiliary variable models produced identical results (to the third decimal), so
Table 3.10 reports a single set of results. As explained previously, introducing auxiliary
variables does not affect the interpretation of the focal model parameters; the intercept coefficient is the placebo group mean at the 6-week follow-up (β̂0 = 4.41, SE = 0.16), and the
treatment assignment slope is the group mean difference for the medication condition
(β̂1 = –1.46, SE = 0.18), controlling for covariates.
Conditioning on the auxiliary variables had a substantial impact on the numerical values of key parameter estimates. In particular, the intercept coefficients (placebo group means) from the two analyses differed by nearly three-fourths of a standard error unit, and the slope coefficients (medication group mean differences) differed by
nearly 1.2 standard errors. Although the natural inclination is to favor the analysis with
auxiliary variables, there is no way to know for sure which is more correct, because
conditioning on the wrong set of variables can exacerbate nonresponse bias, at least
hypothetically (Thoemmes & Rose, 2014). Nevertheless, the differences are consistent
with the shift from an MNAR-by-­omission mechanism to a more MAR-like process. The
fact that the auxiliary variables have strong semipartial correlations with the dependent
variable (r = .40 and .61, respectively) and uniquely predict its missingness reinforces
this conclusion.
In my experience, the results in Table 3.10 are probably on the optimistic side of
what you might expect to see in practice. This example benefited from two variables that
explained a substantial proportion of unique variation above and beyond that already
captured by the focal model. As explained in Section 1.5, this net covariation is what
makes an auxiliary variable useful, and its bivariate associations with the analysis vari-
ables are less diagnostic in this regard. The impact of auxiliary variables also depends on
the amount of missing data. If the analysis variables have relatively small missing data
rates, you might expect to see little to no change in the estimates after adding auxiliary
variables, whereas there is more to gain with high missing data rates. Regardless, even
small gains are worthwhile given the ease with which you can include a few additional
variables.

3.11 CATEGORICAL OUTCOMES

The factored regression framework provides a straightforward way to integrate categorical and continuous variables into a maximum likelihood analysis. I applied this strategy
earlier to a linear regression with an incomplete binary predictor (see Section 3.6), and
in this section consider a binary outcome with missing data. Returning to the employee
data set from earlier in the chapter, I used maximum likelihood estimation to fit probit
and logistic regression models that use leader–­member exchange, employee empower-
ment, and a male dummy code (0 = female, 1 = male) to predict a binary measure of turn-
over intention (TURNOVER = 0 if an employee has no plan to leave her or his position,
and TURNOVER = 1 if the employee has intentions of quitting). The probit and logistic
regression models are as follows:

$$\mathrm{TURNOVER}^*_i = \beta_0 + \beta_1(\mathrm{LMX}_i) + \beta_2(\mathrm{EMPOWER}_i) + \beta_3(\mathrm{MALE}_i) + \varepsilon_i \tag{3.55}$$

$$\ln\!\left(\frac{\Pr(\mathrm{TURNOVER}_i = 1)}{1 - \Pr(\mathrm{TURNOVER}_i = 1)}\right) = \beta_0 + \beta_1(\mathrm{LMX}_i) + \beta_2(\mathrm{EMPOWER}_i) + \beta_3(\mathrm{MALE}_i)$$

The probit model’s residual variance is fixed at 1 for identification, and the model addi-
tionally incorporates a fixed threshold parameter that divides the latent response vari-
able distribution into two segments. The logistic regression can also be viewed as a
latent response model, but it is typical to write the equation without a residual. Note
that I use β’s to represent focal model parameters, but the estimated coefficients will not
be the same (with complete data, logit coefficients are approximately 1.7 times larger
than probit coefficients; Birnbaum, 1968). Finally, approximately 5.1% of the turnover
intention scores are missing, and the leader–­member exchange and employee empower-
ment scales have 4.1 and 16.2% missing data rates (see Appendix).
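The 1.7 scaling factor reflects the close agreement between the logistic function and a stretched normal CDF, which you can confirm with a few lines of standard-library Python:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logistic(z):
    return 1 / (1 + math.exp(-z))

# Compare Phi(z) with the logistic curve evaluated at 1.7 * z
grid = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(normal_cdf(z) - logistic(1.7 * z)) for z in grid)
print(max_gap)  # the two curves differ by less than about .01 everywhere
```

Because the two link functions are nearly proportional, probit and logistic slopes differ mainly by this scale factor, which is why the two models almost always lead to the same substantive conclusions.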
As explained previously, factorizing the multivariate distribution such that incom-
plete predictors come before complete regressors simplifies estimation, because the lat-
ter terms can be ignored. Applying this strategy to the employee turnover analysis gives
the following factorization:
$$f(\mathrm{TURNOVER} \mid \mathrm{LMX}, \mathrm{EMPOWER}, \mathrm{MALE}) \times f(\mathrm{LMX} \mid \mathrm{EMPOWER}, \mathrm{MALE}) \times f(\mathrm{EMPOWER} \mid \mathrm{MALE}) \times f(\mathrm{MALE}^*) \tag{3.56}$$
The first term (the employee turnover distribution) corresponds to the probit or logistic
regression in Equation 3.55, and the supporting covariate models are linear regressions
with normally distributed residuals, as follows:

$$\mathrm{LMX}_i = \gamma_{01} + \gamma_{11}(\mathrm{EMPOWER}_i) + \gamma_{21}(\mathrm{MALE}_i) + r_{1i} \tag{3.57}$$

$$\mathrm{EMPOWER}_i = \gamma_{02} + \gamma_{12}(\mathrm{MALE}_i) + r_{2i}$$

Following earlier examples, I drop the rightmost term in Equation 3.56, because the
gender dummy code is complete and does not require a distribution (i.e., this variable
functions as a known constant).

As described earlier, the EM algorithm uses a procedure known as numerical integration to fill in the missing parts of the data in an imputation-­esque fashion. The E-step fills
in everyone’s missing values with a fixed grid of replacement values or nodes that span
the incomplete variable’s entire range. The M-step updates the parameters for the next
iteration by finding the estimates that average over the weighted pseudo-­imputations.
Each successive iteration of the EM algorithm gives better predictions about the miss-
ing values (in this case, encoded as the weighted pseudo-­imputations), which in turn
improve the estimates, which in turn sharpen the missing values, and so on. Interested
readers can consult Lüdtke et al. (2020a) for a worked example.
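The node-and-weight logic can be sketched for a single case with a missing predictor. This is not Lüdtke et al.'s implementation, just a minimal numpy illustration of the E-step idea with hypothetical parameter values for the model Y = b0 + b1·X + e and a normal predictor distribution:

```python
import numpy as np

def pseudo_impute(y, nodes, b0, b1, res_var, x_mean, x_var):
    """E-step-style calculation for one case with X missing.

    Weights each grid node by its likelihood contribution,
    f(y | x) * f(x), then averages over the weighted nodes.
    """
    fy_given_x = np.exp(-0.5 * (y - (b0 + b1 * nodes)) ** 2 / res_var)
    fx = np.exp(-0.5 * (nodes - x_mean) ** 2 / x_var)
    w = fy_given_x * fx
    w = w / w.sum()              # normalize the node weights
    return np.sum(w * nodes)     # expected value of the missing X

nodes = np.linspace(-4, 10, 201)   # fixed grid spanning X's range
# Hypothetical current parameter values for this iteration
ex = pseudo_impute(y=6.0, nodes=nodes, b0=1.0, b1=0.8,
                   res_var=1.0, x_mean=3.0, x_var=4.0)
print(ex)  # pulled above the X mean of 3 because y is large
```

In a full EM implementation, these weighted nodes (not the single expected value) would feed the M-step's parameter updates, and the cycle would repeat until the estimates stabilize.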

Analysis Example
Table 3.11 shows the maximum likelihood analysis results for both models. I omit the
supporting predictor model estimates, because they are not the substantive focus. Start-
ing with the probit regression, the Wald test of the full model was statistically significant,
TW(3) = 25.14, p < .001, meaning that the estimates are at odds with the null hypothesis
that all three population slopes equal zero. Each slope coefficient reflects the expected
z-score change in the latent response variable for a one-unit increase in the predictor,
controlling for other regressors. For example, the leader–­member exchange coefficient
indicates that a one-unit increase in relationship quality is expected to decrease the
latent proclivity to quit by 0.07 z-score units (β̂1 = –0.07, SE = .03), holding other predic-
tors constant.

TABLE 3.11. Probit and Logistic Regression Estimates

Parameter          Est.    RSE      z       p     OR
Probit regression
β0                 1.08    0.39    2.75    .01     —
β1 (LMX)          –0.07    0.02   –3.16  < .01     —
β2 (EMPOWER)      –0.03    0.02   –1.98    .05     —
β3 (MALE)         –0.05    0.11   –0.45    .65     —
R²                  .08     .03    2.75    .01     —
Logistic regression
β0                 1.82    0.66    2.78    .01     —
β1 (LMX)          –0.12    0.04   –3.11  < .01   0.89
β2 (EMPOWER)      –0.05    0.03   –1.94    .05   0.95
β3 (MALE)         –0.09    0.19   –0.47    .64   0.92
R²                  .07     .03    2.66    .01     —

Note. RSE, robust standard error; OR, odds ratio; LMX, leader–member exchange.

Turning to the logistic regression results, the Wald test of the full model was again
statistically significant, and the test statistic’s numerical value was comparable to that
of the probit model, TW(3) = 24.13, p < .001. Each slope coefficient now reflects the
expected change in the log odds of quitting for a one-unit increase in the predictor,
holding all other covariates constant. For example, the leader–­member exchange slope
indicates that a one-unit increase in relationship quality decreases the log odds of quit-
ting by 0.12 (β̂1 = –0.12, SE = .04), controlling for employee empowerment and gender.
Although the rule of thumb is not quite as precise with missing data, the logistic coef-
ficients are roughly 1.7 times larger than the probit slopes. Exponentiating each slope
gives an odds ratio that reflects the multiplicative change in the odds for a one-unit
increase in a predictor (e.g., a one-point increase on the leader–­member exchange scale
multiplies the odds of quitting by 0.89). The analysis results highlight that probit and
logistic models are effectively equivalent and almost always lead to the same conclu-
sions. Some researchers favor logistic framework, because it yields odds ratios, but there
is otherwise little reason to prefer one approach to the other.
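Using the slope estimates reported in Table 3.11, both the odds ratio and the rough 1.7 scaling can be reproduced with a couple of lines of standard-library Python:

```python
import math

logit_slope = -0.12    # leader-member exchange slope, logistic model
probit_slope = -0.07   # same coefficient from the probit model

# Exponentiating the logit slope gives the reported odds ratio
odds_ratio = math.exp(logit_slope)
print(round(odds_ratio, 2))  # 0.89

# The logit slope is roughly 1.7 times the probit slope
print(round(logit_slope / probit_slope, 1))  # 1.7
```

The same exponentiation applies to any logistic slope in the table, and a one-point decrease in a predictor simply inverts the ratio (1 / 0.89 ≈ 1.13).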

3.12 SUMMARY AND RECOMMENDED READINGS

This chapter extended maximum likelihood estimation to missing data problems. The
mechanics of estimation are largely the same as Chapter 2, where the goal was to iden-
tify estimates that maximize fit to the data. When confronted with missing values, the
estimator does not discard incomplete data records, nor does it impute them. Rather, it
identifies the parameter values with maximum support from whatever data are avail-
able, with some participants contributing more than others. Maximum likelihood anal-
yses have evolved considerably in recent years. The estimators that were widely available
when I was writing the first edition of this book were generally limited to multivariate
normal data. This is still a common (and effective) assumption for missing data analy-
ses, but flexible estimation routines that accommodate mixtures of categorical and con-
tinuous variables are now available. In particular, the factored regression strategy intro-
duced in this chapter is pivotal throughout the rest of the book, as Bayesian estimation
and contemporary multiple imputation routines leverage the same specification.
Speaking of which, Bayesian estimation is the next topic. Chapter 4 describes the
philosophical underpinnings of the Bayesian statistical paradigm and the Markov chain
Monte Carlo (MCMC) estimator for complete data. Chapter 5 extends MCMC estima-
tion to missing data problems, and Chapter 6 introduces models for mixtures of numeri-
cal and categorical variables. Like maximum likelihood estimation, the primary goal of
a Bayesian analysis is to fit a model to the data and use the resulting estimates to inform
one’s substantive research questions. However, unlike maximum likelihood, Bayes-
ian estimation repeatedly fills in or imputes the missing values en route to getting the
parameter values. As you will see, multiple imputation—­the third pillar of the book—
uses Bayesian MCMC to create several filled-­in data sets that are reanalyzed with fre-
quentist methods. Finally, I recommend the following for readers who want additional
details on topics from this chapter.

Arbuckle, J. L. (1996). Full information estimation in the presence of incomplete data. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling. Mahwah, NJ: Erlbaum.

Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 10, 80–100.

Lüdtke, O., Robitzsch, A., & West, S. G. (2020). Analysis of interactions and nonlinear effects with missing data: A factored regression modeling approach using maximum likelihood estimation. Multivariate Behavioral Research, 55, 361–381.

Rubin, D. B. (1991). EM and beyond. Psychometrika, 56(2), 241–254.

Savalei, V. (2010). Expected versus observed information in SEM with incomplete normal and nonnormal data. Psychological Methods, 15, 352–367.

Savalei, V., & Rosseel, Y. (2021, April 14). Computational options for standard errors and test statistics with incomplete normal and nonnormal data. Structural Equation Modeling: A Multidisciplinary Journal. [Epub ahead of print] https://doi.org/10.31234/osf.io/wmuqj
4

Bayesian Estimation

4.1 CHAPTER OVERVIEW

Bayesian analyses have gained a strong foothold in social and behavioral science disci-
plines in the last decade or so (Andrews & Baguley, 2013; van de Schoot, Winter, Ryan,
Zondervan-­Zwijnenburg, & Depaoli, 2017), and user-­friendly tools for conducting these
analyses are now widely available in software programs (e.g., Stan: Gelman, Lee, & Guo,
2015; Blimp: Keller & Enders, 2021; Mplus: Muthén & Muthén, 1998–2017; Jeffreys’s
Amazing Statistics Program [JASP]: Wagenmakers, Love, et al., 2018). Whereas the first
edition of this book viewed the Bayesian framework through a narrow lens—as an esti-
mation method co-opted for a specific type of multiple imputation—­this edition takes
the much broader view that Bayesian analyses are an alternative to maximum likelihood
estimation.
Stepping back and considering the organization of the book, Bayesian analyses are
a bridge connecting maximum likelihood to multiple imputation. Like maximum like-
lihood estimation, the primary goal of a Bayesian analysis is to fit a model to the data
and use the resulting estimates to inform the substantive research questions. When
confronted with missing values, maximum likelihood uses the normal curve to deduce
the missing parts of the data as it iterates to a solution (technically, the estimator mar-
ginalizes over the missing values). Bayesian estimation has more of a multiple imputa-
tion flavor, because it fills in the missing values en route to getting the parameters. Like
maximum likelihood, missing data handling happens behind the scenes, and imputa-
tion (implicit or explicit) is just a means to a more important end, which is to learn
something from the parameter estimates.
This chapter takes a hiatus from missing data issues to outline Bayesian analyses
with complete data. The goal of the chapter is to provide a user-­friendly introduction
to the Bayesian paradigm that serves as a springboard for accessing any number of the
specialized textbooks (Gelman et al., 2014; Hoff, 2009; Kaplan, 2014; Levy & Mislevy,
2016; Lynch, 2007; Robert & Casella, 2004) and the many tutorial articles on the topic


(Abrams, Ashby, & Errington, 1994; Casella & George, 1992; Jackman, 2000; Kruschke
& Liddell, 2018; Lee & Wagenmakers, 2005; Sorensen & Vasishth, 2015; Stern, 1998;
Wagenmakers, Marsman, et al., 2018). Additionally, I try to provide a practical recipe
for implementing Bayesian analyses that generalizes to different analysis models and
software packages.
Following the structure of Chapter 2, I start with a univariate analysis and build
to a linear regression model. As you will see, the Bayesian analyses give results that are
numerically equivalent to those of maximum likelihood, but the interpretations of the
estimates and measures of uncertainty require a different philosophical lens. I spend a
good deal of time describing Markov chain Monte Carlo (MCMC) estimation, as diagnosing whether the iterative algorithm is working properly is a vital part of applying
Bayes estimation (especially with missing data). The final section of the chapter covers
a multivariate normal data analysis comprising a mean vector and covariance matrix.
Collectively, these examples provide the building blocks for understanding more com-
plex analyses, and this chapter sets up foundational concepts that appear throughout
the remainder of the book. As you will see, all the major ideas readily generalize to
missing data in Chapter 5.

4.2 WHAT MAKES BAYESIAN STATISTICS DIFFERENT?

A key distinction between the Bayesian framework and the classic frequentist paradigm
that predominates in many disciplines is how they define a parameter. The frequentist
approach defines a parameter as a fixed value (e.g., the true mean that would result from
collecting data from the entire population of interest). The goal of a frequentist analysis
is to estimate the parameter and establish a confidence interval around that estimate.
The standard error, which is integral to this process, quantifies the expected variability
of an estimate across many different random samples. Defining a parameter as a fixed
quantity leads to some important subtleties. For example, when describing a 95% con-
fidence interval, it is incorrect to say that there is a 95% probability that the parameter
falls between values of A and B, because the confidence interval from any single sample
either contains the parameter or it does not. Rather, the correct interpretation describes
the expected performance of the interval across many repeated samples; if we drew 100
samples from a population and constructed a 95% confidence interval around the esti-
mate from each sample, 95 of those intervals should include the true population param-
eter. In a similar vein, the probability value from a frequentist significance test describes
the proportion of repeated samples that would yield a test statistic greater than or equal
to that of the sample data. In both situations, concepts of variation and probability apply
to the sample data and estimates, not to the parameter itself.
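The repeated-sampling interpretation above can be demonstrated with a short simulation; the population mean and standard deviation used below (50 and 10) are arbitrary illustrative values, not quantities from the book.

```python
import random
import statistics

# Frequentist coverage: draw many samples from a fixed population, build a
# 95% confidence interval around each sample mean, and count how often the
# interval contains the true mean. Any single interval either covers it or not.
rng = random.Random(42)
true_mu, true_sd, n, reps = 50.0, 10.0, 100, 2000
hits = 0
for _ in range(reps):
    sample = [rng.gauss(true_mu, true_sd) for _ in range(n)]
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    if m - 1.96 * se <= true_mu <= m + 1.96 * se:
        hits += 1
coverage = hits / reps
print(coverage)  # close to .95 across repeated samples
```

Across the 2,000 replications, roughly 95% of the intervals capture the fixed population mean, which is exactly the long-run statement the frequentist interval licenses.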
In contrast, the Bayesian paradigm views a parameter as a random variable that has
a distribution, just like any other variable we might measure. For example, a Bayesian
analysis defines the mean as a normally distributed variable where some realizations
are more plausible than others given the data. This distribution is called a posterior,
because it is constructed after observing the data (and in conjunction with a prior distribution). One of the goals of a Bayesian analysis is to use familiar measures of central
tendency and dispersion to describe the shape of the posterior distribution. For exam-
ple, the posterior mean (or median) quantifies the parameter’s most likely value and
functions like a frequentist point estimate. The posterior standard deviation describes
the distribution’s spread and is analogous to a frequentist standard error in the sense
that it describes uncertainty about the parameter after observing the data. However, it
is important to note that the Bayesian notion of uncertainty does not involve the hypo-
thetical process of drawing many random samples and counting the frequency of differ-
ent estimates. Rather, this form of uncertainty reflects our subjective degree of belief or
knowledge about the parameter after collecting and analyzing data (Kopylov, 2008; Levy
& Mislevy, 2016; O’Hagan, 2008).
Viewing a parameter as a random variable also has important implications for infer-
ence. For example, a Bayesian credible interval (the analogue to a frequentist confi-
dence interval) allows us to say that there is a 95% probability that the parameter falls
between values of A and B. This interpretation is very different from that of the frequen-
tist approach, because it attaches the probability statement to the parameter, not to the
hypothetical estimates from different samples of data. By extension, Bayesian probabil-
ity values refer directly to the parameter of interest and not to a collection of hypotheti-
cal estimates from different samples. For example, the probability that a slope parameter
is positive is simply the area of the posterior distribution above 0. These interpretations
are made possible by invoking Bayes’ theorem and a prior distribution that represents
our a priori beliefs about the parameter before data collection. The next section provides
a conceptual, equation-­free description of a simple Bayesian analysis on which I expand
in later sections.

4.3 CONCEPTUAL OVERVIEW OF BAYESIAN ESTIMATION

A Bayesian analysis consists of three major steps: (1) Specify a prior distribution for
the parameter of interest, (2) use a likelihood function to summarize the data’s evi-
dence about different parameter values, and (3) combine information from the prior and
the likelihood to generate a posterior distribution that describes the relative probabil-
ity of different parameter values given the data. Finally, familiar statistics such as the
mean and standard deviation summarize the posterior’s center and spread, respectively.
Rewinding back to Chapter 2, I emphasized that the likelihood is not a probability dis-
tribution, because the area under the function does not equal 1. Conceptually, you can
think of these three steps as a recipe for converting a likelihood function to a proper
probability distribution.
The remainder of this section gives a conceptual description of the three steps in
the context of a simple univariate analysis in which the goal is to estimate the popula-
tion proportion. Because the goal is to introduce the underlying logic behind Bayesian
estimation, I am purposefully vague about many of the mathematical details. Neverthe-
less, the steps for this simple analysis provide a recipe for applying Bayesian estimation
to more complex analyses.

The Prior Distribution


Specifying a prior distribution for the parameter of interest is a necessary and unavoid-
able aspect of a Bayesian analysis. The prior distribution describes your subjective beliefs
about the relative probability of different parameter values before collecting any data. To
illustrate, suppose that two researchers use Bayesian methodology to estimate a popula-
tion proportion, π. To provide a substantive context, suppose π is the rate of postpartum
depression in the population, although it could be the incidence of any binary character-
istic of interest. The prior distribution specifies the relative probability of every possible
population proportion. After conducting a literature review, Researcher A believes that
depression rates between .10 and .15 are very likely, and she feels that the relative prob-
ability rapidly decreases as the proportion approaches 0 or 1. The solid curve in Figure
4.1 reflects this prior belief. Notice that the highest point of the distribution is located
near π = .13, and the relative probability (i.e., the height of the curve) quickly decreases
as π approaches 0 or 1. In contrast, Researcher B is uncomfortable speculating about dif-
ferent parameter values, so he assigns an equal weight to every value between 0 and 1.
The flat line in Figure 4.1 depicts this researcher’s prior beliefs. The Bayesian literature
often refers to Researcher B’s prior distribution as a noninformative prior, because it
reflects a lack of knowledge about the parameter.

FIGURE 4.1. Two prior distributions for the population proportion. The solid curve repre-
sents an informative prior that specifies some values as more likely than others. The dashed line
is a noninformative prior where all parameter values are equally likely.

Some readers may be wondering about the origin of the prior distributions in Fig-
ure 4.1. Researchers often adopt a conjugate prior distribution that belongs to the same
family as the likelihood function, as doing so simplifies the conversion of the likeli-
hood function to a probability distribution. The binomial likelihood function for binary
outcome data (see Equation 2.1) is a member of the beta distribution family. The beta
distribution’s shape is proportional to

f(π) ∝ π^(a−1) (1 − π)^(b−1)    (4.1)

where f(π) is the height of the curve or the vertical coordinate at a particular value of π,
and a and b are constants that define the distribution’s shape (e.g., larger values of a and
b produce a distribution with greater spread, and the distribution becomes asymmetric
when a ≠ b). The informative prior in Figure 4.1 corresponds to a = 7 and b = 40, and the
flat prior aligns with a = 1 and b = 1. As you know from Chapter 2, a probability distri-
bution contains a scaling factor that makes the area under the curve sum or integrate
to 1. I drop unnecessary scaling terms whenever possible and use a “proportional to”
symbol to indicate that an expression omits one or more constants. This simplifies the
math without affecting the distribution’s shape.
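The shape claims above are easy to confirm by evaluating the unnormalized kernel in Equation 4.1 directly; the small helper function below is an illustration written for this purpose, not code from the book.

```python
# Evaluate the unnormalized beta kernel from Equation 4.1 for the
# informative prior with a = 7 and b = 40. The closed-form mode of a beta
# distribution, (a - 1) / (a + b - 2), confirms the peak near pi = .13.
def beta_kernel(p, a, b):
    return p ** (a - 1) * (1 - p) ** (b - 1)

a, b = 7, 40
mode = (a - 1) / (a + b - 2)
heights = {p: beta_kernel(p, a, b) for p in (0.05, 0.13, 0.30)}
print(round(mode, 2))  # 0.13
```

The kernel height at π = .13 exceeds the heights at .05 and .30, matching the solid curve in Figure 4.1, and the dropped scaling constant plays no role in these comparisons.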

The Likelihood Function


Like maximum likelihood, Bayesian analyses assign a probability distribution to the
data. For a binary outcome (e.g., clinical diagnosis vs. not) this is a Bernoulli distribu-
tion (see Equation 2.1). After collecting a sample of data, the observations become fixed
(i.e., they function like known constants) and a likelihood function summarizes the
data’s evidence about different parameter values. Maximum likelihood principles from
Chapter 2 also appear prominently in Bayesian estimation, sans the natural logarithm.
To illustrate, suppose that the two researchers recruited a sample of N = 100 participants
and found that nD = 7 individuals possessed the characteristic of interest, in this case
a clinical depression diagnosis. The binomial distribution (equivalent to the Bernoulli
likelihood from Chapter 2) is a convenient likelihood for the data, because it belongs
to the same beta distribution family shown in Equation 4.1. Figure 4.2 shows the likeli-
hood function for this sample of data. For interested readers, the graph corresponds to
a beta distribution with shape parameters a = nD + 1 and b = N – nD + 1. Substituting the
observed data (e.g., 7 out of 100 “cases”) and a value for π into the function returns a
vertical coordinate that conveys the data’s evidence about a particular parameter value
on the horizontal axis. The maximum likelihood estimate, π̂ = .07, is located at the peak
of the likelihood function.
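As a check on this claim, a simple grid search over the binomial likelihood (an illustrative sketch, not the book's code) recovers the same peak.

```python
# Grid search over the binomial likelihood for 7 "cases" out of N = 100;
# the curve peaks at the sample proportion, matching the maximum
# likelihood estimate of .07 reported in the text.
def likelihood(p, cases=7, n=100):
    return p ** cases * (1 - p) ** (n - cases)

grid = [i / 1000 for i in range(1, 1000)]  # candidate values of pi
mle = max(grid, key=likelihood)
print(mle)  # 0.07
```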

The Posterior Distribution


The final step of a Bayesian analysis is to derive or estimate the posterior distribution of
the parameter. This step converts the likelihood function to a probability distribution.
FIGURE 4.2. Likelihood function displaying the relative probability of observing seven
“cases” (e.g., the number of new mothers with a postpartum depression diagnosis) in a sample
of N = 100, given the assumed population value on the horizontal axis.

The posterior distribution is essentially a composite function that combines information from the prior and the likelihood. I describe the posterior in more detail later in the
chapter, but the conceptual idea is to weight each point on the likelihood function by the
magnitude of the corresponding point on the prior distribution. For example, attach-
ing a high prior probability to a particular parameter value would increase the height of
the likelihood function at that point on the horizontal axis. Conversely, assigning a low
prior probability to a particular parameter value would decrease the height of the likeli-
hood function at that point.
To illustrate, recall that Researcher B specified a noninformative prior where all
parameter values are equally likely (the flat dashed line in Figure 4.1). This prior speci-
fication assigns the same constant weight to every point on the likelihood function, so
the resulting posterior distribution is identical to the likelihood. Figure 4.3 shows this
posterior distribution as a dashed curve. Prior to collecting data, Researcher A assigned
a high probability to depression rates between π = .10 and .15, but the maximum likeli-
hood estimate from the data was somewhat lower at π̂ = .07. Researcher A’s posterior
distribution blends her prior beliefs with information from the data, giving the solid
curve in Figure 4.3. Comparing the two distributions, you can see that Researcher A’s
posterior is less elevated at π = .05, because she assigned a low prior weight to this
parameter value, and it is slightly elevated at π = .15, because she assigned a high prior
probability to this value. Both functions describe the probability of different parameter
values given the observed data.

FIGURE 4.3. Two posterior distributions of a population proportion π. The dashed curve cor-
responds to a flat prior that assigns the same constant weight to every point on the likelihood
function (the dashed curve in Figure 4.1). The solid curve assigns a high prior probability to
values between π = 0.10 and 0.15 (the solid curve in Figure 4.1).

Describing the center and spread of the posterior distribution is a primary goal
of a Bayesian analysis; this step is analogous to computing a point estimate and stan-
dard error. Researchers often use the mean or median to characterize the most likely
parameter value (the latter might be preferable, since the distribution is asymmetrical),
and they use the standard deviation to quantify spread or uncertainty. This example
is relatively straightforward, because these summary quantities are simple functions
of the beta distribution’s shape parameters, a and b. Without getting into specifics, the
mean and median of Researcher A’s posterior distribution are Mπ = .095 and Mdnπ = .093,
respectively, and the posterior standard deviation is SDπ = .024. In contrast, Researcher
B’s posterior mean and median are Mπ = .078 and Mdnπ = .076, respectively, and the
standard deviation is SDπ = .026. Researcher A’s posterior distribution is centered at a
somewhat higher value, because her prior distribution assigned high weights to param-
eter values between .10 and .15 (the strong prior information also reduced the spread).
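These summaries follow from the conjugate beta update: combining a Beta(a, b) prior with binomial data yields a Beta(a + cases, b + noncases) posterior. The sketch below, written for illustration rather than taken from the book, reproduces the reported means and standard deviations in closed form and approximates the medians by simulation (the standard library has no beta quantile function).

```python
import random
import statistics

def beta_posterior_summary(a_prior, b_prior, cases, n, draws=200_000, seed=1):
    """Conjugate update for a binomial likelihood with a beta prior."""
    a_post = a_prior + cases
    b_post = b_prior + (n - cases)
    # Closed-form mean and standard deviation of a beta distribution
    mean = a_post / (a_post + b_post)
    var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    # Median approximated with Monte Carlo draws from the posterior
    rng = random.Random(seed)
    median = statistics.median(rng.betavariate(a_post, b_post) for _ in range(draws))
    return mean, var ** 0.5, median

mean_a, sd_a, med_a = beta_posterior_summary(7, 40, 7, 100)  # informative prior
mean_b, sd_b, med_b = beta_posterior_summary(1, 1, 7, 100)   # flat prior
print(round(mean_a, 3), round(sd_a, 3))  # .095 and .024
print(round(mean_b, 3), round(sd_b, 3))  # .078 and .026
```

The medians land near .093 and .076, matching the values reported in the text.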
As a comparison, a frequentist analysis with maximum likelihood estimation gives
a point estimate and standard error of π̂ = .070 and SE = .026. Although these quantities
are numerically identical to Researcher B’s posterior mode and standard deviation, they
have different interpretations. For example, π̂ is our best guess about the true population
proportion, and its standard error quantifies the expected variability of point estimates
across many different random samples. The Bayesian analysis, on the other hand, makes
no reference to repeated samples. Rather, a probability distribution is the device for con-
veying our current knowledge about the parameter; the posterior distribution’s center
and spread characterize the most likely realization of the parameter and the degree of
knowledge or uncertainty about the parameter after analyzing the data, respectively.

4.4 BAYES’ THEOREM

Bayes’ theorem describes the relationship between two conditional probabilities; that
is, the theorem provides a rule that says how to get from the probability of B given A to
the probability of A given B. Applied to statistics, Bayes’ theorem is the machinery that
converts a likelihood function to a probability distribution. For two generic events, A
and B, the theorem is
Pr(B | A) = [Pr(B) × Pr(A | B)] / Pr(A)    (4.2)
where Pr(B|A) is the conditional probability of observing event B given that event A has
already occurred, Pr(A|B) is the conditional probability of A given B, and Pr(A) and Pr(B)
are the marginal probabilities of A and B (i.e., the probability of A without reference to
B and vice versa).
Applying Bayes’ theorem to probability distributions gives
f(θ | data) = [f(θ) × f(data | θ)] / f(data)    (4.3)
where θ is the parameter of interest, and “f of something” references the height of a func-
tion or curve at some point on its horizontal axis. Ignoring the term in the denominator
for the moment, the expression reflects the three components of the previous analysis:
f(θ) is the prior distribution (e.g., Figure 4.1), f(data|θ) corresponds to the likelihood
(e.g., Figure 4.2), and f(θ|data) is the posterior distribution of the parameter given the
data (e.g., Figure 4.3). To clarify some notation, recall that the probability distribution
of the data and the likelihood function are the same function with different arguments
treated as varying and constant; that is, after collecting a sample of data, the observa-
tions in f(data|θ) become fixed and a likelihood function summarizes the data’s evi-
dence about different parameter values. The data distribution and likelihood function
are proportional and differ by some constant that makes f(data|θ) a proper probability
distribution.
As you know, probability distributions such as the normal curve or beta distribu-
tion include scaling terms that do not affect their shape but make the area under the
curve sum or integrate to 1. The denominator of Equation 4.3—the marginal probability
of the data across many different realizations of the parameter—serves this exact pur-
pose. Because the parameter of interest does not appear in the denominator, the scaling
term is usually dropped from the expression as follows:

f(θ | data) ∝ f(θ) × f(data | θ)    (4.4)


In words, Equation 4.4 says that the posterior distribution is proportional to the product
of the prior and the likelihood. The “proportional to” symbol conveys the idea that the
posterior distribution on the left has the same basic shape as the product on the right.
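A grid approximation makes this statement concrete: multiply the prior and likelihood heights pointwise over a fine grid of parameter values, then rescale so the products sum to 1. The sketch below (illustrative code using the proportion example from earlier in the chapter) shows that the resulting distribution behaves like the conjugate posterior.

```python
# Equation 4.4 in action: posterior ∝ prior × likelihood, renormalized.
grid = [i / 1000 for i in range(1, 1000)]                   # candidate pi values
prior = [p ** (7 - 1) * (1 - p) ** (40 - 1) for p in grid]  # Beta(7, 40) kernel
like = [p ** 7 * (1 - p) ** 93 for p in grid]               # 7 cases out of N = 100
post = [pr * li for pr, li in zip(prior, like)]             # prior × likelihood
total = sum(post)
post = [v / total for v in post]                            # proper distribution
post_mean = sum(p * w for p, w in zip(grid, post))
print(round(post_mean, 3))  # ≈ .095, the posterior mean reported earlier
```

Because the normalizing constant is just the grid sum, dropping it changes nothing about the distribution's shape, which is exactly what the "proportional to" symbol conveys.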

4.5 THE UNIVARIATE NORMAL DISTRIBUTION

Looking forward, I devote considerable attention to analyses featuring normally distributed outcome variables. The normal curve is a reasonable approximation for many con-
tinuous variables in the behavioral and social sciences, and it also appears prominently
later in the book as a latent response distribution for categorical variables (Albert &
Chib, 1993; Johnson & Albert, 1999). A univariate analysis example is a useful starting
point, because the basic estimation principles from this simple context readily general-
ize to more complicated analyses. To that end, I use the math posttest scores from the
math achievement data set on the companion website to illustrate how to estimate the
mean and variance in the Bayesian framework. The data set includes pretest and posttest
math achievement scores and academic-­related variables (e.g., math self-­efficacy, stan-
dardized reading scores, sociodemographic variables) for a sample of N = 250 students
(see Appendix).
The univariate analysis can be written as an empty regression model as follows:

Yi = μ + εi = E(Yi) + εi    (4.5)
Yi ~ N1(E(Yi), σ^2)
I used the same model in Chapter 2 to introduce maximum likelihood estimation, and
the Bayesian analysis in this section recycles some of that earlier material. To refresh
notation, N1 denotes a univariate normal distribution function, and the first and sec-
ond terms inside parentheses are the mean and variance parameters, respectively. The
expected value, E(Yi), corresponds to the grand mean in this example, but you can think
about it more generally as a predicted value.
Before getting into specifics, let’s apply the concept from Equation 4.4—the poste-
rior distribution is proportional to the product of the prior and the likelihood—­to this
example. Replacing the generic functions with the quantities from the analysis example
gives the following expression:

f(μ, σ^2 | data) ∝ f(μ) × f(σ^2) × f(data | μ, σ^2)    (4.6)

Consistent with Equation 4.4, the leftmost term is the posterior distribution, f(μ) and
f(σ2) are prior distributions, and the rightmost term is the probability distribution of the
data (or equivalently, the likelihood once the data are collected). This analysis moves
closer to a realistic application, because there are multiple parameters; the posterior is
a bivariate distribution that describes the relative probability of different combinations
of μ and σ2 given the data.

Probability Distribution and Likelihood Function


The data distribution or likelihood function is a good place to start, because it often
informs the choice of prior distribution (e.g., it is usually convenient to adopt conjugate
prior distributions that belong to the same family). Revisiting information from Chapter
2, the probability distribution is

f(Yi | μ, σ^2) = (1 / √(2πσ^2)) exp(−(Yi − μ)^2 / (2σ^2))    (4.7)
where Yi is the outcome score for participant i (e.g., a student’s math posttest score), and
μ and σ2 are the population mean and variance, respectively. To reiterate some impor-
tant notation, the function on the left side of the equation can be read as “the relative
probability of a score given assumed values for the parameters.” Visually, “f of Y” is the
height of the normal curve at a particular score value on the horizontal axis. The joint
probability of N observations (or the likelihood of the sample data) is the product of the
individual contributions.
N  1 ( Y − μ )2 
( )
f data | μ, σ2 = ( )
∝ L μ, σ2 | data
1
∏ exp  − i
 (4.8)
( )  
N /2
2πσ2 2 σ2
i =1  
The scaling terms to the left of the product operator ensure that the area under the curve
sums to 1, and I simplify the expression by dropping these terms when possible.

Prior Distributions
Specifying prior distributions for the parameters is a key step in any Bayesian analysis.
We could specify an informative prior if we had a priori knowledge about the mean
(e.g., from a pilot study or meta-­analysis), but I focus on noninformative (or weakly
informative) prior distributions that exert as little influence as possible on the results.
Specialized Bayesian texts and methodological research studies describe other possibili-
ties (Chung, Gelman, Rabe-­Hesketh, Liu, & Dorie, 2015; Gelman, 2006; Gelman et al.,
2014; Kass & Wasserman, 1996; Liu, Zhang, & Grimm, 2016).
The data distribution and likelihood often inform the choice of prior distribution,
because it is convenient to work from the same distribution family (e.g., the binomial
likelihood and beta prior from the earlier analysis example). There are at least two ways
to implement a noninformative prior for the mean. We know from introductory sta-
tistics that the frequentist sampling distribution of the mean is a normal curve, and it
ends up that the posterior distribution of μ is also normal. To invoke a conjugate prior
distribution that imparts very little information, we could specify a normal prior with
an arbitrary mean and a very large variance (e.g., a normal curve with μ0 = 0 and σ0^2 =
10,000). The mean and variance of the prior are sometimes called hyperparameters.
Setting the spread to a very large number effectively produces a flat distribution
over the range of the data (e.g., the math posttest scores range from 32 to 85). Consis-
tent with the earlier example, we could also specify a uniform prior that is flat over the
entire range of the mean. I adopt this approach, because it yields the same result as the
conjugate prior but simplifies some of the ensuing math. A flat prior distribution assigns
every possible value of the mean the same a priori weight of 1.

f(μ) ∝ 1    (4.9)

Adopting the normal distribution for the data induces a positively skewed inverse
gamma distribution for the variance. The following expression illustrates the linkage:
f(X) ∝ X^(−(a+1)) exp(−b × (1/X)) = (σ^2)^(−(N/2 + 1)) exp(−(1/2) ∑_{i=1}^{N} (Yi − μ)^2 × (1/σ^2))    (4.10)
The left side of the equation is the generic expression for the inverse gamma distri-
bution, and f(X) means that the height of the function varies with different values of
X. Viewing the variance as the X in the normal distribution function and rearranging
terms gives the expression to the right side of the equals sign (I omit the scaling factor
2π). Visually, the inverse gamma is a positively skewed distribution bounded at 0. The
shape parameter a = N/2 controls the height of the distribution, with larger values of N
(the degrees of freedom) resulting in a more peaked distribution with thinner tails. The
scale parameter b = SS/2 controls the distribution’s spread, which increases as the sum
of squares (SS) increases.
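Although the chapter has not yet shown how such values are generated, one common device is worth sketching: an inverse gamma draw can be obtained by inverting a gamma draw. The shape and scale below (a = 5, b = 10) are arbitrary illustrative values; in the analysis they would be N/2 and SS/2.

```python
import random
import statistics

# If G ~ Gamma(shape a, scale 1/b), then 1/G follows an inverse gamma
# distribution with shape a and scale b, whose mean is b / (a - 1).
rng = random.Random(11)
a, b = 5.0, 10.0
draws = [1.0 / rng.gammavariate(a, 1.0 / b) for _ in range(100_000)]
ig_mean = statistics.fmean(draws)
print(round(ig_mean, 1))  # ≈ b / (a - 1) = 2.5
```

The simulated mean lands on the closed-form value b/(a − 1), and a histogram of the draws would reproduce the positively skewed, zero-bounded shape described above.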
To invoke a conjugate prior distribution for the variance, you would specify an
inverse gamma distribution with hyperparameters df0 and SS0. Conceptually, the degrees
of freedom value df0 can be viewed as the number of imaginary data points used to get
the a priori sum of squares value SS0. Allowing these quantities to approach 0 gives the
so-­called Jeffreys prior distribution for the variance (Jeffreys, 1946, 1961; Kass & Was-
serman, 1996; Lynch, 2007).

f(σ^2) ∝ 1 / σ^2    (4.11)

Figure 4.4 depicts this prior distribution for parameter values between 0 and 20. You
can see that the prior is somewhat informative, because the relative probability values
rapidly increase as the variance approaches 0. The Jeffreys prior is equivalent to specify-
ing a uniform distribution for the natural logarithm of the variance, as the natural log
stretches the exponential curve out to negative infinity.

The Posterior Distribution


We now have all the pieces to construct the posterior distribution of the parameters.
Multiplying the priors by the likelihood gives the following posterior distribution:

f(μ, σ^2 | data) ∝ f(μ) × f(σ^2) × f(data | μ, σ^2)    (4.12)

FIGURE 4.4. Jeffreys prior distribution for the variance. The distribution is somewhat
informative, because the relative probability increases rapidly as the variance approaches 0.

Note that the function on the left now reads “the relative probability of the parameters given the data.” Substituting the relevant expression for each component on the right
side gives the following posterior distribution:
f(μ, σ^2 | data) ∝ 1 × (1/σ^2) × (1 / (2πσ^2)^(N/2)) ∏_{i=1}^{N} exp(−(Yi − μ)^2 / (2σ^2))    (4.13)
Dropping the scaling constant 2π and combining terms involving the variance further
simplifies the expression.
f(μ, σ^2 | data) ∝ (σ^2)^(−(N/2 + 1)) ∏_{i=1}^{N} exp(−(Yi − μ)^2 / (2σ^2))    (4.14)
Equation 4.14 is a bivariate distribution that describes the relative probability of
different combinations of μ and σ2 given the data. The bivariate distribution isn’t nec-
essarily useful for inference, because it intertwines two parameters. We usually want
univariate summaries that reflect the marginal distribution of one parameter without
regard for the other (e.g., the goal is to characterize the most likely value of the mean
without considering the variance, and vice versa). In general, deriving marginal distribu-
tions requires integral calculus, and the complexity of the calculations quickly becomes
intractable with more than a small handful of parameters. Instead, researchers use an
MCMC procedure called the Gibbs sampler (Casella & George, 1992; Jackman, 2000) to estimate the posterior distributions. MCMC estimation readily scales to accommodate complex analysis models with many parameters, and the algorithm's basic machinery is
largely the same as it is for this simple example.

4.6 MCMC ESTIMATION WITH THE GIBBS SAMPLER

The Gibbs sampler breaks a complex multivariate problem into a series of simpler uni-
variate steps that iteratively estimate one parameter at a time, treating the current values
of the remaining parameters as known constants. Gelfand and Smith (1990) are often
credited with popularizing the Gibbs sampler as a flexible tool for Bayesian estimation,
and descriptions of the algorithm are widely available in specialized textbooks (Gelman
et al., 2014; Hoff, 2009; Kaplan, 2014; Levy & Mislevy, 2016; Lynch, 2007; Robert &
Casella, 2004) and tutorial articles (Casella & George, 1992; Jackman, 2000; Smith &
Roberts, 1993). I rely heavily on the Gibbs sampler throughout the rest of the book.
To illustrate the underlying logic of MCMC estimation with the Gibbs sampler, con-
sider a statistical analysis with three parameters, θ1, θ2, and θ3. We need to track changes
in the parameters within an iteration and between successive iterations to fully decon-
struct the algorithm, so I use t = 1, 2, . . . , T to index these repetitive computational
cycles. The Gibbs sampler recipe for the three-­parameter analysis example is as follows:

Assign starting values to θ1(0), θ2(0), θ3(0) at iteration t = 0.


Do for t = 1 to T iterations.
> Estimate θ1(t) conditional on θ2(t–1) and θ3(t–1) and the data.
> Estimate θ2(t) conditional on θ1(t) and θ3(t–1) and the data.
> Estimate θ3(t) conditional on θ1(t) and θ2(t) and the data.
Repeat.

With additional parameters, each iteration continues in a round-robin fashion until all
quantities have been updated.
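A toy implementation can make this recipe concrete. The sketch below is a standard tutorial-style illustration rather than a model from this chapter: it samples a bivariate standard normal with correlation rho by alternating between the two univariate full conditionals, each of which is N(rho × other, 1 − rho^2).

```python
import random
import statistics

# A minimal two-parameter Gibbs sampler: update theta1 given the current
# theta2, then theta2 given the new theta1, storing one draw per iteration.
rng = random.Random(7)
rho, T = 0.6, 20_000
cond_sd = (1 - rho ** 2) ** 0.5     # SD of each full conditional
theta1, theta2 = 0.0, 0.0           # starting values at iteration t = 0
chain1 = []
for t in range(T):
    theta1 = rng.gauss(rho * theta2, cond_sd)  # theta1 | theta2, data
    theta2 = rng.gauss(rho * theta1, cond_sd)  # theta2 | theta1, data
    chain1.append(theta1)
chain_mean = statistics.fmean(chain1)
chain_sd = statistics.stdev(chain1)
print(round(chain_mean, 2), round(chain_sd, 2))  # marginal mean ≈ 0, SD ≈ 1
```

Even though every step conditions on the other parameter, the saved chain for theta1 recovers its marginal distribution, which is the property that makes the draws usable for inference.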
Three points are worth highlighting. First, each step estimates one parameter at a
time, holding all other parameters constant at their current values. The iteration super-
scripts show that some of the current values originate from the same iteration, while oth-
ers carry over from the previous cycle (e.g., the estimation step for θ2 depends on θ1 from
the current iteration and θ3 from the prior iteration). Second, the order of the estimation
steps usually doesn’t matter, and you should get the same results if you estimate θ3 first,
θ1 second, and θ2 third, for example. Third, the algorithm requires initial estimates (i.e.,
starting values) of all model parameters prior to the first iteration. In many situations,
these initial guesses need not be sophisticated (e.g., initial values of the mean and variance
could be set to 0 and 1, respectively). Remarkably, the Gibbs sampler usually converges
around a more reasonable set of estimates in just a few computational cycles, but accurate
starting values can reduce the number of iterations required to achieve this steady state.
Before going further, I need to clarify the meaning of estimation in this context.
Recall from Chapter 2 that maximum likelihood uses iterative algorithms that successively adjust parameters until the estimates no longer change from one iteration to the
next. In contrast, the Gibbs sampler uses Monte Carlo computer simulation (the second
“MC” in MCMC) to “sample” or “draw” plausible parameter values at random from a
probability distribution. For example, the MCMC algorithm in the next section esti-
mates the mean by generating a random number from a normal distribution, and it
estimates the variance by drawing random numbers from a right-­skewed inverse gamma
distribution. I sometimes refer to these MCMC-­generated estimates as synthetic param-
eter values to differentiate them from estimates produced by an analytic solution such as
least squares. Of course, computers are quite adept at generating random numbers, and
all statistical software programs have built-in functions for this purpose. Several acces-
sible tutorial articles are available for readers who want more information about Monte
Carlo simulation (e.g., Morris, White, & Crowther, 2019; Muthén & Muthén, 2002;
Paxton, Curran, Bollen, Kirby, & Chen, 2001).
A typical MCMC chain consists of thousands of iterations, each computational
cycle producing a unique set of parameter values. Unlike maximum likelihood, which
identifies a single set of optimal estimates for the data, MCMC creates parameter values
that continually change as the algorithm iterates (e.g., running MCMC for T = 10,000
iterations gives a posterior distribution of 10,000 plausible parameter values). Whereas
iterative optimization routines for maximum likelihood estimation are akin to a hiker
trying to climb to the highest possible elevation on a mountain, MCMC estimation is
more like an explorer describing the geography of the entire mountain. The procedure
naturally produces the posterior distributions needed for inference, because each set of
parameter estimates marginalizes over all other parameters (e.g., the posterior distribu-
tion of θ1 is taken over many different realizations of θ2 and θ3, so we can talk about θ1
without regard for the others). As explained previously, familiar descriptive statistics
such as the median and standard deviation describe the center and spread of each pos-
terior distribution, respectively, and the estimates at the 2.5 and 97.5% quantiles of the
distribution define a 95% credible interval.

4.7 ESTIMATING THE MEAN AND VARIANCE WITH MCMC

Having sketched the basic ideas, I show how to use MCMC to estimate the posterior
distributions of the mean and variance. The Gibbs sampler alternates between two steps:
Estimate the mean given the current value of the variance, then update the variance
given the latest value of the mean (the order of the steps typically doesn’t matter). The
recipe below summarizes the algorithmic steps:

Assign starting values to all parameters.


Do for t = 1 to T iterations.
> Estimate the mean conditional on the variance from the prior iteration.
> Estimate the variance conditional on the updated mean.
Repeat.

Each estimation step draws a synthetic parameter value at random from a probabil-
ity distribution that treats the other parameter as a known constant. Mechanically, you
get these full conditional distributions by multiplying the prior and the likelihood, then
doing some tedious algebra to express the product as a function of a single unknown. I
give these distributions below and point readers to specialized Bayesian texts for addi-
tional details on their derivations (e.g., Hoff, 2009; Lynch, 2007).
First, MCMC estimates the mean by drawing a random number from the univariate
normal conditional distribution below:
 σ 2 
( )
f μ | σ2 ,data ∝ N1  Y ,  (4.15)
 N 
The arithmetic mean of the data in the first term defines the distribution’s center, and
the familiar expression for the squared standard error of the mean (i.e., the sampling
variance) defines the spread in the second term. Following van Buuren (2012), the dot
accent on σ̇² indicates a synthetic variance estimate from an earlier MCMC step (this
value functions like a known constant in the normal distribution equation). Figure 4.5
shows the normal distribution that generates the mean estimates.
Next, the algorithm samples an estimate of the variance from a positively skewed
inverse gamma distribution that conditions on (treats as known) the newly minted syn-
thetic mean. The full conditional distribution for this step is as follows:
$$f\left(\sigma^2 \mid \dot\mu,\ \mathrm{data}\right) \propto \mathrm{IG}\!\left(\frac{N}{2},\ \frac{1}{2}\sum_{i=1}^{N}\left(Y_i - \dot\mu\right)^2\right) \tag{4.16}$$

The first and second terms in the function are shape and scale parameters, respectively (sometimes denoted a and b). The shape parameter determines the height of the distribution, with larger values of N (the degrees of freedom) resulting in a more peaked distribution with thinner tails. The scale parameter controls the distribution's spread, which increases as the sum of squares around μ̇ increases. Visually, the inverse gamma looks like a chi-square. To illustrate, Figure 4.6 shows the inverse gamma distribution that generates the variance estimates.

FIGURE 4.5. Conditional posterior distribution of the mean from the univariate analysis example. The MCMC algorithm estimates the mean by drawing a number at random from this normal distribution.

FIGURE 4.6. Conditional posterior distribution of the variance from the univariate analysis example. The MCMC algorithm estimates the variance by drawing a number at random from this inverse gamma distribution.
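The two estimation steps in Equations 4.15 and 4.16 are simple enough to code by hand. The companion website provides an R program for this purpose; the Python sketch below is my own minimal translation (function name and defaults are illustrative, not from the book). Note that an inverse gamma draw is obtained by taking the reciprocal of a gamma random variable.

```python
import numpy as np

def gibbs_mean_variance(y, n_iter=11_000, burn_in=1_000, seed=1):
    """Gibbs sampler for a mean and variance with a flat prior on the
    mean and a Jeffreys prior on the variance (Equations 4.15-4.16)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    sigma2 = 1.0                              # crude starting value
    draws = []
    for t in range(n_iter):
        # Step 1: mu | sigma2, data ~ N(ybar, sigma2 / n)
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))
        # Step 2: sigma2 | mu, data ~ IG(n / 2, sum((y - mu)^2) / 2);
        # the inverse gamma draw is the reciprocal of a gamma draw
        scale = np.sum((y - mu) ** 2) / 2.0
        sigma2 = 1.0 / rng.gamma(n / 2.0, 1.0 / scale)
        if t >= burn_in:                      # discard the burn-in draws
            draws.append((mu, sigma2))
    return np.array(draws)
```

Summarizing the retained draws with medians, standard deviations, and quantiles then reproduces the kind of posterior summary reported later in the chapter.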

Analysis Example
I use the math achievement data on the companion website to illustrate a Bayesian anal-
ysis involving the mean and variance. The empty regression model is

$$\mathrm{MATHPOST}_i = \mu + \varepsilon_i \tag{4.17}$$

Analysis scripts are available on the companion website, including a custom R program
for readers who are interested in coding the algorithm by hand. I specified an MCMC
chain consisting of T = 11,000 iterations, and I discarded the results from the first 1,000
iterations. Discarding estimates from this so-called burn-in interval gives the algorithm
time to recover from its starting values and converge to a trustworthy steady state. I
discuss diagnostic tools for evaluating convergence and determining the length of this
initial interval later in the chapter.
Each MCMC iteration “estimates” the mean and variance by generating random
numbers from a normal curve and inverse gamma distribution (see Figures 4.5 and 4.6).
The normal distribution in Figure 4.5 is centered at the mean of the data, and its spread
is a function of the sample size and the variance estimate from the previous iteration.
Drawing a random number from this distribution gives a new estimate of the mean. The
inverse gamma distribution in Figure 4.6 is right skewed, such that its center is a func-
tion of the sample size or degrees of freedom, and its spread is determined by the sum of
squares of the data around the current mean estimate. Drawing a random number from
this distribution gives a new estimate of the variance. Conceptually, the process of gen-
erating random numbers is akin to wearing a blindfold and throwing a dart at a picture
of each distribution. For any throw that lands under the curve, the location of the dart
on the horizontal axis is the new parameter value. Naturally, you would be more likely
to hit the peaked areas of the curve and less likely to hit the areas in the tails, but over
the course of many throws, the darts would land throughout the entire distribution.
One feature that sets Bayesian estimation apart from maximum likelihood is that
it yields an entire distribution of estimates for each parameter (i.e., it maps the entire
geography of the hill rather than climbing directly to its peak). Furthermore, because
estimates are drawn at random from a distribution of plausible values, they usually don’t
change in a systematic direction from one iteration to the next. To illustrate, Table 4.1
shows the parameters from the first 10 MCMC iterations. Notice that the estimates oscil-
late up and down, and the changes between successive iterations are seemingly random,
with no pattern to their direction or magnitude. This behavior contrasts with maximum
likelihood estimation, which makes large adjustments early in the iterative process and
very small changes later as estimates approach their optimum values (see Table 2.4).

TABLE 4.1. Estimates from 10 MCMC Iterations

Iteration      μ        σ²
    1        56.33    94.54
    2        56.59    85.97
    3        57.43    93.91
    4        56.00    87.30
    5        57.32    91.10
    6        56.72    91.02
    7        56.73    71.09
    8        57.22    91.96
    9        57.07    85.63
   10        56.86    78.18

Turning to the posterior summaries, Figure 4.7 gives a kernel density plot (a fine-
grained histogram with a smoothed line connecting the tops of the bars) of the 10,000
mean estimates, and Figure 4.8 is the corresponding plot of the variances. The mean’s
distribution is approximately normal, whereas the variance’s distribution has a slight
positive skew. The slightly irregular shapes are not a function of my subpar graphing
skills, but instead reflect the random nature of the estimation process. Had I increased
the number of iterations (e.g., from 10,000 to 100,000), the distributions would be
smoother with fewer bumps. The kernel density plots are subtly different from the
conditional distributions in Figures 4.5 and 4.6. Whereas the conditional distributions
reflect variations in one parameter with the other held constant, the kernel density plots
display the marginal posterior distribution of each parameter taken over many different
realizations of the other. Readers familiar with calculus may recognize that the Gibbs
sampler is performing integration (marginalization) via brute force.
As described earlier, simple descriptive statistics characterize the center and spread
of the distributions. The solid vertical lines are the median estimates, and the dashed
lines denote the 95% credible intervals (the parameter values above and below which
2.5% of the distribution falls). The median intercept value is Mdnμ = 56.79, and its standard deviation is SDμ = 0.59. The 95% credible interval for this parameter spans from approximately 55.64 to 57.98. The posterior median and standard deviation of the variance are Mdnσ² = 88.89 and SDσ² = 7.96, and its 95% credible interval ranges from 74.63 to 105.89. You can see that the credible interval in Figure 4.8 is not symmetrical, because the posterior distribution is positively skewed.

FIGURE 4.7. Kernel density plot of 10,000 mean estimates. The solid vertical line is the posterior median, and the dashed vertical lines denote the 95% credible interval limits.

FIGURE 4.8. Kernel density plot of 10,000 variance estimates. The solid vertical line is the posterior median, and the dashed vertical lines denote the 95% credible interval limits.
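In code, these posterior summaries are one-liners. The sketch below uses hypothetical normal draws in place of actual Gibbs output (the numbers mimic the mean's posterior from this example) and computes the median, standard deviation, and quantile-based 95% credible interval:

```python
import numpy as np

# Hypothetical vector of 10,000 retained draws of the mean; in practice
# this comes from the MCMC chain after discarding the burn-in interval.
rng = np.random.default_rng(0)
mu_draws = rng.normal(56.79, 0.59, size=10_000)

posterior_median = np.median(mu_draws)
posterior_sd = np.std(mu_draws, ddof=1)
lcl, ucl = np.quantile(mu_draws, [0.025, 0.975])  # 95% credible interval
```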
It is instructive to compare the Bayesian summaries to the corresponding maximum
likelihood estimates from Chapter 2. The maximum likelihood estimate of the mean
was μ̂ = 56.79, its standard error was SE = 0.59, and the 95% confidence interval ranged
from 55.63 to 57.95. The point estimate and standard error are effectively identical to the
posterior mean and standard deviation, and the confidence interval boundaries closely
match the Bayesian credible interval. Despite their numerical similarities, the interpreta-
tions of these quantities differ in important ways. For example, the frequentist standard
error represents the expected variation of the estimate across many different random
samples, whereas the posterior standard deviation reflects subjective uncertainty about
the parameter after analyzing the data. Similarly, the 95% confidence interval conveys the
expected performance of many such intervals computed from different random samples,
whereas the 95% credible interval gives a range of high certainty about the parameter.
Turning to the variance, the maximum likelihood estimate and its standard error were
σ̂2 = 87.72 and SE = 7.85, and the 95% confidence interval ranged from 72.34 to 103.09.
Apart from its interpretation, the width of the 95% confidence interval also differs from
the Bayesian credible interval, because the maximum likelihood interval assumes that
sampling variation follows a normal curve (this is only true at very large sample sizes).
This assumption yields confidence intervals that are symmetrical around the point esti-
mate, whereas the Bayesian interval captures the long right tail of the distribution.
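To make the contrast concrete, the snippet below compares a quantile-based credible interval with a symmetric, normal-theory interval built from the same draws. The draws are hypothetical right-skewed values (an inverse gamma chosen only to mimic the variance's posterior), not output from the actual analysis:

```python
import numpy as np

# Hypothetical right-skewed posterior draws for a variance parameter;
# the shape and scale are chosen only to mimic the example's posterior.
rng = np.random.default_rng(0)
var_draws = 1.0 / rng.gamma(125.0, 1.0 / (125.0 * 88.89), size=10_000)

# A quantile-based 95% credible interval follows the skewed shape ...
lcl, ucl = np.quantile(var_draws, [0.025, 0.975])
# ... whereas a symmetric interval around the center does not
center, sd = np.mean(var_draws), np.std(var_draws, ddof=1)
symmetric = (center - 1.96 * sd, center + 1.96 * sd)
```

With a right-skewed posterior, the upper credible limit sits farther from the center than the lower limit, mirroring the asymmetry described above.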

4.8 LINEAR REGRESSION

Having established the basics of Bayesian inference, we can readily extend the pro-
cedure to linear regression. As you will see, the previous concepts generalize to this
analysis with virtually no modifications, because estimation still relies on the univariate
normal curve. A single-predictor model is a useful starting point, because the posterior distribution of the coefficients can be visualized in a three-dimensional graph. The
simple regression model is

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = E\left(Y_i \mid X_i\right) + \varepsilon_i \tag{4.18}$$
$$Y_i \sim N_1\!\left(E\left(Y_i \mid X_i\right),\ \sigma_\varepsilon^2\right)$$
where E(Yi|Xi) is the predicted value for individual i (i.e., the expected value or mean of
Y given a particular X score), the tilde means “distributed as,” N1 denotes the univariate
normal distribution function (i.e., the probability distribution in Equation 4.7), and the
conditional mean and residual variance are the distribution’s two parameters. In words,
the bottom row of the expression states that outcome scores are normally distributed
around a regression line with constant residual variation.
Switching gears to a different substantive context, I use the smoking data from the
companion website to illustrate a multiple regression analysis. The data set includes
several sociodemographic correlates of smoking intensity from a survey of N = 2,000
young adults (e.g., age, whether a parent smoked, gender, income). To facilitate graphing,
I start with a simple regression model where the parental smoking indicator (0 = parents
did not smoke, 1 = parent smoked) predicts smoking intensity (higher scores reflect more
cigarettes smoked per day):

$$\mathrm{INTENSITY}_i = \beta_0 + \beta_1\left(\mathrm{PARSMOKE}_i\right) + \varepsilon_i \tag{4.19}$$

The intercept represents the expected smoking intensity score for a respondent whose
parents did not smoke, and the slope is the group mean difference. The analysis example
later in this section expands the model to include additional explanatory variables. For
now, there is no need to specify a distribution for explanatory variables, so any vari-
ables on the right side of the equation function like constants, as they do in ordinary
least squares and maximum likelihood estimation. This feature has no bearing on a
complete-data regression analysis (Jackman, 2009), but we will need to specify a distribution for incomplete predictor variables in the next chapter (the same was true for
maximum likelihood).
Before getting into specifics, let’s again apply the idea from Equation 4.4—the pos-
terior distribution is proportional to the product of the prior and the likelihood—­to the
regression model. Replacing the generic functions with the quantities from the analysis
example gives the following expression:

$$f\left(\boldsymbol{\beta}, \sigma_\varepsilon^2 \mid \mathrm{data}\right) \propto f\left(\boldsymbol{\beta}, \sigma_\varepsilon^2\right) \times f\left(\mathrm{data} \mid \boldsymbol{\beta}, \sigma_\varepsilon^2\right) \tag{4.20}$$

where the leftmost term is the posterior distribution, β represents the vector of regres-
sion coefficients, f(β, σε2) denotes the prior distributions, and the rightmost term is the
distribution of the outcome variable (or equivalently, the likelihood once the data are
collected). The equation gives the relative probability of different combinations of the
coefficients and residual variance given the data (in this case, data refers to the dependent
variable, as the predictor functions like a known constant). Visually, “f of the parameters
given the data” is the height of a multivariate surface at different combinations of values
in β and σε2. Consistent with the previous section, we can use the Gibbs sampler to esti-
mate the marginal posterior distribution of each parameter.

Probability Distribution and Likelihood Function


The data distribution or likelihood function is again a good place to start, because it
often informs the choice of prior distribution (e.g., it is usually convenient to adopt
conjugate prior distributions that belong to the same family). Linear regression lever-
ages the univariate normal distribution function from Equation 4.7. The only differ-
ence is that a predicted value replaces μ and a residual variance replaces σ2. After col-
lecting a sample of data, the likelihood function is a product of N normal distribution
functions.

$$f\left(\mathrm{data} \mid \boldsymbol{\beta}, \sigma_\varepsilon^2\right) = L\left(\boldsymbol{\beta}, \sigma_\varepsilon^2 \mid \mathrm{data}\right) \propto \frac{1}{\left(2\pi\sigma_\varepsilon^2\right)^{N/2}} \prod_{i=1}^{N} \exp\!\left(-\frac{\left(Y_i - E\left(Y_i \mid X_i\right)\right)^2}{2\sigma_\varepsilon^2}\right) \tag{4.21}$$
To reiterate the notation, the function on the left side of the equation reads “the relative
probability of the data given assumed values for the parameters.” Visually, each individ-
ual’s contribution to “f of Y” is the height of the conditional normal curve that describes
the spread of scores around a particular point on the regression line (e.g., the normal
distribution of smoking intensity scores for participants who share the same value of
the parental smoking indicator). As before, some of the terms to the left of the product
operator comprise a scaling factor that I ignore whenever possible.
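For readers who like to see the formula as code, here is a small sketch of the log of Equation 4.21 (working on the log scale avoids the numerical underflow that comes from multiplying N small densities; the function name is illustrative):

```python
import numpy as np

def regression_loglik(y, X, beta, sigma2):
    """Log of the likelihood in Equation 4.21: the sum of univariate
    normal log-densities of the outcome around the regression line."""
    resid = y - X @ beta                     # Y_i - E(Y_i | X_i)
    n = len(y)
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) \
           - np.sum(resid ** 2) / (2.0 * sigma2)
```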

Prior Distributions
As explained previously, the data distribution and likelihood often inform the choice
of prior distribution, because it is convenient to work from the same distribution fam-
ily. There are at least two ways to implement a noninformative prior for the regression
coefficients. To invoke conjugate prior distributions that impart very little information,
we could again specify normal priors with μ0 = 0 and σ02 = 10,000, as this would yield
distributions that are effectively flat over the range of the data. Alternatively, we could
adopt a uniform prior that is flat over the entire range of each coefficient in β. Following
the univariate analysis, I use a uniform prior for the coefficients and a Jeffreys prior for
the residual variance (Jeffreys, 1946, 1961). These distributions are f(β) ∝ 1 and f(σε²) ∝ (σε²)⁻¹.

Posterior Distribution, MCMC Algorithm, and Conditional Distributions
The posterior is a multivariate distribution that describes the relative probability of dif-
ferent combinations of the parameters given the observed data. As prescribed by Bayes’
theorem, the posterior distribution is the product of the prior and the likelihood.

$$f\left(\boldsymbol{\beta}, \sigma_\varepsilon^2 \mid \mathrm{data}\right) \propto f\left(\boldsymbol{\beta}\right) \times f\left(\sigma_\varepsilon^2\right) \times f\left(\mathrm{data} \mid \boldsymbol{\beta}, \sigma_\varepsilon^2\right) \tag{4.22}$$

To reiterate notation, the function on the left now reads “the relative probability of the
parameters given the data.” Bayes’ theorem has converted the likelihood to a probability
distribution.
As explained previously, the Gibbs sampler breaks a complex multivariate problem
into a series of simpler univariate problems, each of which draws a synthetic param-
eter value at random from a probability distribution that treats all other parameters as
known constants. MCMC estimation for linear regression follows a two-step recipe:
Estimate the coefficients in β as a block given the current value of the residual variance,
then update the variance given the latest coefficients (again, the order of the steps typi-
cally doesn’t matter). The recipe below summarizes the algorithmic steps.

Assign starting values to all parameters and missing values.


Do for t = 1 to T iterations.
> Estimate coefficients conditional on the residual variance.
> Estimate the residual variance conditional on the updated coefficients.
Repeat.

Each estimation step draws synthetic parameter values at random from a probability dis-
tribution. Mechanically, you get these full conditional distributions by multiplying the
prior and the likelihood, then doing some tedious algebra to express the product as a func-
tion of a single unknown. I give these distributions below and point readers to specialized
Bayesian texts for additional details on their derivations (e.g., Hoff, 2009; Lynch, 2007).
First, the MCMC algorithm estimates regression coefficients by drawing a vector
of random numbers from a multivariate normal conditional distribution. With only two
coefficients, we can visualize the conditional posterior distribution of β0 and β1 in three
dimensions. Figure 4.9 shows the bivariate normal distribution of the intercept and slope.
The angle of the distribution owes to the fact that the coefficients are negatively corre-
lated (i.e., a larger mean difference requires a lower comparison group average). More
formally, the shape of the conditional distribution is given by the following equations:

$$f\left(\boldsymbol{\beta} \mid \dot\sigma_\varepsilon^2,\ \mathrm{data}\right) \propto N_{K+1}\!\left(\hat{\boldsymbol{\beta}},\ \mathbf{S}_{\hat{\boldsymbol{\beta}}}\right) \tag{4.23}$$
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{Y} = \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$$
$$\mathbf{S}_{\hat{\boldsymbol{\beta}}} = \dot\sigma_\varepsilon^2\left(\mathbf{X}'\mathbf{X}\right)^{-1}$$

where K is the number of predictors, N_{K+1} denotes a normal distribution with K + 1 dimensions or variables, Y is the vector of N outcome scores, and X denotes the N × (K + 1) matrix of explanatory variables that includes a column of ones for the intercept (e.g., for the simple regression, X is a matrix with N = 2,000 rows and two columns).
The formulas that define the distribution’s mean vector (the peak of the distribution in
Figure 4.9) and covariance matrix (the distribution’s spread and rotation) are identical
to ordinary least squares solutions. As a reminder, a dot accent denotes a synthetic esti-
mate from a previous MCMC step (van Buuren, 2012).
Next, the algorithm samples an estimate of the residual variance from a positively
skewed inverse gamma distribution that conditions on the updated coefficients. The full
conditional distribution is

$$f\left(\sigma_\varepsilon^2 \mid \dot{\boldsymbol{\beta}},\ \mathrm{data}\right) \propto \mathrm{IG}\!\left(\frac{N}{2},\ \frac{\left(\mathbf{Y} - \mathbf{X}\dot{\boldsymbol{\beta}}\right)'\left(\mathbf{Y} - \mathbf{X}\dot{\boldsymbol{\beta}}\right)}{2}\right) \tag{4.24}$$

where (Y − Xβ̇)′(Y − Xβ̇) is the matrix expression for the residual sum of squares. This distribution is identical to Equation 4.16 except that the scale parameter in the function's second argument features the residual sum of squares rather than the sum of squares around the mean. Visually, the distribution resembles the curve in Figure 4.6.

FIGURE 4.9. Conditional distribution of the intercept and slope from a simple regression analysis. The MCMC algorithm estimates the coefficients by drawing a pair of numbers at random from this bivariate normal distribution.
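As with the univariate example, the two-step recipe for regression is compact enough to sketch in a few lines. The Python function below is my own illustrative translation of Equations 4.23 and 4.24 (the companion website's scripts use R), assuming a design matrix X whose first column is ones:

```python
import numpy as np

def gibbs_regression(X, y, n_iter=11_000, burn_in=1_000, seed=1):
    """Gibbs sampler for linear regression with a flat prior on the
    coefficients and a Jeffreys prior on the residual variance
    (Equations 4.23-4.24)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    xtx_inv = np.linalg.inv(X.T @ X)  # (X'X)^-1, reused every iteration
    beta_ols = xtx_inv @ X.T @ y      # least squares center of Eq. 4.23
    sigma2 = 1.0                      # crude starting value
    draws = []
    for t in range(n_iter):
        # Step 1: coefficients | sigma2 ~ multivariate normal (Eq. 4.23)
        beta = rng.multivariate_normal(beta_ols, sigma2 * xtx_inv)
        # Step 2: residual variance | beta ~ inverse gamma (Eq. 4.24)
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / (resid @ resid))
        if t >= burn_in:
            draws.append(np.append(beta, sigma2))
    return np.array(draws)            # columns: beta_0..beta_K, sigma2
```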

Analysis Example
To illustrate Bayesian estimation for a multiple regression, I expanded the previous
model to include age and income as predictors. I centered the additional variables at
their grand means to maintain the intercept’s interpretation as the expected smoking
intensity score for a respondent whose parents did not smoke.

$$\mathrm{INTENSITY}_i = \beta_0 + \beta_1\left(\mathrm{PARSMOKE}_i\right) + \beta_2\left(\mathrm{AGE}_i - \mu_{\mathrm{AGE}}\right) + \beta_3\left(\mathrm{INCOME}_i - \mu_{\mathrm{INC}}\right) + \varepsilon_i \tag{4.25}$$

You may recall from the corresponding maximum likelihood analysis that the smok-
ing intensity distribution has substantial positive skewness and kurtosis. Frequentist
corrections like robust standard errors do not apply to the Bayesian framework, but the
posterior distributions are not constrained to be a particular shape and will naturally
vary in reaction to the data. Even with this flexibility, the Bayesian estimates won’t
be materially different from those of normal-theory maximum likelihood. Chapter 10 describes a promising procedure for modeling non-normal distributions (Lüdtke et al.,
2020b; Yeo & Johnson, 2000).
Consistent with the previous example, I specified an MCMC chain with T = 11,000
iterations and discarded the results from the first 1,000 iterations. Analysis scripts are
available on the companion website, including a custom R program for readers inter-
ested in coding the algorithm by hand. Discarding the burn-in or warm-up cycles will
be standard practice moving forward, as doing so allows the algorithm to recover from
its starting values and converge to a trustworthy steady state. I discuss convergence
and present diagnostic tools for determining the length of this initial period in the next
section. In the interest of space, I omit kernel density plots, because the distributions
look like Figures 4.7 and 4.8. Instead, Table 4.2 gives a tabular summary of the analysis
that includes the posterior median, standard deviation, and 95% credible interval limits
(the .025 and .975 quantiles of each distribution). From a substantive perspective, the
interpretation of the coefficients is the same as in a least squares or maximum likelihood analysis.
For example, the intercept (Mdnβ0 = 9.09) is the expected number of cigarettes smoked
per day for a respondent whose parents didn’t smoke, and the parental smoking indi-
cator slope (Mdnβ1 = 2.91) is the mean difference, controlling for age and income. The
standard deviations of the coefficients are analogous to frequentist standard errors in
the sense that they reflect our uncertainty or degree of knowledge about the parameters
after analyzing the data, but they do so without reference to other hypothetical samples
from the population.
Table 4.2 also gives maximum likelihood estimates as a comparison. The point
estimates and normal-theory standard errors were effectively numerically equivalent
to the posterior median and standard deviation, respectively, and the 95% confidence
interval boundaries closely match the Bayesian credible intervals.

TABLE 4.2. Posterior Summary of the Multiple Regression Analysis

Bayesian estimation

Parameter        Mdn     SD     LCL     UCL
β0               9.09    0.13    8.84    9.34
β1 (PARSMOKE)    2.91    0.19    2.54    3.28
β2 (AGE)         0.59    0.04    0.51    0.67
β3 (INCOME)     –0.10    0.03   –0.15   –0.05
σε²             17.18    0.54   16.15   18.30
R²               0.19    0.02    0.17    0.22

Maximum likelihood

Parameter        Est.    SE     LCL     UCL
β0               9.09    0.13    8.85    9.34
β1 (PARSMOKE)    2.91    0.19    2.55    3.28
β2 (AGE)         0.59    0.04    0.51    0.67
β3 (INCOME)     –0.10    0.03   –0.16   –0.05
σε²             17.15    0.54   16.08   18.21
R²               0.19    0.02    —       —

Note. LCL, lower credible or confidence limit (Bayesian and maximum likelihood, respectively); UCL, upper credible or confidence limit (Bayesian and maximum likelihood, respectively).

It is important to reiterate that the interpretations of these quantities differ in important ways. For
example, the 95% confidence intervals convey the expected performance of many such
intervals computed from different random samples, whereas the 95% credible inter-
vals give a range of high certainty about the parameters. Frequentist hypothesis tests
would declare the three slope coefficients statistically significant, because a null value
of zero falls outside their 95% confidence intervals. The 95% credible intervals similarly
suggest that zero is an unlikely value for the slope parameters. The Bayesian analysis
also allows you to assign probability values to parameters. For example, the lowest β1
coefficient in the distribution of 10,000 estimates was greater than zero, from which
we can conclude that the probability that the parameter is positive (individuals whose
parents smoked have higher smoking intensity scores) is effectively 100%. This state-
ment makes no sense in the frequentist framework, because the parameter is fixed in
the population.
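Computing such a probability from MCMC output is a one-line operation: the proportion of retained draws exceeding zero estimates the posterior probability that the parameter is positive. The sketch below uses hypothetical draws in place of the actual chain summarized in Table 4.2:

```python
import numpy as np

# Hypothetical posterior draws of the parental smoking slope; the real
# draws would come from the MCMC run summarized in Table 4.2.
rng = np.random.default_rng(0)
beta1_draws = rng.normal(2.91, 0.19, size=10_000)

prob_positive = np.mean(beta1_draws > 0)  # estimates P(beta1 > 0 | data)
```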

4.9 ASSESSING CONVERGENCE OF THE GIBBS SAMPLER

Recall that the goal of maximum likelihood is to find the optimal parameter values for
the data. An iterative optimization routine like Newton’s algorithm or EM is akin to a
hiker that climbs to the highest possible elevation on a hill as quickly as possible. The
hike ends when parameter estimates no longer change from one iteration to the next.
In contrast, Bayesian estimation generates an entire distribution of plausible parameter
values for the data. The hiker in the analogy is now tasked with mapping the geography
of a hill. Because the algorithm draws parameters at random from a distribution, esti-
mates will continually change for as long as the algorithm is running—­the MCMC hiker
will dutifully map the hill until you tell it to stop.
In the context of MCMC estimation, convergence means that the iterative algorithm
is generating estimates that form a stable distribution; that is, running the algorithm for
additional iterations does not change the mean and variance of the estimates. Further-
more, we say that the algorithm is “mixing” well if it’s producing values throughout the
entire range of the distribution. Software programs that implement maximum likeli-
hood have automated rules for determining when the estimator has converged (e.g., stop
when the largest change in any parameter from one iteration to the next is less than
.00001), but we need to be more proactive and involved when using Bayesian estima-
tion. As a rule, it is good practice to perform a preliminary diagnostic run to examine
convergence and mixing.
As you might imagine, Bayesian methodologists have proposed many techniques for
assessing the convergence of iterative algorithms like the Gibbs sampler (Brooks & Gel-
man, 1998; Cowles & Carlin, 1996; Gelman & Rubin, 1992; Geweke, 1992; Geyer, 1992;
Johnson, 1996; Mykland, Tierney, & Yu, 1995; Raftery & Lewis, 1992; Ritter & Tanner,
1992; Zellner & Min, 1995), and authors of specialized texts discuss these methods in
detail (Gelman et al., 2014; Hoff, 2009; Kaplan, 2014; Lynch, 2007; Robert & Casella,
2004). I focus primarily on line graphs known as trace plots and a numerical diagnostic
called the potential scale reduction factor (Brooks & Gelman, 1998; Gelman et al., 2014;
Gelman & Rubin, 1992), as both are readily available in Bayesian analysis software. I use
the regression analysis from Section 4.8 to illustrate these diagnostics.
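The potential scale reduction factor compares between- and within-chain variability; values near 1.0 indicate that parallel chains agree about a parameter's distribution. The sketch below follows one common formulation of the Gelman-Rubin statistic (a simplified version; production software adds refinements such as chain splitting):

```python
import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin potential scale reduction factor for one parameter.
    `chains` is a (number of chains) x (iterations per chain) array."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)        # between-chain variance
    within = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    # Pooled estimate of the marginal posterior variance
    var_plus = ((n - 1) / n) * within + between / n
    return np.sqrt(var_plus / within)
```

Values below roughly 1.05 to 1.10 are commonly taken as evidence of convergence; chains that drift apart, like the problematic example in Figure 4.14, push the statistic well above 1.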

Trace Plots
A trace plot is a line graph that displays the iterations on the horizontal axis and the
corresponding synthetic parameter values on the vertical axis. To illustrate, consider the
intercept parameter from the smoking intensity regression model. The previous analysis
produced 11,000 estimates of this parameter, and I discarded the results from the first 1,000
iterations. Figure 4.10 shows a trace plot of the parameter values from the first 500 itera-
tions following the burn-in period (i.e., iterations 1,001 to 1,500). The solid horizontal
line denotes the median parameter value across all 10,000 iterations, and the dashed
lines represent the 95% credible interval boundaries. The jagged pattern reflects the
fact that the algorithm is sampling parameter values at random from a distribution. To
emphasize this point, Figure 4.11 superimposes the posterior distribution on the trace
plot. You can see from this graph that the trace plot displays the location of the estimates
in the distribution in serial order.
The trace plot illustrates two important features, both of which suggest the algo-
rithm is working well. First, the algorithm appears to have reached a steady state,
because the intercept estimates oscillate around a flat line located at the center of the
distribution. Furthermore, the magnitude and variation of the random shocks are consistent with the 95% credible intervals. The absence of long-term vertical drifts (up or down) is exactly what you want to see, as this provides evidence that the estimates have converged around a stable mean (i.e., the algorithm is mapping values around the hill's peak). Second, the algorithm appears to be mixing well, because the 500 estimates are dispersed throughout the distribution's entire range; that is, the algorithm is mapping the geography of the entire hill without focusing too much on one region of the surface.

FIGURE 4.10. Trace plot of intercept estimates from the first 500 MCMC iterations following the burn-in period.

FIGURE 4.11. Posterior distribution of 10,000 estimates superimposed over a trace plot of intercept estimates from the first 500 MCMC iterations following the burn-in period.
Model parameters usually differ with respect to how quickly they achieve a steady
state and how long it takes MCMC to map their full range of values. For this reason, it is
important to examine trace plots for every model parameter. As a second example, Figure
4.12 shows a trace plot of residual variance estimates from the same 500-iteration interval,
and Figure 4.13 superimposes the full posterior distribution over the trace plot. The plot
looks a bit different, because the posterior is positively skewed, but you can see that the
parameter values have converged to a stable distribution and the algorithm is mixing well.
The previous figures are good prototypes for an ideal trace plot, but it is just as
instructive to see what a problematic plot looks like. Figure 4.14 plots 20,000 parameter
estimates from a different data set. Notice that the parameter, which happens to be a
regression slope, never converges to a stable distribution. Rather, the plot features long
periods of vertical drift (e.g., the increasing trend between iterations 10,000 and 15,000),
and it is nearly impossible to identify the center of the distribution. In this case, the lack
of convergence is caused by missing data, as the observed scores do not contain enough
information to estimate the model. Interestingly, trace plots for other parameters in the
same analysis look just fine, which underscores the importance of examining trace plots
for the full set of model parameters.
FIGURE 4.12. Trace plot of residual variance estimates from the first 500 MCMC iterations following the burn-in period.

FIGURE 4.13. Posterior distribution of 10,000 estimates superimposed over a trace plot of residual variance estimates from the first 500 MCMC iterations following the burn-in period.
FIGURE 4.14. Trace plot of a parameter that fails to converge.

Deciding on a Burn‑In Period


The previous trace plots indicate that the MCMC algorithm is working properly after
1,000 iterations, but they don’t tell us how quickly the algorithm settles into a steady
state. To illustrate what happens during the burn-in period, I ran the Gibbs sampler
twice (i.e., I ran two MCMC chains) using very different starting values. Figure 4.15
shows a trace plot of the residual variance estimate from the first 20 iterations. The solid
horizontal line denotes the median estimates across the final 10,000 iterations, and the
dashed lines represent the 95% credible interval boundaries. You can see that the initial
estimates (starting values) are quite different, but both chains quickly began producing
estimates that oscillate around the same center line. However, the chains are not yet
mixing well, because the algorithm has not sampled values throughout the distribu-
tion’s entire range (e.g., most of the estimates from the latter half of the chain are below
the median). Figure 4.16 extends the plot to display the first 200 iterations. Both chains
continue to hover around the same center line, and the estimates have mapped the entire
distribution, including the tails. The second plot suggests that the chains converge to
a stable distribution and are mixing well after only 200 iterations. The trace plots for
the other parameters showed similar patterns, suggesting that a 1,000-iteration burn-in
period is more than sufficient.
FIGURE 4.15. Trace plot of the residual variance from the first 20 iterations of two MCMC chains with different starting values.

FIGURE 4.16. Trace plot of the residual variance from the first 200 iterations of two MCMC chains with different starting values.

Looking forward to the next few chapters, a variety of features influence convergence
and mixing. Not surprisingly, missing values are influential and increase the
number of iterations required to achieve convergence (often dramatically). Perhaps
counterintuitively, applying the Gibbs sampler to large samples slows convergence,
because estimates that are highly precise require many iterations to thoroughly map
the area under the posterior distribution. As you will see, ordered categorical variables
often require very long burn-in periods, because they introduce additional threshold
parameters that determine the response proportions in each category. These are just a
few things that influence the Gibbs sampler, and I try to highlight these and other fea-
tures as we encounter them. Of course, specialized textbooks are also a good source of
practical information (Gelman et al., 2014; Hoff, 2009; Kaplan, 2014; Levy & Mislevy,
2016; Lynch, 2007; Robert & Casella, 2004).

Potential Scale Reduction Factor


Trace plots are simple and powerful tools for examining convergence and mixing, but
they are a bit like interpreting a Rorschach test—it takes practice to become confident
and efficient at making interpretations. The potential scale reduction factor (PSRF;
Gelman & Rubin, 1992) is a useful numerical diagnostic that is widely available in
Bayesian software programs. Computing the PSRF requires two or more MCMC chains.
The basic idea is that when two chains converge to the same stable distribution, their
means should be very similar, particularly when gauged against the magnitude of the
within-chain variation across iterations. In contrast, when two chains have not converged
or are not mixing well, the mean difference will be large relative to the within-chain
variation. The PSRF is intuitively appealing, because it defines each chain as a
group of estimates and uses familiar mean square expressions from one-factor analysis
of variance (ANOVA) designs to quantify between-chain mean differences and within-chain
noise variation.

FIGURE 4.17. Slope estimates from two MCMC chains composed of 10 iterations each. The dashed lines denote the mean estimates from each chain, and the solid horizontal line is the grand mean. The PSRF from the two chains is 1.12.
To illustrate the PSRF, consider the first 20 estimates of the parental smoking
regression slope (the β1 coefficient in Equation 4.25). Figure 4.17 shows a trace plot of
the estimates, with dashed horizontal lines showing each chain’s group average. Gelman
et al. (2014, pp. 284–285) recommend splitting MCMC chains in half and using the sec-
ond halves to compute the PSRF. Thus, the diagnostic considers the distributions from
two chains with 10 iterations each (i.e., iterations 11–20). The vertical separation of the
dashed lines is the between-chain mean difference, and the magnitude of the random
spikes around the mean lines reflect within-chain noise variation. The PSRF uses the
between-group mean square from ANOVA to quantify the between-chain mean differ-
ence
C

∑(θ )
T 2
Between-Chain=
Variance −θ (4.26)
(C − 1) c =1
c
Bayesian Estimation 179

where C is the total number of chains, T is the number of iterations per chain, \(\bar{\theta}_c\) is the
mean estimate from chain c, and \(\bar{\theta}\) is the grand mean (i.e., the mean of the chain means). The
within-group mean square from ANOVA quantifies the pooled within-chain variance:
\[ \text{Within-Chain Variance} = \frac{1}{C} \sum_{c=1}^{C} \hat{\sigma}^2_{\theta_c} \tag{4.27} \]

The variance of chain c’s estimates is computed by applying the sample variance formula
to the T estimates within each chain. The total variance of the estimates is a weighted
sum of the between- and within-chain variance:

\[ \text{Total Variance} = \left( \frac{T-1}{T} \times \text{Within-Chain} \right) + \left( \frac{1}{T} \times \text{Between-Chain} \right) \tag{4.28} \]
Finally, we have the components to define the PSRF:

\[ \text{PSRF} = \sqrt{ \frac{\text{Total Variance}}{\text{Within-Chain Variance}} } \tag{4.29} \]

The idea behind the PSRF is that when the two chains have converged to a stable
distribution and are mixing well, the between-­chain mean difference will be very small
relative to within-­chain variation, in which case the total variance in the numerator will
be similar to the denominator. In a hypothetically perfect scenario where two chains
have identical means, the between-­chain variation vanishes, and the fraction under the
radical is approximately equal to 1. Conversely, the ratio of total variance to within-­
chain variance grows increasingly larger as the mean difference increases. From this,
we see that lower PSRF values are better, and the best possible value equals 1. Rules of
thumb from popular Bayesian texts suggest that PSRF values less than 1.05–1.10 are
usually sufficient for practical applications (Gelman et al., 2014).
Returning to the regression analysis, the two chains in Figure 4.17 give a PSRF of
1.10, which is above the recommended threshold. The high PSRF value indicates that
the algorithm has not iterated long enough for the chains to converge and mix well.
Increasing the number of iterations should address this issue and reduce the PSRF. The
between-­chain mean difference essentially vanishes when comparing the second halves
of two chains of 200 iterations (i.e., iterations 101–200), and the PSRF drops to near its
theoretical minimum. Consistent with the trace plots, we should examine PSRF values
for all model parameters, as the diagnostic can vary dramatically from one parameter
to the next. Using the largest (worst) PSRF value to specify a burn-in period is often a
safe strategy.
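The computations in Equations 4.26–4.29 are straightforward to express in code. The following sketch (illustrative Python; the "chains" are synthetic draws rather than real Gibbs sampler output) computes the split-chain PSRF for an array of C chains by T retained iterations:

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman & Rubin, 1992) for an
    array of C chains by T retained iterations."""
    chains = np.asarray(chains, dtype=float)
    C, T = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    # Equation 4.26: between-chain variance (ANOVA between-group mean square)
    between = (T / (C - 1)) * np.sum((chain_means - grand_mean) ** 2)
    # Equation 4.27: pooled within-chain variance (sample variance per chain)
    within = chains.var(axis=1, ddof=1).mean()
    # Equation 4.28: total variance as a weighted sum of the two components
    total = ((T - 1) / T) * within + (1 / T) * between
    # Equation 4.29: the PSRF is the square root of the variance ratio
    return np.sqrt(total / within)

rng = np.random.default_rng(7)
# Two well-mixed chains sampling the same distribution: PSRF near 1
good = rng.normal(3.0, 0.3, size=(2, 500))
# Two chains stuck in different regions: PSRF far above the 1.05-1.10 cutoffs
bad = np.stack([rng.normal(2.0, 0.3, 500), rng.normal(4.0, 0.3, 500)])
print(psrf(good), psrf(bad))
```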
Software packages often compute the PSRF values at regular intervals during the
burn-in period. To illustrate, Table 4.3 shows the PSRF value for each parameter at itera-
tions 20, 30, 40, 50, and 100 (following recommendations, I again compared the second
halves of each interval).

TABLE 4.3. Split-Chain PSRF Comparisons after 20, 30, 40, 50, and 100 Iterations

                     Comparison interval for PSRF computation
Parameter         11 to 20   16 to 30   21 to 40   26 to 50   51 to 100
β0                  1.04       1.03       1.04       1.00       1.08
β1 (PARSMOKE)       1.10       1.02       1.05       1.01       1.02
β2 (AGE)            1.18       1.00       1.00       1.04       1.00
β3 (INCOME)         1.01       1.09       1.07       1.00       1.00
σε²                 1.08       1.04       1.10       1.18       1.01

Table 4.3 highlights three important points. First, parameters converge at different
rates; the parental smoking and age slopes produce acceptable PSRF values almost
immediately, whereas other parameters require more iterations. Second,
PSRF values can increase or decrease from one interval to the next (e.g., the PSRF for β1
drops from 1.10 to 1.02, then it increases to 1.05). These oscillations tend to diminish
as the number of iterations increases, because the chain means are less susceptible to
large random shocks and outlier estimates. Increasing the number of chains also stabi-
lizes the PRSF estimates. Third, at the 100th iteration, the highest (worst) PSRF had not
dropped below the conservative threshold of 1.05, but all indices were acceptably low
by the 200th MCMC cycle. Considered as a whole, the numerical diagnostics reinforce
the conclusion from the trace plots, which is that the algorithm converges and mixes
thoroughly, well before the end of the 1,000-iteration burn-in period. I could reduce the
length of the burn-in interval if I wanted, but there is no compelling reason to do so.

4.10 MULTIVARIATE NORMAL DATA

The multivariate normal distribution played an important role in maximum likelihood estimation, and it appears prominently in Bayesian analyses and multiple imputa-
tion. This section uses the distribution as a backdrop for estimating a mean vector and
variance–­covariance matrix. As you will see, the concepts we’ve already established
readily generalize to multivariate data with virtually no modifications (although some
of the equations are messier). I use the employee data from the companion website to
provide a substantive context. The data set includes several workplace-­related variables
(e.g., work satisfaction, turnover intention, employee–­supervisor relationship quality)
for a sample of N = 630 employees. The illustration uses a 7-point work satisfaction rat-
ing (1 = extremely dissatisfied to 7 = extremely satisfied) and two composite scores that
measure employee empowerment and a construct known as leader–­member exchange
scale (the quality of an employee’s relationship with his or her supervisor). I treat work
satisfaction as a normally distributed variable, because it has a sufficient number of
response options and a symmetrical distribution (Rhemtulla et al., 2012). The Appendix
gives a description of the data set and variable definitions.
To tie the multivariate normal distribution back to earlier material, it is useful to
cast the analysis as three empty regression models. Using generic notation, the models
are as follows:
\[ \mathbf{Y}_i = \begin{pmatrix} \text{WORKSAT}_i \\ \text{EMPOWER}_i \\ \text{LMX}_i \end{pmatrix} = \begin{pmatrix} Y_{1i} \\ Y_{2i} \\ Y_{3i} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} + \begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \varepsilon_{3i} \end{pmatrix} = \boldsymbol{\mu} + \boldsymbol{\varepsilon} \tag{4.30} \]
\[ \mathbf{Y}_i \sim N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]

Recall that N3 denotes a three-­dimensional normal distribution, and the first and second
terms in parentheses are the mean vector and variance–­covariance matrix (the multi-
variate distribution’s parameters).

Probability Distribution and Likelihood Function


By now, you know that the posterior distribution is proportional to the product of the
prior distributions and the likelihood function. Symbolically, the distribution for this
example is

\[ f(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \text{data}) \propto f(\boldsymbol{\mu}) \times f(\boldsymbol{\Sigma}) \times f(\text{data} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \tag{4.31} \]

Recycling information from Chapter 2, the joint probability of N observations (or the
likelihood of the sample data) is the product of the individual contributions.
\[ f(\text{data} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-N \times V \times .5}\, |\boldsymbol{\Sigma}|^{-N \times .5} \prod_{i=1}^{N} \exp\!\left( -\tfrac{1}{2} (\mathbf{Y}_i - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{Y}_i - \boldsymbol{\mu}) \right) \tag{4.32} \]

The column vector Yi contains the V observations for participant i, μ is the correspond-
ing vector of population means, and Σ is a variance–­covariance matrix of the V vari-
ables. As before, the function on the left side of the expression can be read as “the
relative probability of the data given assumed values for the parameters.” With a bit of
algebra, the expression simplifies to

\[ f(\text{data} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-N \times .5} \exp\!\left( -\tfrac{1}{2} \operatorname{tr}\!\left( \mathbf{S} \boldsymbol{\Sigma}^{-1} \right) \right) \tag{4.33} \]

where S is the sum of squares and cross-­products matrix of the data computed at the
population means in μ (Hoff, 2009, pp. 110–111).
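Equations 4.32 and 4.33 differ only by the multiplicative constant dropped in the proportionality, which is easy to verify numerically. The sketch below (illustrative Python; assumes SciPy, with arbitrary parameter values) compares the summed log-density contributions to the log of the simplified trace kernel with that constant restored:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
N, V = 200, 3
mu = np.array([4.0, 28.6, 9.6])
Sigma = np.array([[ 1.6,  1.7, 1.6],
                  [ 1.7, 20.6, 5.4],
                  [ 1.6,  5.4, 9.2]])
Y = rng.multivariate_normal(mu, Sigma, size=N)

# Equation 4.32: sum of the N individual log-density contributions
ll_product_form = multivariate_normal(mu, Sigma).logpdf(Y).sum()

# Equation 4.33: log of the trace kernel, restoring the (2*pi) constant,
# with S the sum of squares and cross-products matrix computed at mu
dev = Y - mu
S = dev.T @ dev
ll_trace_form = (-(N * V / 2) * np.log(2 * np.pi)
                 - (N / 2) * np.log(np.linalg.det(Sigma))
                 - 0.5 * np.trace(S @ np.linalg.inv(Sigma)))

print(np.isclose(ll_product_form, ll_trace_form))  # True
```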

Prior Distributions
The simple strategy of specifying a flat distribution over the entire range of the popula-
tion mean readily extends to the mean vector from a multivariate analysis. This prior
is f(μ) ∝ 1. In the univariate case, a positively skewed inverse gamma distribution is a
conjugate prior for the variance. The inverse Wishart distribution is a multivariate gen-
eralization of the inverse gamma to a covariance matrix Σ with V variables. Visually, the
inverse Wishart looks like a three-dimensional skewed distribution like the one in Figure 4.18 (which shows the distribution for a simple bivariate example that is graphable in three dimensions).

FIGURE 4.18. Inverse Wishart distribution for a bivariate example with 25 degrees of freedom and variance–covariance matrix elements σ̂X² = 1, σ̂Y² = 1, and σ̂XY = .30.

Like other distributions we've encountered, the vertical height
gives the relative probability of the parameter values listed along the horizontal and
depth axes. Muthén and Asparouhov (2012) provide a useful appendix that describes
the inverse Wishart distribution, and I summarize some of its main features below.
Ignoring scaling terms, the inverse Wishart prior distribution is

\[ f(\boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-(df_0 + V + 1)/2} \exp\!\left( -\tfrac{1}{2} \operatorname{tr}\!\left( \mathbf{S}_0 \boldsymbol{\Sigma}^{-1} \right) \right) \tag{4.34} \]

where S0 and df0 (the hyperparameters) are prior estimates of the sum of squares and
cross-­products matrix and degrees of freedom, respectively. Roughly speaking, the
hyperparameters encode a prior guess about the population covariance matrix, and
the degrees of freedom parameter is essentially the number of imaginary data points
assigned to that matrix. The function on the left side is a height coordinate that summa-
rizes the relative probability of a particular combination of parameter values in Σ given


the “data” in S0 and df0.
Understanding how the hyperparameters work is useful, because it may be neces-
sary (or desirable) to modify these values (e.g., to conduct a sensitivity analysis that
considers whether the choice of prior meaningfully impacts the analysis results). To
begin, it is useful to note that the expected value of the inverse Wishart distribution is
a fraction that mimics the formula for a covariance matrix (i.e., a sum of squares and
cross-­products matrix divided by the degrees of freedom).
\[ E(\boldsymbol{\Sigma}) = \frac{\mathbf{S}_0}{df_0 - V - 1} \tag{4.35} \]
Thus, if you had access to an a priori estimate of the population covariance matrix (e.g.,
from a meta-­analysis or pilot data), setting S0 equal to Σ0 × (df0 – V – 1) would induce an
informative prior that centers the distribution at Σ0.
In contrast, the degrees of freedom parameter determines the spread or variance
along each dimension of Σ. To illustrate, the variance of the diagonal elements is
\[ \operatorname{var}(\Sigma_{vv}) = \frac{\mathbf{S}_{0vv}^2}{(df_0 - V - 1)^2 (df_0 - V - 3)} \tag{4.36} \]
where S02vv is the squared sum of squares value at the intersection of row and column
v. The equation highlights that the variance decreases (i.e., the distribution becomes
more peaked or informative) as the df0 in the denominator increases. This conclusion is
intuitive, because increasing the degrees of freedom parameter assigns more data points
worth of information to the prior. The distribution also becomes more informative as
the sum of squares in the numerator decreases.
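To make the hyperparameters concrete, the sketch below (illustrative Python; the prior guess Σ0 and the degrees of freedom values are hypothetical) builds the scale matrix that centers the prior at Σ0 via Equation 4.35 and evaluates Equation 4.36 to show that a larger df0 yields a more peaked, informative distribution:

```python
import numpy as np

V = 3
Sigma0 = np.array([[ 1.6,  1.7, 1.6],    # hypothetical a priori guess for
                   [ 1.7, 20.6, 5.4],    # the population covariance matrix
                   [ 1.6,  5.4, 9.2]])   # (e.g., from pilot data)

def iw_scale(Sigma0, df0, V):
    """Scale matrix that centers an inverse Wishart prior at Sigma0 (Eq. 4.35)."""
    return Sigma0 * (df0 - V - 1)

def iw_diag_variance(S0, df0, V):
    """Prior variance of the diagonal elements of Sigma (Eq. 4.36)."""
    return np.diag(S0) ** 2 / ((df0 - V - 1) ** 2 * (df0 - V - 3))

weak, strong = 10, 100                   # imaginary data points in the prior
S0_weak = iw_scale(Sigma0, weak, V)
S0_strong = iw_scale(Sigma0, strong, V)

# Both hyperparameter sets center the prior at the same expected value...
print(np.allclose(S0_weak / (weak - V - 1), Sigma0))  # True
# ...but the larger df0 shrinks the prior variance of every diagonal element
print(iw_diag_variance(S0_strong, strong, V)
      < iw_diag_variance(S0_weak, weak, V))
```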
The question remains, what values of the hyperparameters impart as little informa-
tion as possible? Assigning S0 = 0 and df0 = V – 1 gives the multivariate sibling of the Jef-
freys prior from Equation 4.11 (Gelman et al., 2014), and other common choices include
S0 = I (i.e., an identity matrix) and df0 = V, S0 = I and df0 = V + 1, and an improper prior
(its mean is undefined) with df0 = –V – 1 (Asparouhov & Muthén, 2010a; Chen, 2011;
Lunn, Jackson, Thomas, & Spiegelhalter, 2013). The hyperparameters can also be esti-
mated from the data (Casella, 2001; Darnieder, 2011; McNeish, 2016b). A limitation of the
inverse Wishart prior is that it can’t disentangle information about measures of spread and
correlation, as all elements of the covariance matrix are necessarily dependent in compli-
cated ways (Grimm, Ram, & Estabrook, 2016; Muthén & Asparouhov, 2012). While this
dependence ensures a positive definite covariance matrix (i.e., all variances are positive
and correlations range between ±1), studies have cautioned against using the prior with
small samples, as it can introduce bias (McNeish, 2016a, 2016b). An alternative specifi-
cation decomposes the covariance matrix into correlations and standard deviations (or
variances) and assigns distinct priors to each. The so-­called “separation strategy” (Bar-
nard, McCulloch, & Meng, 2000; Grimm et al., 2016) is one such approach, and modeling
correlations with phantom variables is another (Merkle & Rosseel, 2018). In practice,
there is no way to know which prior is best for a given situation, so it is often a good idea
to conduct a sensitivity analysis that examines whether the choice of prior meaningfully
impacts the results. The analysis example at the end of this section illustrates this idea.
Posterior Distribution, MCMC Algorithm, and Conditional Distributions
The posterior distribution of the mean vector and variance–­covariance matrix is a mul-
tivariate function that describes the relative probability of different combinations of
μ and Σ, given the data. Following established logic, MCMC deconstructs a complex
multiparameter problem into a series of simpler computational steps, each of which
estimates one parameter (or block of parameters) while treating the current values of
all others as known. The Gibbs sampler recipe for a multivariate analysis follows a two-
step recipe: Draw a mean vector from a multivariate normal distribution that conditions
on the current estimate of the covariance matrix, then sample a new covariance matrix
from an inverse Wishart distribution that conditions on updated means. The recipe
below summarizes the algorithmic steps.

Assign starting values to all parameters.


Do for t = 1 to T iterations.
> Estimate the mean vector conditional on the variance–­covariance matrix
from the prior iteration.
> Estimate the covariance matrix conditional on the updated mean vector.
Repeat.

The full conditional distributions for each estimation step are found by multiplying
the prior and the likelihood, then doing some tedious algebra to express that product as
a function of a single unknown. Specialized Bayesian texts provide additional details on
their derivations (e.g., Hoff, 2009, pp. 109–112). In this scenario, MCMC first estimates
the means by drawing a vector of random numbers from a multivariate normal distribu-
tion. The full conditional distribution is

\[ f(\boldsymbol{\mu} \mid \boldsymbol{\Sigma}, \text{data}) \propto N_V\!\left( \bar{\mathbf{Y}}, \frac{\boldsymbol{\Sigma}}{N} \right) \tag{4.37} \]
where \(N_V\) denotes a normal distribution with V dimensions or variables, \(\bar{\mathbf{Y}}\) is the vector
of arithmetic means computed from the sample data, and \(\boldsymbol{\Sigma}\) is a synthetic estimate of the
variance–covariance matrix from a prior MCMC step. Dividing \(\boldsymbol{\Sigma}\) by N gives the usual
expression for the covariance matrix of the means, the frequentist version of which has
squared standard errors on the diagonal.
Next, MCMC updates the variance–­covariance matrix by drawing a matrix of ran-
dom numbers from an inverse Wishart distribution. The full conditional distribution is

\[ f(\boldsymbol{\Sigma} \mid \boldsymbol{\mu}, \text{data}) \propto |\boldsymbol{\Sigma}|^{-(df_0 + N + V + 1)/2} \exp\!\left( -\tfrac{1}{2} \operatorname{tr}\!\left( (\mathbf{S}_0 + \mathbf{S}) \boldsymbol{\Sigma}^{-1} \right) \right) \tag{4.38} \]

where S is a sum of squares and cross-­products matrix that reflects variation and covari-
ation around the synthetic means from the preceding step. The shorthand notation for
the distribution function is as follows:
\[ f(\boldsymbol{\Sigma} \mid \boldsymbol{\mu}, \text{data}) \propto IW\!\left( df_0 + N, (\mathbf{S}_0 + \mathbf{S})^{-1} \right) \tag{4.39} \]

The degrees of freedom in the first term—the sum of the sample size and the number
of imaginary observations assigned to the prior—­determines the distribution’s center,
and the sum of squares and cross-­products matrix in the second term—also the sum of
prior information and information from the data—­determines the distribution’s spread.
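The two-step recipe and its full conditionals translate into a compact sampler. The sketch below (illustrative Python, not the companion website's R program) uses the multivariate Jeffreys-type prior (S0 = 0, df0 = V − 1) and SciPy's invwishart; note that SciPy's scale argument is the matrix S0 + S itself, whereas Equation 4.39 writes its inverse per the Wishart notational convention:

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_mean_cov(Y, df0, S0, T=1500, seed=0):
    """Two-step Gibbs sampler for a mean vector and covariance matrix
    (full conditional distributions in Equations 4.37-4.39)."""
    rng = np.random.default_rng(seed)
    N, V = Y.shape
    ybar = Y.mean(axis=0)
    Sigma = np.cov(Y, rowvar=False)              # starting value
    mu_draws, Sigma_draws = [], []
    for _ in range(T):
        # Step 1 (Eq. 4.37): draw the means from N_V(ybar, Sigma / N)
        mu = rng.multivariate_normal(ybar, Sigma / N)
        # Step 2 (Eqs. 4.38-4.39): draw Sigma from an inverse Wishart with
        # df0 + N degrees of freedom and scale matrix S0 + S, where S is the
        # sum of squares and cross-products matrix around the updated means
        dev = Y - mu
        Sigma = invwishart.rvs(df=df0 + N, scale=S0 + dev.T @ dev,
                               random_state=rng)
        mu_draws.append(mu)
        Sigma_draws.append(Sigma)
    return np.array(mu_draws), np.array(Sigma_draws)

rng = np.random.default_rng(4)
Y = rng.multivariate_normal([4.0, 9.6], [[1.6, 1.6], [1.6, 9.2]], size=630)
V = Y.shape[1]
mu_draws, Sigma_draws = gibbs_mean_cov(Y, df0=V - 1, S0=np.zeros((V, V)))
print(np.median(mu_draws[500:], axis=0))   # close to the sample means
```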

Analysis Example
Returning to the empty regression models in Equation 4.30, I use work satisfaction,
employee empowerment, and leader–­member exchange scale to illustrate Bayesian esti-
mation for a mean vector and variance–­covariance matrix. Estimation scripts are avail-
able on the companion website, including a custom R program for readers interested in
coding the MCMC algorithm by hand. To explore the influence of different prior dis-
tributions, I implemented the Wishart specifications described earlier and a separation
strategy that specifies distinct priors for variances and correlations (Merkle & Rosseel,
2018). Following earlier examples, I used the PSRF to determine the burn-in period, and
I based the final analyses on 10,000 MCMC iterations.
Table 4.4 gives Bayesian summaries of the means, standard deviations, variances
and covariances, and correlations (Table 2.6 shows the corresponding maximum likeli-
hood estimates). In the interest of space, Table 4.4 shows the results from an improper
inverse Wishart prior with S0 = 0 and df0 = –V – 1 (Asparouhov & Muthén, 2010a) and a
separation strategy (Merkle & Rosseel, 2018). For the inverse Wishart prior, I computed
the standard deviations and correlations as auxiliary functions of the estimated vari-
ances and covariances (e.g., a correlation is a covariance divided by the square root of the
product of two variances), whereas the correlations and variances were the estimated
parameters for the separate prior strategy. Table 4.4 shows that the choice of prior had no
influence on the variances and covariances (or correlations), as the posterior medians
and standard deviations were effectively identical. This won’t always be the case, but it
is here, because the sample size is very large relative to the prior degrees of freedom.

4.11 SUMMARY AND RECOMMENDED READINGS

Bayesian analyses have gained a strong foothold in social and behavioral science disci-
plines in the last decade or so (Andrews & Baguley, 2013; van de Schoot et al., 2017), and
this approach is now a viable alternative to likelihood-­based estimation. Like maximum
likelihood, the primary goal of a Bayesian analysis is to fit a model to the data and use
the resulting estimates to inform the substantive research questions. The examples in
this chapter highlight that Bayesian analyses often give results that are numerically
equivalent to those of maximum likelihood, although the interpretations of the esti-
mates and measures of uncertainty require a philosophical lens that views parameters
as random variables instead of fixed quantities. This framework is very different from
the frequentist approach, because it makes no reference to hypothetical estimates from
different samples of data.
TABLE 4.4. Posterior Summaries from Two Prior Distributions

                                        IW prior        Separate priors
Effect                                 Mdn     SD        Mdn     SD
Means
  Work Satisfaction                    3.99    0.05      3.99    0.05
  Empowerment                         28.61    0.18     28.62    0.18
  LMX                                  9.59    0.12      9.60    0.12
Standard deviations
  Work Satisfaction                    1.27    0.04      1.26    0.04
  Empowerment                          4.54    0.13      4.52    0.13
  LMX                                  3.04    0.09      3.02    0.09
Variances and covariances
  Work Satisfaction                    1.61    0.09      1.59    0.09
  Empowerment                         20.63    1.16     20.43    1.15
  LMX                                  9.21    0.52      9.13    0.52
  Work Satisfaction ↔ Empowerment      1.66    0.24
  Work Satisfaction ↔ LMX              1.63    0.17
  Empowerment ↔ LMX                    5.43    0.59
Correlations
  Work Satisfaction ↔ Empowerment       .29     .04       .28     .03
  Work Satisfaction ↔ LMX               .42     .03       .41     .03
  Empowerment ↔ LMX                     .39     .03       .39     .03

Note. IW, inverse Wishart; LMX, leader–member exchange.

Having established the major details behind estimation and inference, Chapter 5
applies Bayesian estimation to missing data problems. As you will see, everything from
this chapter carries over to missing data applications, where missing values are just one
more unknown quantity for MCMC to estimate. After updating the parameters using the
estimation steps from this chapter, each iteration concludes with the algorithm using
the updated parameter estimates to construct a model that imputes the missing values.
At that point, the data are complete, and the next iteration proceeds as if there were no
missing values. Finally, I recommend the following articles for readers who want addi-
tional details on topics from this chapter:

Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychol-
ogy. British Journal of Mathematical and Statistical Psychology, 66, 1–7.

Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. American Statistician, 46,
167–174.
Bayesian Estimation 187

Jackman, S. (2000). Estimation and inference via Bayesian simulation: An introduction to Mar-
kov chain Monte Carlo. American Journal of Political Science, 44, 375–404.

Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scien-
tists. Berlin: Springer.

van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-­Zwijnenburg, M., & Depaoli, S. (2017).
A systematic review of Bayesian articles in psychology: The last 25 years. Psychological
Methods, 22, 217–239.

Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., . . . Epskamp, S.
(2018). Bayesian inference for psychology: Part I. Theoretical advantages and practical
ramifications. Psychonomic Bulletin and Review, 25, 35–57.
5

Bayesian Estimation with Missing Data

5.1 CHAPTER OVERVIEW

This chapter shows how Bayesian analyses address missing data. Virtually everything
from Chapter 4 extends to the missing data context with little to no modification. Recall
that MCMC breaks a complex multivariate estimation problem into a series of sim-
pler steps that address one parameter (or block of similar parameters) at a time, while
treating other parameters as known constants. This part of MCMC’s “mathemagical”
machinery carries over with no changes. After updating the parameters, the algorithm
uses the newly minted estimates to construct model-­predicted distributions of the miss-
ing values, from which it samples imputations. At this point, the data are complete, and
the next iteration proceeds as if there were no missing values. These two major steps—­
update the parameters based on the filled-­in data, then impute the data based on the
current parameter values—­repeat for many iterations, just as shown in Chapter 4.
Because MCMC estimation doesn’t change, missing data imputation is the focus
of this chapter. I start by describing imputation for an outcome variable, after which I
extend the procedure to incomplete explanatory variables. There are at least two ways to
construct missing data distributions for predictors, both of which readily accommodate
interactions and curvilinear terms. The emergence of missing data-­handling methods
for interactive and nonlinear effects is an important recent innovation since the first
edition of this book (Bartlett, Seaman, White, & Carpenter, 2015; Enders, Du, & Keller,
2020; Erler et al., 2016; Goldstein, Carpenter, & Browne, 2014; Kim et al., 2015, 2018;
Lüdtke et al., 2020b; Zhang & Wang, 2017). As you will see, the methodology for treat-
ing nonlinearities is the Bayesian equivalent of the factored regression approach from
Chapter 3.
Bayesian analyses are a bridge connecting maximum likelihood to multiple imputa-
tion. On one side of that bridge is maximum likelihood, which extracts the parameter
estimates of interest directly from the observed data. The other side of the bridge is

multiple imputation, which creates and saves filled-­in data sets for later use. A Bayesian
analysis is like maximum likelihood in the sense that model parameters are the focus,
but the machinery that generates missing values is identical to multiple imputation. The
distinction between a Bayesian analysis and multiple imputation can get blurry, because
the latter co-opts the MCMC algorithms from this chapter and Chapter 6. For now, the
goal is to construct temporary imputations that service a particular analysis. The focus
shifts in Chapter 7, where the Bayesian machinery is a mathematical device that creates
suitable imputations for reanalysis in the frequentist framework.

5.2 IMPUTING AN INCOMPLETE OUTCOME VARIABLE

Imputation for an incomplete outcome variable is a good place to start, because the
procedure is relatively straightforward and doesn’t depend on the composition of the
analysis model (e.g., imputation is the same whether the analysis is a simple regression
or a complex model with interaction effects). In truth, there is no need to impute at all in
this situation, because classic regression models are known to produce good estimates
when missing values are restricted to the dependent variable and missingness is due to
predictors (Little, 1992; von Hippel, 2007). However, that limited scenario doesn’t arise
too often in practice, and we’ll generally need to impute outcomes.
To keep the discussion as straightforward as possible, I use a simple regression
model for the first part of the chapter.

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = E(Y_i \mid X_i) + \varepsilon_i \tag{5.1} \]
\[ Y_i \sim N_1\!\left( E(Y_i \mid X_i), \sigma^2_\varepsilon \right) \]
where E(Yi|Xi) is the predicted value for individual i (i.e., the expected value or mean of
Y given a particular X score), the tilde means “distributed as,” N1 denotes the univari-
ate normal distribution function (i.e., the probability distribution in Equation 4.7), and
the conditional mean and residual variance are the distribution’s two parameters. The
bottom row of the expression says that outcome scores are normally distributed around
a regression line with constant residual variation. As you will see, this normal curve
defines the distribution of missing values.
I use the employee data from the companion website to provide a substantive con-
text. The data set includes several workplace-­related variables (e.g., work satisfaction,
turnover intention, employee–­supervisor relationship quality) for a sample of N = 630
employees. The Appendix describes the data set and variable definitions. The simple
regression model features the leader–­member exchange scale (a construct measur-
ing the quality of an employee’s relationship with his or her supervisor) predicting an
employee’s sense of empowerment.

EMPOWERi = β0 + β1(LMXi) + εi    (5.2)

Both variables are incomplete, but I focus on the missing outcome scores for now.

The Posterior Distribution and MCMC Algorithm


Revisiting concepts from Chapter 4, the posterior distribution for a linear regression is
a multivariate function describing the relative probability of different combinations of
model parameters given the data. When a variable is incomplete, it appears as an addi-
tional unknown in the multivariate posterior distribution. The Gibbs sampler algorithm
estimates each unknown quantity sequentially by drawing random numbers from a
probability distribution that treats all other parameters as known constants. The algo-
rithm now follows the three-step recipe shown below.

Assign starting values to all parameters and missing values.


Do for t = 1 to T iterations.
> Estimate coefficients conditional on residual variance and imputations.
> Estimate the residual variance conditional on coefficients and imputations.
> Estimate missing values conditional on model parameters.
Repeat.

The first two steps condition on the missing values, which means that estimation
is carried out on the filled-­in data set from the prior iteration. In fact, these operations
are identical to the complete-­data estimation steps for linear regression; the MCMC
algorithm “estimates” regression coefficients by drawing a vector of random numbers
from the multivariate normal distribution in Equation 4.23, after which it updates the
residual variance by drawing a random number from the inverse gamma distribution
in Equation 4.24. The final MCMC step creates new imputations based on the updated
parameter values.
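To make the recipe concrete, the sketch below implements the three steps for a simple regression with missing outcome scores. It is illustrative Python rather than the book's software (the companion scripts are in R), the function name is hypothetical, and the conditional draws assume the standard flat-prior complete-data results for the coefficients and residual variance.

```python
import math
import random

def gibbs_regression_missing_y(x, y, n_iter=2000, burn_in=500, seed=1):
    """Three-step Gibbs sampler for y = b0 + b1*x + e with missing y's (None).

    Step 1 draws the coefficients given the residual variance and the current
    imputations, step 2 draws the residual variance given the coefficients,
    and step 3 replaces each missing y with a predicted value plus noise.
    """
    rng = random.Random(seed)
    n = len(x)
    mis = [i for i in range(n) if y[i] is None]
    obs = [v for v in y if v is not None]
    y = [sum(obs) / len(obs) if v is None else v for v in y]  # starting imputations
    sig2 = 1.0
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    draws = []
    for t in range(n_iter):
        # Step 1: coefficients from their conditional normal posterior (flat prior).
        ybar = sum(y) / n
        b1_hat = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
        b1 = rng.gauss(b1_hat, math.sqrt(sig2 / sxx))
        b0 = rng.gauss(ybar - b1 * xbar, math.sqrt(sig2 / n))
        # Step 2: residual variance from its inverse gamma conditional posterior.
        sse = sum((y[i] - b0 - b1 * x[i]) ** 2 for i in range(n))
        sig2 = 1.0 / rng.gammavariate(n / 2.0, 2.0 / sse)
        # Step 3: imputation = predicted value + random normal noise.
        for i in mis:
            y[i] = b0 + b1 * x[i] + rng.gauss(0.0, math.sqrt(sig2))
        if t >= burn_in:
            draws.append((b0, b1, sig2))
    return draws
```

The inverse gamma draw uses the fact that the reciprocal of a gamma variate with shape n/2 and scale 2/SSE follows the required inverse gamma distribution.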

Imputing Missing Values


A consequence of assuming a conditionally MAR process is that the focal regression
model completely determines the missing outcome scores. The distribution of the miss-
ing values is a conditional normal distribution with a predicted value and residual vari-
ance defining its center and spread, respectively.

f(Yi(mis) | β, σ²ε, Xi) = N1(E(Yi | Xi), σ²ε)    (5.3)

This density is sometimes called a posterior predictive distribution, because it generates predictions based on the posterior distributions of the model parameters. The equa-
tion says that missing outcome scores are normally distributed around the regression
line, and drawing an imputation is equivalent to computing a predicted value, then add-
ing a random noise term from a normal distribution. As you will see, this concept—an
imputation equals predicted value plus noise—is very general and applies to a wide range
of analysis models.
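In code, "predicted value plus noise" is essentially one line. The function below is a hypothetical Python illustration (not the book's software), with the parameter values supplied by the current MCMC iteration.

```python
import math
import random

def impute_outcome(x_i, b0, b1, resid_var, rng):
    """Draw one imputation from the conditional normal distribution of a
    missing outcome: a predicted value plus random normal noise."""
    predicted = b0 + b1 * x_i  # E(Y | X), a point on the regression line
    return predicted + rng.gauss(0.0, math.sqrt(resid_var))
```

Setting resid_var to zero collapses the draw to the predicted value itself, which makes clear that the noise term is what preserves residual variability in the filled-in data.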
To illustrate imputation more concretely, Figure 5.1 shows the distribution of plau-
sible empowerment imputations at three values of leader–member exchange. The contour rings convey the perspective of a drone hovering over the peak of the bivariate
normal population distribution, with smaller contours denoting higher elevation (and
vice versa). Candidate imputations fall exactly on the vertical hashmarks, but I added
horizontal jitter to emphasize that more scores are located at higher contours near the
regression line. The MCMC algorithm generates an imputation by randomly selecting a
value from the candidate scores along the vertical lines (technically, the algorithm draws
from a full distribution of replacement scores, not just those displayed in the graph).
Figure 5.2 shows a filled-in data set, with black crosshairs denoting cases with
imputed empowerment scores. The concentration of imputed values on the left side of
the graph is consistent with a conditionally MAR process where the probability of miss-
ing data increases as employee–supervisor relationship quality decreases (e.g., because
they are not as engaged or invested). A practical implication of the MAR assumption is that
all participants share the same model; that is, the distribution of empowerment is the
same for any two participants with the same leader–member exchange score, regardless
of the missing data pattern. This feature of imputation is clear in Figure 5.2, because
the filled-in observations blend in seamlessly along the same regression line as the com-
plete scores. Chapter 9 describes models that create imputations according to a MNAR
process.

FIGURE 5.1. Distribution of plausible empowerment imputations at three values of leader–member exchange. Candidate imputations fall exactly on vertical hashmarks, but I added horizontal jitter to emphasize that more scores are located near the regression line.

FIGURE 5.2. Filled-in data set from one iteration of the MCMC algorithm. The black cross-
hair symbols denote observations with imputed outcome scores.

5.3 LINEAR REGRESSION

Specifying a distribution for complete explanatory variables is unnecessary, but the situ-
ation changes with incomplete regressors, because their values also need to be sampled
from a distribution. Consistent with the maximum likelihood framework, Bayesian
structural equation modeling is one option (Kaplan & Depaoli, 2012; Merkle & Ros-
seel, 2018; Palomo, Dunson, & Bollen, 2007), as are factored regression models (Ibrahim
et al., 2002, 2005). The former generally foists a normal distribution on the predictors,
whereas the latter offers a more flexible specification that accommodates interactive or
nonlinear effects and mixed response types. I focus on factored regression models in
this section and return to multivariate normal data later in the chapter.
Expanding on the previous example, I use a multiple regression model with leader–
member exchange, leadership climate, and a gender dummy code (0 = female, 1 = male)
as predictors.

EMPOWERi = β0 + β1(LMXi) + β2(CLIMATEi) + β3(MALEi) + εi    (5.4)

Yi = β0 + β1X1i + β2X2i + β3X3i + εi = E(Yi | Xi) + εi

Yi ~ N1(E(Yi | Xi), σ²ε)
As a reminder, E(Yi|Xi) is a predicted value, the tilde means “distributed as,” N1 denotes
the univariate normal distribution function (i.e., the probability distribution in Equa-
tion 4.7), and the terms inside the parentheses define the distribution’s mean and vari-
ance. The bottom row says that the dependent variable is normally distributed around
points on a regression plane with constant variation.

Factored Regression Models (Sequential Specification)


Recall from Chapter 3 that factored regression models use the probability chain rule
to express a multivariate distribution as the product of univariate distributions, each
of which corresponds to a regression model. In the Bayesian framework, this strategy
is often referred to as a sequential specification (Erler et al., 2016, 2019; Ibrahim et al.,
2002, 2005; Lüdtke et al., 2020b), but it is nevertheless a factored regression. The factor-
ization for the employee empowerment example is as follows:

f(EMPOWER, LMX, CLIMATE, MALE) =
f(EMPOWER | LMX, CLIMATE, MALE) × f(LMX | CLIMATE, MALE) ×    (5.5)
f(CLIMATE | MALE) × f(MALE*)
The first term to the right of the equals sign corresponds to the focal analysis model
(the normal distribution from Equation 5.4), and the remaining terms are supporting
models that define the predictor distributions. I ultimately drop the rightmost term,
because the gender dummy code is complete and does not require a distribution. The
previous generic functions translate into the following linear regression models for the
incomplete predictors:

LMXi = γ01 + γ11(CLIMATEi) + γ21(MALEi) + r1i    (5.6)
CLIMATEi = γ02 + γ12(MALEi) + r2i
MALEi* = γ03 + r3i

The asterisk superscript in the bottom equation reflects a latent response variable for-
mulation for the dummy code, which I discuss in Chapter 6.
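To see how the factorization operates numerically, the sketch below evaluates the log of the product in Equation 5.5 (with the complete MALE term dropped) as a sum of three univariate normal regression log-densities. The parameter tuples are hypothetical stand-ins for the values held at one MCMC iteration, and Python is used here for illustration only.

```python
import math

def normal_logpdf(value, mean, var):
    """Log of the univariate normal density."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def factored_logdensity(empower, lmx, climate, male, focal, lmx_model, climate_model):
    """Log of f(EMPOWER | LMX, CLIMATE, MALE) * f(LMX | CLIMATE, MALE)
    * f(CLIMATE | MALE), with each parameter set given as
    (intercept, slopes..., residual variance)."""
    b0, b1, b2, b3, s2e = focal
    g01, g11, g21, s2r1 = lmx_model
    g02, g12, s2r2 = climate_model
    lp = normal_logpdf(empower, b0 + b1 * lmx + b2 * climate + b3 * male, s2e)
    lp += normal_logpdf(lmx, g01 + g11 * climate + g21 * male, s2r1)
    lp += normal_logpdf(climate, g02 + g12 * male, s2r2)
    return lp
```

Working on the log scale turns the chain-rule product into a sum, which is how such factored densities are usually accumulated in practice.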
An alternative model specification factorizes the joint distribution of the analysis
variables into a double product featuring the focal model and a multivariate distribu-
tion for the regressors. The factorization for the employee empowerment example is as
follows:
f(EMPOWER, LMX, CLIMATE, MALE) =
f(EMPOWER | LMX, CLIMATE, MALE) × f(LMX, CLIMATE, MALE*)    (5.7)
The second term is a trivariate normal distribution for the explanatory variables.
(LMXi, CLIMATEi, MALEi*)′ = (μ1, μ2, μ3)′ + (r1i, r2i, r3i)′    (5.8)

(LMXi, CLIMATEi, MALEi*)′ ~ N3(μ, Σ), where μ = (μ1, μ2, μ3)′ and Σ contains the
variances σ²1, σ²2, and σ²3 on its diagonal and the covariances σ12, σ13, and σ23
in its off-diagonal positions.
The asterisk superscript again reflects a latent response variable formulation, which I
discuss in Chapter 6.
For lack of a better term, I refer to this specification as a partially factored regres-
sion model or partially sequential specification. Figure 5.3 depicts the fully factored
and partially factored regression models as path diagrams. It suggests that the two
approaches are equivalent, because they simply swap out a straight arrow for a curved
arrow. The models are, in fact, exchangeable in this example, but that won’t always be
the case. I contrast the two strategies later in this section.
FIGURE 5.3. The path diagram in panel (a) corresponds to a factored regression or sequential specification, and the diagram in panel (b) is a partially factored regression that specifies a joint distribution for the regressors.

As an aside, an equivalent version of the partially factored model expresses the multivariate distribution in Equation 5.8 as a series of round-robin linear regression equations (Enders et al., 2020; Goldstein et al., 2014):
LMXi = μ1 + γ11(CLIMATEi − μ2) + γ21(MALEi* − μ3) + r1i    (5.9)
CLIMATEi = μ2 + γ12(MALEi* − μ3) + γ22(LMXi − μ1) + r2i
MALEi* = μ3 + γ13(LMXi − μ1) + γ23(CLIMATEi − μ2) + r3i

This parameterization, which is mostly an algorithmic tweak, leverages the well-known property that a multivariate normal distribution's parameters can be expressed as an
equivalent set of linear regression models (Arnold et al., 2001; Liu et al., 2014). In prac-
tice, there is no meaningful distinction between Equations 5.8 and 5.9, but the latter
provides a simple way to handle mixtures of categorical and numeric regressors.

Distribution of a Missing Regressor


Imputing an incomplete regressor is more complicated than imputing an outcome vari-
able, because the distribution of missing values must account for the predictor’s role
in two or more models. Considering the sequential specification, the leader–­member
exchange variable appears on the right side of the focal regression in Equation 5.4 and
on the left side of its own model in Equation 5.6. Accordingly, its distribution conditions
on the other analysis variables via two sets of model parameters. In a similar vein, the
leadership climate scores depend on three models, because they appear twice as a pre-
dictor and once as an outcome.
To illustrate the posterior predictive distribution of an incomplete predictor, con-
sider the leader–­member exchange variable (i.e., X1 using generic notation). The MCMC
algorithm draws imputations from the conditional distribution of the incomplete regres-
sor given all other analysis variables (e.g., the conditional distribution of leader–­member
exchange given employee empowerment, leadership climate, and gender). Applying
rules of probability reveals that this distribution is proportional to the product of two
univariate distributions, each of which aligns with a regression model.

f(X1 | Y, X2, X3) = f(Y, X1, X2, X3) / f(Y, X2, X3)
                  = [f(Y | X1, X2, X3) × f(X1 | X2, X3) × f(X2, X3)] / f(Y, X2, X3)    (5.10)
                  ∝ f(Y | X1, X2, X3) × f(X1 | X2, X3)

The expression says that the model-implied distribution of X1 is found by multiplying the two univariate distributions, and the "proportional to" symbol results from dropping terms that don't depend on X1. The composition of the expression makes conceptual sense—the distribution of X1 depends on the two models in which it appears.
Digging a bit deeper, the product to the right of the “proportional to” symbol cor-
responds to a pair of normal curve functions. Dropping unnecessary scaling terms and
substituting the kernels of the distributions gives the following expression:

f(Yi | X1i, X2i, X3i) × f(X1i | X2i, X3i) = N1(E(Yi | Xi), σ²ε) × N1(E(X1i | X2i, X3i), σ²r1) ∝
exp[−(Yi − (β0 + β1X1i + β2X2i + β3X3i))² / 2σ²ε] × exp[−(X1i − (γ01 + γ11X2i + γ21X3i))² / 2σ²r1]    (5.11)

Deriving the conditional distribution of X1 involves multiplying the two normal curve
functions and performing some straightforward but tedious algebra that combines the
component functions into a single distribution for X1. The result of that hard work is
a normal distribution with a complex mean and variance that depend on the focal and
regressor model parameters (θ and φ, respectively).

f(X1i(mis) | Yi, X2i, X3i) = N1(E(X1i | Yi, X2i, X3i), var(X1i | Yi, X2i, X3i))    (5.12)

E(X1i | Yi, X2i, X3i) = var(X1i | Yi, X2i, X3i) × [(γ01 + γ11X2i + γ21X3i)/σ²r1 + β1(Yi − β0 − β2X2i − β3X3i)/σ²ε]

var(X1i | Yi, X2i, X3i) = [1/σ²r1 + β1²/σ²ε]⁻¹

There is nothing especially intuitive about the equation, but its two-part structure
clearly shows that the conditional distribution of X1 depends on the two models in
which it appears.
FIGURE 5.4. Distribution of plausible leader–member exchange imputations at three values of empowerment (marginalizing over gender). Candidate imputations fall exactly on horizontal hashmarks, but I added vertical jitter to emphasize that more scores are located near the means.

To further illustrate, Figure 5.4 shows the distribution of plausible leader–member exchange imputations at three values of employee empowerment (the figure marginalizes over the other variables).
izes over the other variables). The expected values or conditional means are the solid
black dots, and the spread of the imputations conveys the conditional variance. Candi-
date imputations fall directly on a horizontal line, but I added vertical jitter to empha-
size that more scores are located near the means. Conceptually, MCMC generates an
imputation by randomly selecting a value from the candidate scores on each horizontal
line (technically, it uses the entire distribution with an infinite number of possibilities).
Although the composition of the distribution is more complex, the previous conclu-
sion—­an imputation equals a predicted value plus noise—still holds true.

Choosing a Model Specification


Returning to Figure 5.3, the factored (sequential) and partially factored (partially
sequential) specifications provide seemingly equivalent ways to introduce distributions
for incomplete explanatory variables. The sequential specification’s major strength is
that it makes no assumption about the multivariate distribution of the predictors. In
practical terms, this means that the univariate regressions linking predictors can mix
and match distributions and need not be linear models. Although the sequential speci-
fication accommodates a wider range of response types, the partially factored specifica-
tion also allows for binary, ordinal, and nominal predictors via a latent response variable
formulation (Bartlett et al., 2015; Enders et al., 2020; Goldstein et al., 2014).
The sequential specification also permits nonlinear associations among the predic-
tors, whereas such effects are strictly incompatible with a multivariate normal distri-
bution. For example, the top expression in Equation 5.6 could be a curvilinear regression
where leader–­member exchange is a quadratic function of leadership climate. On paper,
this seems like a substantial advantage, because ignoring nonlinear associations could
introduce bias into the focal model’s estimates (Lüdtke et al., 2020b), but researchers
could only realize these benefits by specifying the correct sequence of functional forms.
Lüdtke et al. provide several good practical recommendations for ordering the variables
and specifying a sequence of regression models.
Specifying a multivariate distribution for incomplete explanatory variables is par-
ticularly useful for analyses with centered predictors. To illustrate, consider the follow-
ing multiple regression:

Yi = β0 + β1(X1i − μ1) + β2(X2i − μ2) + β3(X3i − μ3) + εi    (5.13)

A partially factored specification is ideally suited for this analysis, because the grand
means are explicit model parameters that MCMC iteratively estimates (see Equations
5.8 and 5.9). This seemingly routine analysis is somewhat harder to implement with a
sequential specification, which requires the following regressions:

X1i = γ01 + γ11(X2i − γ02) + γ21(X3i − γ03) + r1i    (5.14)
X2i = γ02 + γ12(X3i − γ03) + r2i
X3i = γ03 + r3i

MCMC estimation typically estimates each model’s coefficients as a block (see Equa-
tion 4.23), but a more complex algorithm is needed here to account for the fact that two
intercept coefficients appear in multiple equations.

MCMC Algorithm
The estimation recipe below shows the MCMC algorithmic steps in their full generality
where all analysis variables could be missing:

Assign starting values to all parameters and missing values.


Do for t = 1 to T iterations.
> Estimate focal model coefficients conditional on its residual variance and
all imputations.
> Estimate the focal model residual variance conditional on its coefficients
and all imputations.
Do for k = 1 to K predictors.
> Estimate predictor model coefficients conditional on its residual vari-
ance and all imputations.
> Estimate predictor model residual variance conditional on its coeffi-
cients and all imputations.
Repeat.
> Estimate missing outcome scores conditional on focal model parameters.
Do for k = 1 to K predictors.
> Estimate missing predictor scores conditional on focal model param-
eters and at least one supporting model.
Repeat.
Repeat.

Fundamentally, both the fully factored and partially factored specifications (Equations
5.5 and 5.7, respectively) share the same algorithmic steps, and the primary difference
is the composition of the supporting regressor models. In either case, the focal model
alone determines the distribution of the missing outcome scores, and incomplete predic-
tors depend on two or more sets of model parameters.

Analysis Example
Continuing with the employee data, I applied Bayesian missing data handling to the lin-
ear regression model in Equation 5.4. The missing data rates for the employee empow-
erment and leader–­member exchange scales are approximately 16.2 and 4.1%, respec-
tively, and 9.5% of the leadership climate scores are missing. The gender dummy code is
complete. The potential scale reduction factor (Gelman & Rubin, 1992) diagnostic from
Chapter 4 indicated that the MCMC algorithm converged in fewer than 200 iterations,
so I continue using 11,000 total iterations with a conservative 1,000-iteration burn-in
period. Analysis scripts are available on the companion website, including a custom R program for readers interested in coding the algorithm by hand.

TABLE 5.1. Posterior Summary from the Multiple Regression

Variables        Mdn     SD     LCL     UCL
β0              18.24   1.05   16.19   20.26
β1 (LMX)         0.59   0.06    0.47    0.71
β2 (CLIMATE)     0.19   0.04    0.10    0.27
β3 (MALE)        1.80   0.34    1.14    2.47
σ²ε             14.54   0.92   12.89   16.47
R²                .26    .03     .20     .32

Note. LCL, lower credible limit; UCL, upper credible limit.
Table 5.1 gives posterior summaries of the analysis model parameters. The sequen-
tial and partially factored specifications gave identical estimates (to the third decimal),
so I report the latter. Furthermore, I omit the regressor model parameters, because they
are not the substantive focus. The interpretation of the regression parameters is the same
as a least squares or maximum likelihood analysis. For example, the leader–­member
exchange slope (Mdnβ1 = 0.59, SDβ1 = 0.06) indicates that a one-unit increase in supervi-
sor relationship quality increases employee empowerment by about 0.59, controlling
for other regressors. As you know, the posterior standard deviations are analogous to
frequentist standard errors in the sense that they quantify uncertainty about the param-
eters after analyzing the data, but the subjective definition of uncertainty doesn’t refer-
ence hypothetical estimates from different random samples. Applying null hypothesis-­
like logic, the population slope coefficients are unlikely equal to zero, because this null
value falls well outside the 95% credible intervals.

5.4 INTERACTION EFFECTS

The emergence of Bayesian missing data-­handling methods for interactive and nonlin-
ear effects is an important recent development (Bartlett et al., 2015; Enders et al., 2020;
Erler et al., 2016; Kim et al., 2015, 2018; Lüdtke et al., 2020b; Zhang & Wang, 2017).
Moderated regression models are ubiquitous analytic tools, particularly in the social
and behavioral sciences (Aiken & West, 1991; Cohen et al., 2002). A prototypical model
features a focal predictor X, a moderator variable M, the product of X and M, and one or
more covariates like Z below:

Yi = β0 + β1Xi + β2Mi + β3XiMi + β4Zi + εi = E(Yi | Xi, Mi, Xi × Mi, Zi) + εi    (5.15)

Yi ~ N1(E(Yi | Xi, Mi, Xi × Mi, Zi), σ²ε)
In this model, β1 is a conditional effect that reflects the influence of X when M equals
zero, and β2 is the corresponding conditional effect of M when X equals zero. The β3
coefficient is usually of particular interest, because it captures the change in the β1 slope
for a one-unit increase in M (i.e., the amount by which X’s influence on Y is moderated
by M).
Switching gears to a different substantive context, I use the chronic pain data
to illustrate a moderated regression analysis with an interaction effect. The data set
includes psychological correlates of pain severity (e.g., depression, pain interference
with daily life, perceived control) for a sample of N = 275 individuals with chronic
pain. The motivating question is whether gender moderates the influence of depression
on psychosocial disability, a construct capturing pain’s impact on emotional behaviors
such as psychological autonomy and communication, emotional stability, and so forth.
The moderated regression model is

DISABILITYi = β0 + β1(DEPRESSi − μ1) + β2(MALEi)    (5.16)
             + β3(DEPRESSi − μ1)(MALEi) + β4(PAINi) + εi

where DISABILITY and DEPRESS are scale scores measuring psychosocial disability and
depression, MALE is a gender dummy code (0 = female, 1 = male), and PAIN is a binary
severe pain indicator (0 = no, little, or moderate pain, 1 = severe pain). I centered depres-
sion scores at their grand mean to facilitate interpretation. The disability and depression
scores have 9.1 and 13.5% of their scores missing, respectively, and approximately 7.3%
of the binary pain ratings are missing. By extension, 13.5% of the sample is also missing
the product term.

Factored Regression Specification


The factored regression modeling framework readily accommodates interactive and
nonlinear effects; following ideas established in Section 3.8, the focal model changes,
but nearly everything else remains the same. For example, a sequential specification for
the psychosocial disability analysis applies the factorization below. Recall that listing
complete predictors last simplifies estimation, because these models can be dropped.
f(DISABILITY, DEPRESS, MALE, PAIN) =
f(DISABILITY | DEPRESS, MALE, DEPRESS × MALE, PAIN) ×    (5.17)
f(DEPRESS | PAIN, MALE) × f(PAIN* | MALE) × f(MALE*)
The first term to the right of the equals sign corresponds to the analysis model from
Equation 5.16. Importantly, the product is not a variable with its own distribution, but
rather a deterministic function of depression and gender, either of which could be miss-
ing. The regressor models in the next three terms translate into a linear regression for
depression, a probit (or logistic) model for the severe pain indicator, and an empty probit
(or logistic) model for the marginal distribution of gender (which I ultimately ignore,
because the variable is complete).

DEPRESSi = γ01 + γ11(PAINi) + γ21(MALEi) + r1i    (5.18)
PAINi* = γ02 + γ12(MALEi) + r2i
MALEi* = γ03 + r3i

The asterisk superscripts reflect a latent response variable formulation for the binary
variables, which I discuss in Chapter 6.
The partially factored specification has a two-part construction that comprises the
focal model and a multivariate distribution for the predictors (e.g., see Equation 5.7).
The trivariate normal distribution for this example is as follows:

(DEPRESSi, PAINi*, MALEi*)′ = (μ1, μ2, μ3)′ + (r1i, r2i, r3i)′    (5.19)

(DEPRESSi, PAINi*, MALEi*)′ ~ N3(μ, Σ), where μ = (μ1, μ2, μ3)′ and Σ contains the
variances σ²1, σ²2, and σ²3 on its diagonal and the covariances σ12, σ13, and σ23
in its off-diagonal positions.

The asterisk superscripts again reflect the latent response variable formulation described
in Chapter 6. As noted previously, an equivalent version of this specification expresses
the multivariate normal distribution as a series of round-robin linear regressions like
Equation 5.9 (Enders et al., 2020; Goldstein et al., 2014).

Distribution of a Missing Regressor


The distribution of an incomplete predictor again depends on every model in which it
appears, but the presence of the nonlinear term makes certain distributions even more
complex. To illustrate, consider the posterior predictive distribution of the depression
variable. The MCMC algorithm draws imputations from the conditional distribution of
the incomplete regressor given all other analysis variables (e.g., the conditional distri-
bution of depression given psychosocial disability, severe pain, and gender). Following
Equation 5.10, this distribution is proportional to the product of two univariate normal
distributions, each of which aligns with a regression model. Using generic notation, the
distribution of missing values is as follows:

f(X | Y, M, Z) ∝ f(Y | X, M, X × M, Z) × f(X | M, Z) =
N1(E(Yi | Xi, Mi, Xi × Mi, Zi), σ²ε) × N1(E(Xi | Mi, Zi), σ²r1)    (5.20)
Dropping unnecessary scaling terms and substituting the normal curve’s kernels into
the right side of the expression gives the following:
f(Yi | Xi, Mi, Xi × Mi, Zi) × f(Xi | Mi, Zi) ∝
exp[−(Yi − (β0 + β1Xi + β2Mi + β3XiMi + β4Zi))² / 2σ²ε] ×    (5.21)
exp[−(Xi − (γ01 + γ11Mi + γ21Zi))² / 2σ²r1]
Deriving the conditional distribution of X involves multiplying the two normal curve functions and performing algebra that combines the component functions into
a single distribution for X. The result is a normal distribution with two-part mean and
variance expressions that depend on the focal and regressor model parameters.

f(Xi(mis) | Yi, Mi, Zi) = N1(E(Xi | Yi, Mi, Zi), var(Xi | Yi, Mi, Zi))    (5.22)

E(Xi | Yi, Mi, Zi) = var(Xi | Yi, Mi, Zi) × [(γ01 + γ11Mi + γ21Zi)/σ²r1 + (β1 + β3Mi)(Yi − β0 − β2Mi − β4Zi)/σ²ε]

var(Xi | Yi, Mi, Zi) = [1/σ²r1 + (β1 + β3Mi)²/σ²ε]⁻¹
Kim et al. (2015) give a comparable expression for a partially factored specification that
assigns a multivariate normal distribution to the predictors.
Comparing the distribution above to the one from Equation 5.12, you’ll notice that
the parts of the mean and variance that depend on the focal model expand to incorporate
the interaction effect (e.g., X’s slope is replaced by its simple slope), and the contribution
of the covariate model remains the same. Although it isn't obvious from Equation 5.22,
drawing scores from this distribution yields imputations that are consistent with the
estimated interaction effect. As such, each time MCMC estimates the moderated regres-
sion from the filled-­in data, the product of the imputed X and M scores will preserve any
interaction effect in the data, because the posterior predictive distribution constructs
imputations that anticipate this multiplication. This doesn’t mean that imputation will
create an interaction where none exists; the procedure creates imputations that are con-
sistent with the estimated interaction effect, which could be 0.
Looking at the variance of the imputations, the interaction introduces heterosce-
dasticity, such that the distribution’s spread depends on a person’s moderator score
(i.e., each value of M gives a different variance). Applied to the chronic pain data, the
variance expression implies that the spread of the missing depression scores differs
for males and females. Plugging in estimates from the ensuing analysis gives variance
estimates of 33.47 and 26.83 for males and females, respectively. This result highlights
that nonlinearities induce differences in spread that are incompatible with a multivari-
ate normal distribution (i.e., Equation 5.22 is a mixture of normal distributions that
differ with respect to their spread). Classic maximum likelihood and multiple imputa-
tion approaches that assume multivariate normality (e.g., the so-­called “just-­another-­
variable” approach) do a poor job of approximating this heteroscedasticity and are prone
to substantial biases (Bartlett et al., 2015; Kim et al., 2018; Liu et al., 2014; Seaman et al.,
2012; von Hippel, 2009). A growing body of methodological research suggests that fac-
tored regression models and their Bayesian counterparts are superior options for model-
ing incomplete interaction effects (Bartlett et al., 2015; Enders et al., 2020; Erler et al.,
2016; Grund, Lüdtke, & Robitzsch, 2021; Kim et al., 2015, 2018; Lüdtke et al., 2020b;
Zhang & Wang, 2017).
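The heteroscedastic variance in Equation 5.22 is simple to compute. The sketch below is a hypothetical Python illustration: the coefficient values in the test echo the ensuing analysis, but the regressor-model residual variance (30.0) is a made-up placeholder, so the resulting numbers are not the 33.47 and 26.83 reported in the text.

```python
def missing_x_variance(m, b1, b3, resid_var_focal, resid_var_x):
    """Conditional variance of a missing X from Equation 5.22; the simple
    slope (b1 + b3 * m) makes the spread depend on the moderator score m."""
    simple_slope = b1 + b3 * m
    return 1.0 / (1.0 / resid_var_x + simple_slope ** 2 / resid_var_focal)
```

The intuition is that a stronger simple slope means the outcome carries more information about the missing predictor, which tightens the conditional distribution; groups with weaker simple slopes get more diffuse imputations.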

Analysis Example
Continuing with the chronic pain example, I applied Bayesian missing data handling to
the moderated regression model in Equation 5.16. As explained previously, a partially
factored specification that assigns a multivariate distribution to the predictors is ideally
suited for models with centered predictors, because MCMC iteratively estimates the
grand means. The potential scale reduction factors (Gelman & Rubin, 1992) from a pre-
liminary diagnostic run indicated that MCMC converged in fewer than 400 iterations,
so I continue using 11,000 total iterations with a conservative 1,000-iteration burn-in
period. Analysis scripts are available on the companion website.
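For readers curious about what this diagnostic computes, here is a minimal sketch of the basic Gelman–Rubin potential scale reduction factor (Python rather than R; modern software adds refinements such as split chains and rank normalization):

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for a (num_chains, num_draws) array of one
    parameter's sampled values. Values near 1.0 indicate convergence."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()   # W: average within-chain variance
    between = n * chain_means.var(ddof=1)        # B: between-chain variance
    pooled = (n - 1) / n * within + between / n  # pooled posterior variance estimate
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(7)
# Two well-mixed chains drawn from the same distribution -> R-hat near 1
good = rng.normal(size=(2, 1000))
# Two chains stuck in different regions -> R-hat well above 1
bad = np.stack([rng.normal(0, 1, 1000), rng.normal(3, 1, 1000)])
print(round(potential_scale_reduction(good), 2))
print(round(potential_scale_reduction(bad), 2))
```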
Table 5.2 summarizes the posterior distributions of the parameters, and Table 3.7
shows the corresponding maximum likelihood estimates from the factored regression
model. To get a better understanding of the interaction effect, Figure 5.5 uses the pos-
terior medians to plot the regression lines for males and females (averaging over the
severe pain indicator). Recall that lower-order terms are conditional effects that depend
on scaling; Mdnβ1 = 0.38 (SD = 0.06) is the effect of depression on psychosocial disabil-
ity for females (the solid line in the figure), and Mdnβ2 = –0.77 (SD = 0.57) is the gender
difference at the depression mean (the vertical distance between lines at a value of zero
on the horizontal axis). The interaction effect captures the slope difference for males.
The negative coefficient (Mdnβ3 = –0.24, SD = 0.09) indicates that the male depression
slope (the dashed line) was approximately 0.24 points lower than the female slope (i.e.,
the male slope is Mdnβ1 + Mdnβ3 = 0.38 – 0.24 = 0.14). The 95% credible interval for the
interaction does not include 0.
Researchers routinely probe interaction effects by computing the conditional effect
of the focal predictor at different levels of the moderator (i.e., simple slopes; Aiken &
West, 1991; Bauer & Curran, 2005). Following familiar procedures from ordinary least
squares regression, you could compute simple slopes by substituting the point estimates

TABLE 5.2. Posterior Summary from the Moderated Regression


Parameter Mdn SD LCL UCL
Focal analysis model
β0 21.63 0.39 20.86 22.39
β1 (DEPRESS) 0.38 0.06 0.26 0.50
β2 (MALE) –0.77 0.57 –1.89 0.35
β3 (DEPRESS)(MALE) –0.24 0.09 –0.42 –0.06
β4 (PAIN) 1.92 0.62 0.72 3.15
σε2 16.83 1.60 14.12 20.42
R2   .23 .05   .14   .32

Conditional effects (simple slopes by gender)


βFemale 0.38 0.06 0.26 0.50
βMale 0.14 0.07 0.00 0.28

Note. LCL, lower credible limit; UCL, upper credible limit.


FIGURE 5.5. Simple slopes (conditional effects) for males and females.

(posterior medians) and dummy codes into the regression equation, but this approach
doesn’t yield posterior standard deviations or credible intervals. A better strategy is to
define the conditional effects as auxiliary parameters that depend on the focal model
parameters from each MCMC iteration (Keller & Enders, 2021). The bottom panel of
Table 5.2 summarizes the posterior distributions of the depression slopes for males and
females. The posterior medians define the slopes of the lines in Figure 5.5.
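The computation behind the bottom panel of Table 5.2 can be sketched as follows (Python rather than R). The draws below are simulated stand-ins for saved MCMC output; real posterior draws of β1 and β3 are correlated, so the male-slope summaries here will not reproduce the table exactly:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical stand-ins for saved posterior draws of beta1 (the female
# depression slope) and beta3 (the interaction); real draws come from the
# focal model at each retained MCMC iteration.
beta1 = rng.normal(0.38, 0.06, size=10_000)
beta3 = rng.normal(-0.24, 0.09, size=10_000)

# Conditional effects defined as auxiliary parameters at every iteration
slope_female = beta1
slope_male = beta1 + beta3

for label, draws in (("female", slope_female), ("male", slope_male)):
    lcl, ucl = np.percentile(draws, [2.5, 97.5])
    print(f"{label}: Mdn = {np.median(draws):.2f}, SD = {draws.std():.2f}, "
          f"95% CrI = [{lcl:.2f}, {ucl:.2f}]")
```

Because the conditional effects are recomputed from the focal coefficients at every iteration, their posterior summaries automatically reflect the joint uncertainty in β1 and β3.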

5.5 INSPECTING IMPUTATIONS

Most methods in this book leverage the normal distribution in important ways; Bayes-
ian estimation makes this dependence explicit by sampling imputations from normal
curves, and maximum likelihood estimation similarly intuits the location of missing
values by assuming they are normal. Of course, the normal distribution is often a rough
approximation for real data where variables are asymmetric and/or kurtotic. Using the
normal curve for missing data handling is fine in many situations, but misspecifications
can introduce bias if the data diverge too much from this ideal (some estimands are
more robust than others).
Bayesian estimation is particularly useful for evaluating the impact of non-­normality,
because it produces explicit estimates of the missing values. Graphing imputations next
to the observed data can provide a window into an estimator’s inner machinery, as
severe misspecifications can produce large numbers of out-of-range or implausible val-
ues (e.g., negative imputes for a strictly positive variable). Maximum likelihood estimation
is a bit more of a black box in this regard, because it does the same thing—intuits
that missing values extend to a range of implausible score values—without producing
explicit evidence of its assumptions.
Returning to the moderated regression analysis, the observed depression scores
are positively skewed and somewhat platykurtic (skewness = 0.60 and excess kurtosis
= –0.75). To illustrate the impact of sampling normally distributed imputations, I saved
the filled-in data from the final iteration of 10 different MCMC chains. Figure 5.6 shows
overlaid histograms with the observed data as gray bars and the missing values as white
bars with a kernel density function (the graph reflects a stacked data set with all imputa-
tions in the same file). As you can see, the observed data are skewed with scores rang-
ing from 7 to 28, whereas the imputations follow a symmetric distribution that extends
from –5.39 to 32.06. The imputed data are essentially a weighted mixture of a normal
distribution and a skewed distribution, and about 14.6% of the imputed values fall below
the lowest possible score of 7 (about 2% of the imputations are negative).
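Screening imputations for out-of-range values is straightforward once the filled-in data are saved. A small sketch (Python; the mean and standard deviation below are made-up values that roughly mimic the depression scale):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the stacked imputed values: the mean and SD are
# made-up numbers chosen to roughly mimic the depression scale, whose
# observed scores range from 7 to 28.
imputations = rng.normal(loc=14.0, scale=6.5, size=20_000)

lowest_possible = 7.0
below_min = float(np.mean(imputations < lowest_possible))
negative = float(np.mean(imputations < 0.0))
print(f"below minimum: {below_min:.1%}, negative: {negative:.1%}")
```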
Implausible score values clearly offend our aesthetic sensibilities, but out-of-range
imputations don't necessarily invalidate results or translate into biased estimates;
computer simulation studies show that a normal imputation model can work surpris-
ingly well, especially with means and regression coefficients (Demirtas, Freels, & Yucel,
2008; Lee & Carlin, 2017; von Hippel, 2013; Yuan et al., 2012). To underscore this point,
I reran the analysis applying the Yeo–Johnson power transformation (Lüdtke et al.,
2020b; Yeo & Johnson, 2000) to the depression scale.

FIGURE 5.6. Overlaid histograms with the observed data as gray bars and the missing values
as white bars with a kernel density function. The observed data are positively skewed with scores
ranging from 7 to 28, whereas the imputations follow a symmetrical distribution that extends
from –5.39 to 32.06.

FIGURE 5.7. Overlaid histograms with the observed data as gray bars and the missing values
as white bars with a kernel density function. The Yeo–Johnson imputations follow a skewed
distribution that mimics the shape of the observed data.

As described in Chapter 10, the Yeo–Johnson procedure estimates the shape of the data as
MCMC iterates, and it generally creates imputations that closely match the observed-data
distribution. The overlaid
histograms in Figure 5.7 show that the resulting imputations were skewed and more
like the observed scores. However, you might be surprised to find out that the analysis
results were indistinguishable from those in Table 5.2.
On balance, filling in the skewed depression data with normal imputes doesn’t
appear to be problematic, despite the relatively large proportion of out-of-range values. In
my experience, this is often the case. In general, the impact of applying normal imputa-
tions to non-normal data depends on the amount of missing data, and misspecifications
are more likely to introduce bias if the skewed variable has a very high missing data rate.
Because there is no analogue to robust standard errors in the Bayesian framework, the
Yeo–Johnson transformation that I previewed here is a potentially important tool for mod-
eling non-normal missing data, and it has shown great promise when paired with a fac-
tored regression specification (Lüdtke et al., 2020b). Inspecting imputations with simple
frequency distribution tables or graphs such as Figures 5.6 and 5.7 can identify potential
candidates for the procedure, which I describe in more detail later in Section 10.3.
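As a preview of Section 10.3, the following sketch applies the Yeo–Johnson transformation (Yeo & Johnson, 2000) to simulated skewed scores (Python rather than R). Estimating the power parameter by minimizing skewness over a grid is a crude stand-in for the MCMC-based estimation described in the text:

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transformation for a fixed power parameter lmbda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    neg = ~pos
    if lmbda != 0:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    if lmbda != 2:
        out[neg] = -(((-x[neg] + 1.0) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda))
    else:
        out[neg] = -np.log1p(-x[neg])
    return out

def skewness(x):
    z = (x - np.mean(x)) / np.std(x)
    return float(np.mean(z ** 3))

rng = np.random.default_rng(5)
scores = rng.gamma(shape=3.0, scale=3.0, size=2_000) + 7.0  # positively skewed

# Crude stand-in for the MCMC-based estimation described in the text: pick
# the power parameter that makes the transformed scores most symmetric.
grid = np.linspace(-2.0, 2.0, 81)
best = min(grid, key=lambda l: abs(skewness(yeo_johnson(scores, l))))
print(round(skewness(scores), 2), round(float(best), 2),
      round(skewness(yeo_johnson(scores, best)), 2))
```

The transformed scores are much closer to symmetric than the raw scores, which is the property that lets normal-theory machinery operate on the transformed metric.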

5.6 THE METROPOLIS–HASTINGS ALGORITHM

A recurring theme of this chapter is that the conditional distribution of an incomplete
explanatory variable is the product of two or more component distributions. By now
you probably have an appreciation for how complicated these functions can be, even
when the analysis model is relatively simple; not surprisingly, the distributions get even
more complex with additional predictors or nonlinear effects (Levy & Enders, 2021). An
alternative imputation strategy uses a Metropolis–­Hastings algorithm (Gilks, Richard-
son, & Spiegelhalter, 1996; Hastings, 1970) to approximate these complicated functions
without the need for derivations. This algorithm is a very general and powerful tool for
Bayesian analyses, and it will resurface throughout the remainder of the book.
At its core, the Metropolis–­Hastings algorithm performs the same task as the
Gibbs sampler: It draws random numbers from a probability distribution. However,
the algorithm is particularly adept at sampling from complex functions like Equation
5.22, because it works with the simpler component distributions (e.g., a pair of normal
curves). The Metropolis–­Hastings algorithm is far more than a convenience feature, as
there are situations in which the product of two distributions is prohibitively complex to
derive or doesn’t have a known form. The curvilinear regression model in the next sec-
tion is one such example (Lüdtke et al., 2020b), and there are many others (e.g., analyses
that use nonconjugate prior distributions).

Moderated Regression Revisited


To explore the Metropolis–­Hastings algorithm in a context we’ve already encountered,
reconsider the moderated regression model from Equation 5.16. We already know that
the conditional distribution of X (e.g., depression) is a normal curve, and Equation 5.22
gives the exact form of the function. Graphing the distribution helps convey the logic of
the Metropolis sampler, so I use the parameter estimates from Table 5.2 for illustration.
To begin, suppose that we need to impute X for a participant with M = 0, Z = 0, and Y =
25 (i.e., a female with little or no chronic pain who scores well above the dependent vari-
able’s mean). The solid curve in Figure 5.8 shows the analytic distribution of replace-
ment values for this hypothetical participant. To emphasize, we happen to know the
exact distribution of the missing values in this case, but we often won’t have an equation
that describes the shape of the function from which we need to draw (e.g., because it
is intractable, or we aren’t motivated enough to derive it). For this reason, I refer to the
normal curve in the figure as the target distribution or target function.
The Metropolis sampler works with target distribution’s individual components,
which in this case are normal curves induced by a pair of regression models. Substitut-
ing parameters and score values into the kernels from Equation 5.21 returns the height
of each normal curve at a particular value of Y or X (i.e., the relative probability of each
score). By extension, the product of the relative probabilities determines the height of
the target function for that combination of inputs.
Ignoring how I got the value for a moment, suppose that the participant begins
iteration t with an imputed score of X(old) = 22. The “old” subscript is needed to differ-
entiate this value from the new imputation we are about to consider. The starting point
for the Metropolis algorithm is to identify X’s current location in the target function, as
this will help evaluate the quality of the next imputation. Substituting parameters, data
values, and current imputation into both parts of Equation 5.21 gives the height of the
target function at X = 22. Figure 5.8 shows the current imputation’s vertical coordinates
FIGURE 5.8. The solid line shows the target distribution of missing values, and the dashed
curve is a normal proposal distribution. The black circle and the white circle denote the current
and candidate imputations, respectively. Relative to the current value, the candidate imputation
is at a higher elevation on the target distribution.

as a solid circle. To attach some numbers to the example, I used R’s normal distribution
function to compute the likelihood (relative probability) from each normal distribution.
Multiplying these values as follows gives the height of the target distribution at the cur-
rent imputation (the R density function retains the scaling factors that Equation 5.21 omits):

f(Yi | Xi(old), Mi, Xi(old) × Mi, Zi) × f(Xi(old) | Mi, Zi) = 0.0385 × 0.0315 = 0.0012 (5.23)

The algorithm can’t sample a new imputation directly from the target function with-
out an equation that defines its exact shape (again, we typically won’t have that). The idea
behind the Metropolis step is to sample a candidate imputation from a simple distribu-
tion known as a proposal distribution or jumping distribution, then determine whether
it is a good match to the unknown target function. The candidate value becomes the new
imputation if it has a high probability of originating from the target distribution. Other-
wise, if the candidate is a bad match, the algorithm discards it and uses the participant’s
current imputation for another iteration. For imputation, the proposal distribution is
just a normal curve centered at the current imputation (as explained later, the variance
is fixed at a value that leads the algorithm to accept new imputations at some optimal
rate). The dashed curve in Figure 5.8 shows a normal proposal distribution centered at
the current imputation (the black dot). Again, the current imputation is the jumping-­off
point for evaluating an updated replacement value. As a minor clarification, the proce-
dure I’m describing is technically a Metropolis algorithm, because it uses a symmetrical
proposal distribution, whereas a Metropolis–­Hastings algorithm uses an asymmetrical
jumping distribution (Gelman et al., 2014; Lynch, 2007).
For the sake of illustration, suppose that the algorithm draws a random number
from the proposal distribution and gets a value of X(new) = 19 as a candidate imputation.
At this point, the algorithm has proposed a jump from its current position at X(old) = 22
to a new position four points lower on the horizontal axis. The next step is to evaluate
whether the proposed jump is a good one. To do this, we need the candidate imputa-
tion’s location in the target function. Substituting the parameters, data values, and can-
didate imputation into both parts of Equation 5.21 gives the height of the target function
at X = 19. Figure 5.8 shows this relative probability as a solid white circle. I again used
R’s normal distribution function to compute the height of each normal distribution at
this new score, and I multiplied these quantities to get the corresponding height of the
target function.

f(Yi | Xi(new), Mi, Xi(new) × Mi, Zi) × f(Xi(new) | Mi, Zi) = 0.0540 × 0.0518 = 0.0027 (5.24)

Notice that the candidate imputation’s relative probability is more than twice as large
as that of the current imputation (i.e., 0.0027 vs. 0.0012), which means that its vertical
elevation on the target function is that much higher as well.
Figure 5.8 shows that the proposed jump from X(old) = 22 to X(new) = 19 moves to a
higher elevation on the target distribution. This change suggests that the proposed jump
is a very good one, because the candidate imputation is in a more populated region of
the distribution. As such, we want to accept the candidate and assign it as the current
imputation for the next MCMC iteration. More formally, the relative height of the tar-
get function evaluated at the candidate and current values is known as the importance
ratio. Forming the fraction of Equation 5.24 over Equation 5.23 gives the following ratio:

IR =
( ) (
f Yi | X i( new ) , Mi , X i( new ) × Mi , Zi × f X i( new ) | Mi , Zi )
(
f Yi | X i( old ) , Mi , X i( old ) × M ,Z )× f (X (
i i i old ) | M i , Zi ) (5.25)
0.0540 × 0.0518 0.0027
= = = 2.31
0.0385 × 0.0315 0.0012
This value agrees with Figure 5.8, which shows that the white circle (the candidate
imputation) is more than twice as high in elevation as the black circle (the current impu-
tation). An importance ratio greater than one implies that the proposed jump should
automatically be accepted, because the candidate imputation is in a more populated
region of the target distribution.
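The arithmetic in Equations 5.23 through 5.25 is easy to verify from the reported density values:

```python
# Reproducing the importance ratio in Equation 5.25 from the four density
# values reported in the text: the heights of the two normal components at
# the candidate (X = 19) and current (X = 22) imputations.
candidate_height = 0.0540 * 0.0518  # f(Y | X_new, ...) * f(X_new | ...)
current_height = 0.0385 * 0.0315    # f(Y | X_old, ...) * f(X_old | ...)
importance_ratio = candidate_height / current_height
print(round(importance_ratio, 2))  # 2.31, matching Equation 5.25
```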
Because the proposal distribution is symmetrical, it is just as likely for the candi-
date imputation to move to a lower elevation on the target function. To illustrate what
happens in that case, suppose that the algorithm instead draws a random number from
the proposal distribution and gets X(new) = 24 as a candidate imputation. Figure 5.9 also
shows the current and candidate values as black and white circles, respectively. The
FIGURE 5.9. The solid line shows the target distribution of missing values, and the dashed
curve is a normal proposal distribution. The black circle and the white circle denote the current
and candidate imputations, respectively. Relative to the current value, the candidate imputation
is at a lower elevation on the target distribution.

algorithm has now proposed a jump from its current position at X(old) = 22 to a new posi-
tion two points higher on the horizontal axis. Importantly, this jump moves the can-
didate imputation to a lower elevation on the target distribution. We still want to draw
imputations from this region of the distribution, just not as frequently. This is where
the importance ratio comes into play. I again used the R normal distribution function to
compute the importance ratio, which is now 0.4744.

IR = [f(Yi | Xi(new), Mi, Xi(new) × Mi, Zi) × f(Xi(new) | Mi, Zi)] /
[f(Yi | Xi(old), Mi, Xi(old) × Mi, Zi) × f(Xi(old) | Mi, Zi)] (5.26)
= (0.0294 × 0.0195) / (0.0385 × 0.0315) = 0.0006 / 0.0012 = 0.4744
You can visually verify the importance ratio by noting that the white circle's elevation is
about 50% as high as the black circle's. Although the relative probabilities in the numera-
tor and denominator of the ratio don’t have an absolute interpretation, they do have
a relative one—the candidate imputation is about 47% as likely as the current value.
As such, the probability of accepting the jump to a lower elevation is 0.47. To decide
whether to keep the candidate value, the algorithm generates a random number from
a binomial distribution with a 47% success rate, which is akin to tossing a biased coin
with a 47% chance of turning up heads. If the random draw is a head (i.e., a “success”),
then the candidate imputation becomes the new imputation for the next MCMC itera-
tion. Otherwise, the participant’s current imputation is used for another iteration.
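Putting the pieces together, a single Metropolis step for one participant's missing value can be sketched as follows (Python rather than the R program on the companion website; the target function uses made-up parameter values standing in for the estimates at the current iteration):

```python
import math
import random

def normal_pdf(x, mean, var):
    """Normal density (scaling factor included, unlike the kernels in the text)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def target_height(x, y=25.0):
    """Hypothetical target for one person's missing X: the focal-model density
    f(Y | X) times the predictor density f(X), with made-up parameter values."""
    return normal_pdf(y, 21.6 + 0.38 * x, 16.8) * normal_pdf(x, 13.0, 30.0)

def metropolis_step(x_old, proposal_sd, rng):
    """Propose from a normal centered at the current imputation, form the
    importance ratio, and accept the jump with probability min(1, IR)."""
    x_new = rng.gauss(x_old, proposal_sd)
    importance_ratio = target_height(x_new) / target_height(x_old)
    if importance_ratio >= 1.0 or rng.random() < importance_ratio:
        return x_new   # candidate becomes the new imputation
    return x_old       # keep the current imputation for another iteration

rng = random.Random(42)
x, draws = 22.0, []
for _ in range(20_000):
    x = metropolis_step(x, proposal_sd=5.0, rng=rng)
    draws.append(x)
print(round(sum(draws) / len(draws), 1))  # mean of the sampled imputations
```

Repeating the step many times produces a chain of imputations whose long-run distribution matches the target function, even though the sampler only ever evaluates the two component densities.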
There is one final detail to address. I previously mentioned that the proposal dis-
tribution’s variance is fixed at some predetermined value. It ends up that the spread of
the jumping distribution largely determines the overall rate at which candidate imputa-
tions are accepted; increasing the variance decreases the overall acceptance rate, and
decreasing the variance increases acceptance rates. Although recommendations vary
from one author to the next, common rules of thumb suggest that acceptance rates
between 25 and 50% are ideal (Gelman et al., 2014; Johnson & Albert, 1999; Lynch,
2007). In practice, software programs often start with a preliminary guess about the
variance and then “tune” the parameter periodically by making upward or downward
adjustments to achieve the desired probability. I use a simple tuning scheme in the R
program on the companion website, and dedicated Bayesian texts describe more sophis-
ticated approaches (e.g., Gelman et al., 2014).
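A tuning scheme of the kind described here can be as simple as nudging the proposal standard deviation after each batch of iterations (a common heuristic, sketched in Python; the 10% adjustment factor is arbitrary):

```python
def tune_proposal_sd(sd, accept_rate, low=0.25, high=0.50):
    """Simple periodic tuning rule: widen the proposal when acceptance is too
    high, shrink it when acceptance is too low. The 25-50% window follows the
    rules of thumb cited in the text."""
    if accept_rate > high:
        return sd * 1.1  # jumps too timid: accepted too often, so widen
    if accept_rate < low:
        return sd * 0.9  # jumps too bold: rejected too often, so shrink
    return sd

# Usage: after every batch of, say, 100 iterations, recompute that batch's
# acceptance rate and adjust the proposal standard deviation.
sd = 1.0
for accept_rate in [0.80, 0.70, 0.55, 0.40]:
    sd = tune_proposal_sd(sd, accept_rate)
print(round(sd, 3))
```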

5.7 CURVILINEAR EFFECTS

The factored regression approach readily accommodates other types of nonlinear terms.
Curvilinear regression models with polynomial terms and incomplete predictors are an
important example. To illustrate, consider a prototypical polynomial regression model
that features a squared or quadratic term for X (i.e., the interaction of X with itself).

Yi = β0 + β1Xi + β2Xi² + εi (5.27)
εi ~ N1(0, σε²)
As in a moderated regression analysis, β1 is a conditional effect that captures the influ-
ence of X when X itself equals 0 (Aiken & West, 1991; Cohen et al., 2002). The β2 coef-
ficient is of particular interest, because it captures acceleration or deceleration (i.e., cur-
vature) in the trend line. For example, if β1 and β2 are both positive, the influence of X
on Y becomes more positive as X increases, whereas a positive β1 and a negative β2 imply
that X’s influence diminishes as X increases.
To provide a substantive context, I use the math achievement data set from the
companion website that includes pretest and posttest math scores and several academic-­
related variables (e.g., math self-­efficacy, anxiety, standardized test scores, sociodemo-
graphic variables) for a sample of N = 250 students. The literature suggests that anxi-
ety could have a curvilinear relation with math performance, such that the negative
influence of anxiety on achievement worsens as anxiety increases. The following model
accommodates this nonlinearity while controlling for a binary indicator that measures
whether a student is eligible for free or reduced-­priced lunch (0 = no assistance, 1 = eli-
gible for free or reduced-­price lunch), math pretest scores, and a gender dummy code (0 =
female, 1 = male):
MATHPOSTi = β0 + β1(ANXIETYi − μ1) + β2(ANXIETYi − μ1)² (5.28)
+ β3(FRLUNCHi) + β4(MATHPREi) + β5(MALEi)
Anxiety scores are centered at their grand mean to facilitate interpretation. Approxi-
mately 16.8% of the posttest math scores and 8.8% of the math anxiety ratings are miss-
ing, as are 5.2% of the lunch assistance indicator codes.

Factored Regression Specification


The factored regression (sequential) specification for a curvilinear regression mimics
the setup for an interaction effect. The joint distribution of the five analysis variables
factors into the following product of univariate distributions:

f(MATHPOST, ANXIETY, FRLUNCH, MATHPRE, MALE) =
f(MATHPOST | ANXIETY, ANXIETY², FRLUNCH, MATHPRE, MALE) × (5.29)
f(ANXIETY | FRLUNCH, MATHPRE, MALE) ×
f(FRLUNCH* | MATHPRE, MALE) × f(MATHPRE | MALE) × f(MALE*)
The first term to the right of the equals sign is the normal distribution induced by the
curvilinear regression, and the remaining terms are supporting models for the incom-
plete predictors. Importantly, the squared term is not a variable with its own distribu-
tion, but rather a deterministic function of the incomplete anxiety scores. The regressor
models translate into a linear regression for anxiety, a probit (or logistic) model for the
lunch assistance indicator, a linear regression for math pretest scores, and an empty
probit (or logistic) model for the marginal distribution of gender.

ANXIETYi = γ01 + γ11(FRLUNCHi) + γ21(MATHPREi) + γ31(MALEi) + r1i (5.30)
FRLUNCHi* = γ02 + γ12(MATHPREi) + γ22(MALEi) + r2i
MATHPREi = γ03 + γ13(MALEi) + r3i
MALEi* = γ04 + r4i

I ultimately ignore the bottom two equations, because these variables are complete and
do not require a model. As before, the asterisk superscripts reflect a latent response
variable formulation for the binary variables, which I discuss in Chapter 6. The partially
factored specification replaces the univariate regression with a multivariate normal dis-
tribution (or an equivalent set of round-robin regressions), as shown in Equations 5.8
and 5.9.
Thus far, the conditional distributions of incomplete explanatory variables have
been normal curves (complicated curves, but normal nonetheless). With enough moti-
vation and effort, we could use the factorization as a recipe for deriving an exact equa-
tion for the distribution (e.g., Equation 5.22). The same is not true for curvilinear regres-
sion models, as Lüdtke et al. (2020b) show that a quadratic function induces a quartic
exponential distribution (Cobb, Koppstein, & Chen, 1983; Matz, 1978) that usually isn’t
normal and may even have multiple modes. Depending on the composition of the analy-
sis model, the task of deriving this distribution falls somewhere between very difficult
and intractable. Fortunately, we can use the Metropolis sampler to draw imputations
from this nonstandard target function.

Analysis Example
Continuing with the math achievement example, I used the partially factored regression
specification to illustrate Bayesian estimation for the curvilinear regression model from
Equation 5.28. As explained previously, a partially factored specification that assigns a
multivariate distribution to the predictors is ideally suited for models with centered pre-
dictors, because the grand means are iteratively updated model parameters. The poten-
tial scale reduction factor diagnostic (Gelman & Rubin, 1992) suggested that the MCMC
algorithm converged in fewer than 500 iterations, so I continue using 11,000 total itera-
tions with a conservative 1,000-iteration burn-in period. Analysis scripts are available
on the companion website.
Table 5.3 summarizes the posterior distributions of the model parameters, and Fig-
ure 5.10 shows the regression line based on the posterior medians (averaging over the
other predictors). As a comparison, Table 3.9 gives maximum likelihood estimates from
the factored regression model. I omit the regressor model parameters, because they are
not the substantive focus. Because of centering, the lower-order anxiety slope (Mdnβ1 =
–0.26, SD = 0.09) reflects the influence of this variable on math achievement at the anxi-
ety mean (i.e., instantaneous rate of change in the outcome when the predictor equals 0).
The negative curvature coefficient (Mdnβ2 = –0.014, SD = 0.006) indicates that the anxi-
ety slope became more negative as anxiety increased. This interpretation is clear from
Figure 5.10, where the regression function is concave down. The curvature parameter is
unlikely to equal 0, because this null value falls outside the 95% credible interval (albeit
barely). Finally, note that the maximum likelihood results are numerically
equivalent but with frequentist interpretations. This has been a recurring theme across
several examples.

TABLE 5.3. Posterior Summary from the Curvilinear Regression


Parameter Mdn SD LCL UCL
β0 42.30 3.57 35.44 49.39
β1 (ANXIETY) –0.26 0.09 –0.42 –0.09
β2 (ANXIETY2) –0.014 0.006 –0.026 –0.002
β3 (FRLUNCH) –3.68 1.07 –5.78 –1.60
β4 (MATHPRE) 0.37 0.07 0.24 0.50
β5 (MALE) –4.05 1.04 –6.04 –2.01
σε2 52.16 5.33 42.93 63.92
R2   .43 .05   .32   .52

Note. LCL, lower credible limit; UCL, upper credible limit.


FIGURE 5.10. Estimated regression line from the curvilinear regression analysis, averaging
over the covariates.

5.8 AUXILIARY VARIABLES

A conditionally MAR mechanism is usually the default assumption for contemporary
missing data-handling procedures, including Bayesian estimation. This process stipu-
lates that whether a person has missing values depends strictly on observed data, and
the unseen scores themselves are unrelated to missingness. In practical terms, the def-
inition implies that the focal analysis model should include all important correlates
of missingness, as omitting such a variable could result in a bias-­inducing MNAR-by-­
omission process if the semipartial correlations are strong enough. Chapter 1 described
an inclusive analysis strategy that fine-tunes a missing data analysis by introducing
extraneous auxiliary variables into the model (Collins et al., 2001; Schafer & Graham,
2002). Adopting such a strategy can reduce nonresponse bias, improve precision, or
both. Sections 1.5 and 1.6 provide guidance on this topic.
Section 3.10 described four strategies for introducing auxiliary variables into a
maximum likelihood analysis, three of which readily extend to Bayesian estimation.
Graham’s (2003) saturated correlates and extra dependent variable models work well for
multivariate analyses cast in the structural equation modeling framework. These mod-
els, which are primarily designed for multivariate normal data, use a particular configu-
ration of residual correlations and regression slopes to connect the auxiliary variables to
the focal analysis variables. Figures 3.8 through 3.11 show prototypical path diagrams
for these models. The factored regression or sequential specification is an alternative
option that is well suited for analyses with interactions or nonlinear effects or mixtures
of categorical and continuous variables. The mechanics of implementing this strategy
are identical to those used for the interactive and curvilinear models described previ-
ously.

Analysis Example
This example uses the psychiatric trial data on the companion website to illustrate a
Bayesian regression analysis with auxiliary variables. The data, which were collected
as part of the National Institute of Mental Health Schizophrenia Collaborative Study,
consist of four illness severity ratings, measured in half-point increments ranging from
1 (normal, not at all ill) to 7 (among the most extremely ill). In the original study, the 437
participants were assigned to one of four experimental conditions (a placebo condition
and three drug regimens), but the data collapse these categories into a dichotomous
treatment indicator (DRUG = 0 for the placebo group, and DRUG = 1 for the combined
medication group). The researchers collected a baseline measure of illness severity prior
to randomizing participants to conditions, and they obtained follow-­up measurements
1 week, 3 weeks, and 6 weeks later. The overall missing data rates for the repeated mea-
surements were 1, 3, 14, and 23%, respectively.
The focal regression model predicts illness severity ratings at the 6-week follow-­up
assessment from baseline severity ratings, gender, and the treatment indicator.

SEVERITY6i = β0 + β1(DRUGi) + β2(SEVERITY0i − μ2) + β3(MALEi − μ3) + εi (5.31)
εi ~ N1(0, σε²)
Centering the baseline scores and male dummy code at their grand means facilitates
interpretation, as this defines β0 and β1 as the placebo group average and group mean
difference, respectively, marginalizing over the covariates.
This small data set offers limited choices for auxiliary variables, but the illness
severity ratings at the 1-week and 3-week follow-­up assessments are excellent candi-
dates, because they have strong semipartial correlations with the dependent variable
(r = .40 and .61, respectively) and uniquely predict its missingness. Following estab-
lished procedures, a factored regression specification features a sequence of univariate
distributions, each of which corresponds to a regression model. To maintain the desired
interpretation of the focal model parameters, it is important to specify a sequence where
the analysis variables predict the auxiliary variables and not vice versa. The factoriza-
tion for this analysis is as follows:

f(SEVERITY3 | SEVERITY1, SEVERITY6, SEVERITY0, DRUG, MALE) ×
f(SEVERITY1 | SEVERITY6, SEVERITY0, DRUG, MALE) × (5.32)
f(SEVERITY6 | SEVERITY0, DRUG, MALE) × f(SEVERITY0 | DRUG, MALE) ×
f(DRUG* | MALE) × f(MALE*)
The first two terms are auxiliary variable distributions that derive from linear regres-
sion models, the third term corresponds to the focal analysis, the fourth term is a linear
regression model for the incomplete baseline scores, and the final two terms are regres-
sions for the complete predictors (which I ignore, because these variables do not require
distributions). The auxiliary variable regressions are shown below, and the predictor
distributions follow earlier examples:

SEVERITY3i = γ02 + γ12(SEVERITY1i) + γ22(SEVERITY6i) + γ32(DRUGi)
           + γ42(SEVERITY0i) + γ52(MALEi) + r2i   (5.33)

SEVERITY1i = γ01 + γ11(SEVERITY6i) + γ21(SEVERITY0i) + γ31(DRUGi) + γ41(MALEi) + r1i

Figure 3.13 shows a path diagram of the models for this example.
The PSRF (Gelman & Rubin, 1992) diagnostic indicated that the MCMC algorithm
converged in fewer than 400 iterations, so I continue using 11,000 total iterations with
a conservative 1,000-iteration burn-in period. Analysis scripts are available on the com-
panion website. Table 5.4 summarizes the posterior distributions of the model param-
eters with and without auxiliary variables. In the interest of space, I omit the auxiliary
variable and covariate model parameters, because they are not the substantive focus.
Although the auxiliary variables change the numerical results, they do not affect the
interpretation of the focal model parameters; the intercept coefficient is the placebo
group mean at the 6-week follow-up (Mdnβ0 = 4.41, SD = 0.16), and the posterior median of β1
gives the group mean difference for the medication condition (Mdnβ1 = –1.45, SD = 0.18),
controlling for covariates. Perhaps not surprisingly, maximum likelihood estimates
were numerically equivalent, albeit with frequentist interpretations (see Table 3.10).
Conditioning on the auxiliary variables had a substantial impact on key parameter
estimates; the intercept coefficients (placebo group means) differed by nearly three-fourths
of a posterior standard deviation, and the slope coefficients (medication group
mean differences) differed by more than one standard deviation. Although the natural
inclination is to favor the analysis with auxiliary variables, there is no way to know for

TABLE 5.4. Posterior Summary of Regression Parameters with and without Auxiliary Variables

                       No AVs           AVs
Effect                Mdn     SD       Mdn     SD
Intercept (β0)        4.29    0.17     4.41    0.16
DRUG (β1)            –1.24    0.19    –1.45    0.19
SEVERITY0 (β2)        0.31    0.09     0.27    0.09
MALE (β3)             0.22    0.15     0.22    0.15
σε²                   1.91    0.15     2.03    0.16
R²                     .16     .04      .19     .04

sure which is more correct, as conditioning on the wrong set of variables can exacerbate
nonresponse bias, at least hypothetically (Thoemmes & Rose, 2014). Nevertheless, the
differences are consistent with the shift from an MNAR-by-­omission mechanism to a
more MAR-like process. The fact that the auxiliary variables have strong semipartial
correlations with the dependent variable (e.g., greater than .40) and uniquely predict its
missingness reinforces this conclusion.

5.9 MULTIVARIATE NORMAL DATA

The multivariate normal distribution is often a reasonable way to assign a distribution
to variables that wouldn't otherwise need one, and it is foundational to some but not
all Bayesian structural equation modeling approaches (Asparouhov & Muthén, 2010a;
Kaplan & Depaoli, 2012). Bayesian estimation for multivariate normal data is also
historically significant in the missing data literature, because it is the basis for the joint
model multiple imputation framework popularized by Joe Schafer (1997). Although the
mathematical machinery for joint model imputation is identical to what I describe in
this section, our current focus is estimating and interpreting the parameter values them-
selves, and imputations are just a means to that end. I return to multiple imputation in
Chapter 7.
I use the employee data from the companion website to provide a substantive con-
text. The data set includes several workplace-­related variables (e.g., work satisfaction,
turnover intention, employee–­supervisor relationship quality) for a sample of N = 630
employees. The illustration uses a 7-point work satisfaction rating (1 = extremely dis-
satisfied to 7 = extremely satisfied) and two composite scores that measure employee
empowerment and a construct known as the leader–­member exchange scale (the quality
of an employee’s relationship with his or her supervisor). For now, I treat work satisfac-
tion as a normally distributed variable, because it has a sufficient number of response
options and a symmetrical distribution (Rhemtulla et al., 2012). The work satisfaction
ratings have a 4.8% missing data rate, the employee empowerment variable has 16.2% of
its scores missing, and 4.1% of the leader–­member exchange values are incomplete. The
Appendix describes the data set and variable definitions.
The empty regression models for the multivariate analysis are as follows:

     [WORKSATi]   [Y1i]   [μ1]   [ε1i]
Yi = [EMPOWERi] = [Y2i] = [μ2] + [ε2i] = μ + εi   (5.34)
     [LMXi]      [Y3i]   [μ3]   [ε3i]

        [μ1]  [σ1²  σ12  σ13]
Yi ~ N3([μ2], [σ21  σ2²  σ23])
        [μ3]  [σ31  σ32  σ3²]
  
As a reminder, N3 denotes a trivariate normal distribution, and the first and second terms
inside the normal distribution function are the mean vector and covariance matrix,

μ and Σ. Writing the model this way emphasizes that the analysis comprises entirely
dependent variables, and there are no incomplete predictors to worry about.

Missing Data Imputation


Every iteration of the MCMC algorithm performs the following operations: (1) estimates
the mean vector conditional on the current covariance matrix and the filled-­in data,
(2) estimates the covariance matrix conditional on the new mean vector and the cur-
rent data, and (3) updates the missing values conditional on the current estimates of
the model parameters. The estimation steps for μ and Σ are the same as those in Section
4.10, because they leverage a complete data set. All that’s left is to figure out the final
imputation step.
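To make the three-step cycle concrete, here is a stripped-down Python sketch for a single normally distributed variable rather than a mean vector and covariance matrix; the data values, iteration count, and draw formulas are illustrative assumptions, not the chapter's algorithm:

```python
import math
import random
import statistics

random.seed(1)

# Toy data: one normal variable with missing values coded as None.
y = [4.2, 5.1, 3.8, 6.0, 4.9, None, 5.5, 4.4, None, 5.2]

# Initialize the missing values at the observed-data mean.
obs = [v for v in y if v is not None]
filled = [v if v is not None else statistics.mean(obs) for v in y]
mu, var = statistics.mean(filled), statistics.variance(filled)

draws = []
for t in range(2000):
    n = len(filled)
    # (1) Update the mean conditional on the variance and the filled-in data.
    mu = random.gauss(statistics.mean(filled), math.sqrt(var / n))
    # (2) Update the variance conditional on the new mean
    #     (a scaled inverse chi-square draw via the gamma distribution).
    ss = sum((v - mu) ** 2 for v in filled)
    var = ss / random.gammavariate(n / 2, 2)
    # (3) Re-impute each missing value from its model-predicted distribution.
    filled = [v if v is not None else random.gauss(mu, math.sqrt(var)) for v in y]
    draws.append(mu)

posterior_median_mu = statistics.median(draws[500:])  # discard burn-in draws
```

Each pass through the loop mirrors the three operations above: the parameters are updated against the current filled-in data, and the imputations are then refreshed against the new parameters.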
Coming full circle back to the beginning of the chapter, recall that the dependent
variable’s imputations equal a predicted value plus a random noise term. The same is
true here, but we need to convert the estimated mean vector and covariance matrix into
a series of regression models, one for each missing data pattern. To illustrate, consider
participants with missing data on Y1. Imputation for this pattern requires the regression
of Y1 on Y2 and Y3 (e.g., the regression of work satisfaction on employee empowerment
and leader–­member exchange). Because this pattern has only one incomplete variable,
the posterior predictive distribution of the missing values is a univariate normal curve
with a predicted value and residual variance defining its center and spread.

Y1i = γ0 + γ1(Y2i) + γ2(Y3i) + ri = E(Y1i | Y2i, Y3i) + ri   (5.35)

f(Y1i(mis) | parameters, data) = N1(E(Y1i | Y2i, Y3i), σr²)
Importantly, the regression coefficients and residual variance are a deterministic trans-
formation of the elements in μ(t) and Σ(t) rather than estimated parameters (the equations
are given below). As a second example, consider participants with missing data on Y1
and Y3. Imputation for this pattern requires the multivariate regression of the incom-
plete pair on Y2 (e.g., the regression of work satisfaction and leader–­member exchange
on empowerment). The following bivariate normal distribution generates correlated
pairs of imputations:

[Y1i]                       [E(Y1i | Y2i)]
[Y3i] = γ0 + γ1(Y2i) + ri = [E(Y3i | Y2i)] + ri   (5.36)

f(Y1i(mis), Y3i(mis) | parameters, data) = N2([E(Y1i | Y2i), E(Y3i | Y2i)]′, Σr)
where γ0 and γ1 contain a pair of intercepts and slopes, respectively, one per incomplete
variable, and ri is a pair of correlated residuals. Although the distribution of missing val-
ues is now bivariate normal, the construction of the imputations is the same as before,
with one minor modification—­each imputation equals a predicted value plus a cor-

related noise term that preserves the unexplained part of its association with the other
missing score.
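The "predicted value plus correlated noise" construction can be sketched directly. In this hypothetical Python snippet, the predicted values and the 2 × 2 residual covariance matrix are made-up inputs; a Cholesky factor turns two independent standard normal draws into a correlated residual pair:

```python
import math
import random

random.seed(2)

def correlated_residual(cov):
    """Draw one residual pair from N2(0, cov) using a 2 x 2 Cholesky factor."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return (l11 * z1, l21 * z1 + l22 * z2)

# Hypothetical predicted values for the two missing scores and the residual
# covariance matrix implied by the current mean vector and covariance matrix.
pred_y1, pred_y3 = 4.1, 9.5
resid_cov = [[1.2, 0.6], [0.6, 8.4]]

e1, e3 = correlated_residual(resid_cov)
imp_y1, imp_y3 = pred_y1 + e1, pred_y3 + e3  # imputation = prediction + noise
```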
To reiterate, the regression parameters that define the distributions of imputations
are just transformations of the estimated parameters, which are the mean vector and
covariance matrix. For completeness, the remainder of this section shows how to con-
vert μ and Σ into the necessary quantities. Readers who are not interested in these fine-­
grained details can skip to the analysis example without losing important information.
To begin, consider the regression of Y1 on Y2 and Y3 in Equation 5.35. To get these regres-
sion parameters, MCMC partitions the mean vector and covariance matrix at iteration
t into blocks, such that μ(com) and Σ(com) are submatrices corresponding to the complete
variables for this pattern, μ(mis) and Σ(mis) are the parameters of the missing variables, and
Σ(mc) contains covariances between the missing and complete variables. The partitions
for the univariate pattern are as follows:

    [μ(mis)]       [Σ(mis)  Σ(mc) ]
μ = [μ(com)]   Σ = [Σ(cm)   Σ(com)]   (5.37)

μ(mis) = (μ1)   μ(com) = [μ2]
                         [μ3]

Σ(mis) = (σ1²)   Σ(mc) = (σ12  σ13)   Σ(com) = [σ2²  σ23]
                                               [σ32  σ3²]
Similarly, the multivariate regression in Equation 5.36 requires the following partition:

μ(mis) = [μ1]   μ(com) = (μ2)   (5.38)
         [μ3]

Σ(mis) = [σ1²  σ13]   Σ(mc) = [σ12]   Σ(com) = (σ2²)
         [σ31  σ3²]           [σ32]
Finally, the regression model parameters are transformations of these submatrices:

γ = Σ(mc) Σ(com)^(–1)

γ0 = μ(mis) − γμ(com)   (5.39)

Σr = Σ(mis) − γΣ(com)γ′

where γ contains regression slopes, γ0 contains intercepts, and Σr is the residual covari-
ance matrix (or variance in patterns with a single incomplete variable).
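The following Python sketch traces Equations 5.37 and 5.39 for the pattern with Y1 missing. The mean vector and covariance matrix are illustrative placeholders (loosely patterned after the work satisfaction variables), not estimates from the companion data:

```python
# Placeholder parameter values for (Y1, Y2, Y3); not real estimates.
mu = [4.0, 28.6, 9.6]
sigma = [[1.6, 1.7, 1.6],
         [1.7, 19.8, 5.7],
         [1.6, 5.7, 9.2]]

# Partition into missing and complete blocks (Equation 5.37).
mu_mis, mu_com = mu[0], [mu[1], mu[2]]
s_mis = sigma[0][0]
s_mc = [sigma[0][1], sigma[0][2]]
s_com = [[sigma[1][1], sigma[1][2]],
         [sigma[2][1], sigma[2][2]]]

# Invert the 2 x 2 complete-variable block.
det = s_com[0][0] * s_com[1][1] - s_com[0][1] * s_com[1][0]
s_com_inv = [[s_com[1][1] / det, -s_com[0][1] / det],
             [-s_com[1][0] / det, s_com[0][0] / det]]

# Slopes, intercept, and residual variance (Equation 5.39).
slopes = [s_mc[0] * s_com_inv[0][0] + s_mc[1] * s_com_inv[1][0],
          s_mc[0] * s_com_inv[0][1] + s_mc[1] * s_com_inv[1][1]]
intercept = mu_mis - (slopes[0] * mu_com[0] + slopes[1] * mu_com[1])
resid_var = s_mis - (slopes[0] * s_mc[0] + slopes[1] * s_mc[1])
```

The resulting intercept, slopes, and residual variance define the univariate normal distribution in Equation 5.35 from which the imputation is drawn.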

Analysis Example
Continuing with the employee data example, I apply Bayesian missing data handling
to estimate the mean vector and variance–­covariance matrix from the model in Equa-

tion 5.34. To explore the influence of different prior distributions, I implemented the
­Wishart specifications described in Section 4.10 and a separation strategy that specifies
distinct priors for variances and correlations (Merkle & Rosseel, 2018). Following earlier
examples, I used PSRFs to determine the burn-in period, and I based the final analyses
on 10,000 MCMC iterations. Estimation scripts are available on the companion website.
Table 5.5 gives Bayesian summaries of the means, standard deviations, variances
and covariances, and correlations. Because the choice of prior had little impact (e.g.,
differences were similar to those in Table 4.4), Table 5.5 shows results from an improper
inverse Wishart prior with S0 = 0 and df0 = –V – 1 (Asparouhov & Muthén, 2010a).
Note that the standard deviations and correlations are deterministic functions of the
estimated variances and covariances at each iteration (e.g., a correlation is a covariance
divided by square root of the product of two variances). As a comparison, Table 3.2 gives
the corresponding maximum likelihood estimates. Consistent with other examples, the
two estimators produced similar numerical results, albeit with different perspectives on
inference. This won’t necessarily be true in smaller samples, where the choice of prior
distribution could be more impactful.
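The conversion from covariance draws to correlation draws happens at every iteration before the chain is summarized. The handful of draws below are fabricated purely to show the mechanics:

```python
import math
import statistics

# Fabricated MCMC draws of one covariance and its two variances.
cov_draws = [1.70, 1.62, 1.81, 1.75, 1.66]
var1_draws = [1.60, 1.55, 1.71, 1.64, 1.58]
var2_draws = [19.9, 19.2, 20.6, 20.1, 19.5]

# Transform each iteration's draws, then summarize the transformed chain.
corr_draws = [c / math.sqrt(v1 * v2)
              for c, v1, v2 in zip(cov_draws, var1_draws, var2_draws)]
posterior_median_corr = statistics.median(corr_draws)
```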

TABLE 5.5. Posterior Summary of Descriptives and Bivariate Associations


Effect Mdn SD LCL UCL
Means
Work Satisfaction 3.98 0.05 3.88 4.08
Empowerment 28.59 0.19 28.22 28.98
LMX 9.62 0.12 9.38 9.86

Standard deviations
Work Satisfaction 1.27 0.04 1.21 1.35
Empowerment 4.45 0.14 4.19 4.74
LMX 3.04 0.09 2.87 3.22

Variances and covariances


Work Satisfaction 1.62 0.09 1.45 1.82
Empowerment 19.80 1.25 17.52 22.44
LMX 9.22 0.53 8.26 10.34
Work Satisfaction ↔ Empowerment 1.73 0.26 1.23 2.27
Work Satisfaction ↔ LMX 1.63 0.17 1.31 1.99
Empowerment ↔ LMX 5.71 0.66 4.48 7.08

Correlations
Work Satisfaction ↔ Empowerment   .31 .04   .22   .38
Work Satisfaction ↔ LMX   .42 .03   .35   .49
Empowerment ↔ LMX   .42 .04   .35   .50

Note. LCL, lower credible limit; UCL, upper credible limit; LMX, leader–member exchange.

5.10 SUMMARY AND RECOMMENDED READINGS

This chapter has illustrated missing data handling in the Bayesian framework. Follow-
ing ideas established in Chapter 4, MCMC breaks estimation problems into a series of
steps that address one parameter (or block of similar parameters) at a time, while treat-
ing other parameters as known constants. With missing data, the algorithm then uses
the newly updated estimates to construct model-­predicted distributions of the missing
data, from which it samples imputations. A Bayesian analysis is like maximum likeli-
hood in the sense that model parameters are the focus, but all missing data handling is
done via imputation; missing values are just another unknown for MCMC to estimate.
Much of the chapter has focused on specifying models as factored regression equa-
tions (Ibrahim et al., 2002; Lüdtke et al., 2020b). I introduced this approach in Chapter
3, where it was a flexible strategy for assigning distributions to incomplete predictor
variables. The procedure is similarly flexible for Bayesian estimation and works in much
the same way. The missing data distributions defined under this approach are usually
quite complicated and may depend on multiple sets of model parameters (e.g., the distri-
bution of an incomplete predictor is always the product of two or more distributions and
corresponding model parameters). Nevertheless, imputations always have a straightfor-
ward interpretation as the sum of a predicted value plus a normally distributed noise
term.
Looking ahead, Chapter 6 describes Bayesian estimation for binary, ordinal, and
multicategorical nominal variables. The procedure builds on the sequential specification
from this chapter, but probit regressions with latent response variables replace linear
models with continuous outcomes. As you will see, the latent response variable frame-
work is convenient, because it reuses MCMC estimation steps for continuous variables.
Finally, I recommend the following articles for readers who want additional details on
topics from this chapter:

Ibrahim, J. G., Chen, M. H., & Lipsitz, S. R. (2002). Bayesian methods for generalized linear
models with covariates missing at random. Canadian Journal of Statistics, 30, 55–78.

Kim, S., Sugar, C. A., & Belin, T. R. (2015). Evaluating model-based imputation methods for
missing covariates in regression models with interactions. Statistics in Medicine, 34, 1876–
1888.

Lüdtke, O., Robitzsch, A., & West, S. G. (2020). Regression models involving nonlinear effects
with missing data: A sequential modeling approach using Bayesian estimation. Psychologi-
cal Methods, 25, 157–181.

McNeish, D. (2016). On using Bayesian methods to address small sample problems. Structural
Equation Modeling: A Multidisciplinary Journal, 23, 750–773.

Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psycho-
logical Methods, 22, 649–666.
6

Bayesian Estimation for Categorical Variables

6.1 CHAPTER OVERVIEW

In the not so distant past, the predominant method for dealing with incomplete cat-
egorical variables was to impute them as though they were normally distributed and
apply a rounding scheme to convert continuous imputes to discrete values (Allison,
2002, 2005; Bernaards, Belin, & Schafer, 2007; Horton, Lipsitz, & Parzen, 2003; Yucel,
He, & Zaslavsky, 2008, 2011). Fortunately, these ad hoc approaches are unnecessary at
this point in the evolutionary tree, as user-­friendly tools for conducting Bayesian analy-
ses with categorical variables are widely available. This chapter describes estimation
and missing data handling for binary, ordinal, and multicategorical nominal variables. I
focus primarily on the probit regression framework that views categorical responses as
originating from one or more latent response variables. This approach is widely cited in
the missing data literature and readily integrates with the Bayesian estimation routines
from Chapters 4 and 5.
A good deal of methodological work on Bayesian estimation for categorical vari-
ables traces to a seminal paper by Albert and Chib (1993). They describe a data augmen-
tation approach that supplements the categorical scores with latent response variable
estimates. The appeal of their method is that given a full sample of latent scores, MCMC
can simply recycle estimation steps for continuous variables. This makes dealing with
categorical variables straightforward, because we only need to learn how to create the
underlying latent response scores. As you will see, data augmentation is essentially an
extreme form of imputation in which 100% of the sample has missing data on these
variables. As an aside, the literature also describes latent variable data augmentation
for logistic regression models (Asparouhov & Muthén, 2021b; Frühwirth-­Schnatter
& Frühwirth, 2010; Holmes & Held, 2006; O’Brien & Dunson, 2004; Polson, Scott, &
Windle, 2013), but the probit model is currently the norm for missing data handling
(Asparouhov & Muthén, 2010c; Carpenter, Goldstein, & Kenward, 2011; Carpenter &

Kenward, 2013; Enders et al., 2020; Enders, Keller, & Levy, 2018; Goldstein, Carpenter,
Kenward, & Levin, 2009).
The chapter begins by describing the latent response formulation for a binary out-
come. This approach readily extends to ordinal outcomes with relatively little modifica-
tion, and it also provides a foundation for understanding the multinomial probit model
for multicategorical nominal variables. The chapter concludes with a brief discussion
of logistic regression. Extending ideas from earlier chapters, missing data imputation
for categorical explanatory variables requires a distribution for these variables. The fac-
tored regression modeling strategy used throughout the book also applies to categorical
variables, and the only change is that probit models replace linear regressions.

6.2 LATENT RESPONSE FORMULATION FOR CATEGORICAL VARIABLES

I use the employee data on the companion website to illustrate the latent variable for-
mulation for binary and ordinal variables. The data set includes several work-related
variables (e.g., work satisfaction, turnover intention, employee–supervisor relationship
quality) for a sample of N = 630 employees. I begin with a dichotomous measure of
turnover intention that equals 0 if an employee has no plan to leave his or her position
and 1 if the employee has intentions of quitting. The bar graph in Figure 6.1 shows the
distribution of discrete responses.
[Figure omitted: bar graph with Percent (0–100) on the vertical axis and the two Turnover Intention categories on the horizontal axis.]

FIGURE 6.1. Bar graph of the dichotomous measure of turnover intention (0 = an employee has no plan to leave her or his position, and 1 = the employee has intentions of quitting).

Probit regression envisions the binary scores originating from an underlying latent
response variable that represents one’s underlying proclivity or propensity to endorse
the highest category (Agresti, 2012; Johnson & Albert, 1999). Applied to the turnover
intention measure, this latent variable represents an unobserved, continuous dimen-
sion of intentions to quit. To illustrate, Figure 6.2 shows the latent variable distribution
for the bar graph in Figure 6.1. The vertical line represents the precise cutoff point or
threshold in the latent distribution where discrete scores switch from 0 to 1 (or more
generally, from the lowest code to the highest code). The areas under the curve above
and below this threshold correspond to the category proportions in the bar chart; that
is, 69% of the area under the curve falls below the threshold, and 31% falls above in the
shaded region. Using generic notation, the link between the latent scores and categorical
responses is

Yi = { 0 if Yi* ≤ τ
     { 1 if Yi* > τ   (6.1)

where Yi is the binary outcome for individual i, Yi* is the corresponding latent response
score, and τ is the threshold parameter (the vertical line in Figure 6.2).
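Both pieces of Equation 6.1 are easy to sketch in Python: the deterministic mapping from latent scores to discrete codes, and the model-implied category proportions as areas under a unit-variance normal curve. With the latent mean near –0.50 (an assumed value matching Figure 6.2), about 69% of the area falls below the threshold:

```python
from statistics import NormalDist

tau = 0.0           # threshold fixed at 0 for identification
latent_mean = -0.5  # assumed grand mean of the latent response variable

def categorize(y_star, tau=0.0):
    """Equation 6.1: scores at or below tau map to 0, scores above tau map to 1."""
    return 0 if y_star <= tau else 1

# With the variance fixed at 1, the implied category proportions are
# simply areas under the latent normal curve.
p_no = NormalDist(latent_mean, 1.0).cdf(tau)  # area below the threshold
p_yes = 1.0 - p_no                            # shaded area above the threshold
```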
I introduced the following notation for the probit model earlier in the book:

[Figure omitted: normal curve over Latent Variable Score (–4 to 4) with Relative Probability on the vertical axis and regions labeled Y = 0 and Y = 1.]

FIGURE 6.2. Latent response variable distribution for a binary variable. The vertical line at 0 is a threshold parameter that divides the latent distribution into two discrete segments. The shaded region represents the proportion of employees who intend to quit.

Yi* = β0 + εi = E(Yi*) + εi   (6.2)

Yi* ~ N1(E(Yi*), 1)

To refresh, E(Yi*) is the predicted latent response score (i.e., conditional mean), N1 denotes
a univariate normal distribution, and the first and second terms inside the normal dis-
tribution function are its mean and variance, respectively. Because the latent scores
are completely missing, the probit model requires two identification constraints that
establish a metric for the latent response scores. First, fixing either the latent response
variable’s mean or threshold to 0 establishes the mean structure. I always adopt the
strategy of fixing the threshold and estimating the mean. With no explanatory variables
in the model, β0 is simply the grand mean of the latent response variable, which you can
see is approximately located at –0.50 in Figure 6.2. Second, the model scales the latent
response variable as a z-score by fixing its variance to 1. The second term in the normal
distribution function reflects this constraint.
The probit model for ordered categorical variables incorporates additional thresh-
old parameters but is otherwise identical to the binary model. Continuing with the
employee data set, Figure 6.3 shows a bar graph of the 7-point work satisfaction rating
scale (1 = extremely dissatisfied to 7 = extremely satisfied). The latent variable regression
model is the same as Equation 6.2. This variable requires six threshold parameters to
carve the latent response distribution into seven discrete regions. More generally, the
number of thresholds is always one fewer than the number of response options.

[Figure omitted: bar graph with Percent (0–40) on the vertical axis and Work Satisfaction response categories 1–7 on the horizontal axis.]

FIGURE 6.3. Bar graph of a 7-point work satisfaction rating scale ranging from 1 = extremely dissatisfied to 7 = extremely satisfied.

The link function that relates the latent response scores to the discrete categories is shown below:
Yi = { 1 if τ0 < Yi* ≤ τ1
     { 2 if τ1 < Yi* ≤ τ2   (6.3)
     { ⋮
     { C if τC−1 < Yi* ≤ τC

For notational convenience, it is useful to define two additional faux thresholds, τ0 = −∞
and τC = ∞, that bound the lowest and highest response categories. The equation says
that latent scores are constrained to a particular region of the normal distribution, con-
tingent on the categorical response. Like the binary model, the mean structure requires
an identification constraint, and I always anchor the latent distribution by fixing τ1 at 0
and estimating the grand mean.
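Equation 6.3 amounts to a threshold lookup. The sketch below implements it with a sorted search; aside from the fixed τ1 = 0, the threshold values are invented for illustration:

```python
from bisect import bisect_left

# Six thresholds carve the latent distribution into seven ordinal categories.
# tau_1 is fixed at 0 for identification; the remaining values are hypothetical.
thresholds = [0.0, 0.6, 1.1, 1.7, 2.3, 3.0]

def categorize_ordinal(y_star, thresholds):
    """Map a latent score to a category code 1, ..., C per Equation 6.3."""
    return bisect_left(thresholds, y_star) + 1
```

Here bisect_left counts the thresholds strictly below the latent score (ties fall in the lower category), which reproduces the τ(c−1) < Y* ≤ τc intervals.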
In some situations, the latent response variable is just a mathematical device for
representing discrete scores in an analysis or imputation model. For example, you could
use the binary probit model to impute an incomplete demographic characteristic such
as biological sex, even though the underlying latent scores have no real-world meaning.
In other situations, the latent response variable could be substantively meaningful, and
it could even be used in lieu of the categorical variable. For example, Gelman (2004)
describes a Presidential opinion poll where modeling a latent candidate preference
revealed interesting insights during an election cycle, and Muthén et al. (2016) discuss
the possibility of replacing a binary mediator with its underlying latent response scores.
Chapter 10 illustrates an application of multiple imputation that creates latent variable
scores for use in another analysis.

6.3 REGRESSION WITH A BINARY OUTCOME

Having covered the basics, we can now add explanatory variables to the latent response
model. To illustrate, consider a simple regression with leader–­ member exchange
(employee–­supervisor relationship quality) predicting turnover intention. The latent
variable regression model is as follows:

TURNOVERi* = β0 + β1(LMXi) + εi   (6.4)

Yi* = β0 + β1(Xi) + εi = E(Yi* | Xi) + εi

Yi* ~ N1(E(Yi* | Xi), 1)
The bottom row of the expression says that the latent distribution for participant i is
now centered at a predicted value, and the residual variance is fixed at 1 to establish a
metric (i.e., the conditional distribution of the latent scores is scaled as a z-score). As
before, the model features a fixed threshold parameter that divides the latent variable
distribution into two segments (see Equation 6.1).
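Under this setup, the predicted probability of quitting is the area of the unit-variance latent distribution above the threshold at 0. A small sketch with made-up coefficients (β0 = 1.2 and β1 = –0.15 are assumptions for illustration, not estimates from the employee data):

```python
from statistics import NormalDist

# Hypothetical coefficients for the latent turnover regression (Equation 6.4).
beta0, beta1 = 1.2, -0.15

def p_quit(lmx):
    """Predicted probability that Y = 1: the area above the threshold at 0."""
    predicted_latent = beta0 + beta1 * lmx
    return 1.0 - NormalDist(predicted_latent, 1.0).cdf(0.0)
```

Because β1 is negative, the predicted probability shrinks as leader–member exchange increases.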

[Figure omitted: three normal curves plotted against Leader–Member Exchange (Relationship Quality) (0–20), with Latent Turnover Intention Score (–6 to 6) on the vertical axis.]

FIGURE 6.4. Latent variable distributions for three participants with different values of leader–member exchange. The black dots represent the means of the latent distributions, and the area above the threshold parameter (shaded in gray) conveys predicted probabilities.

Figure 6.4 shows the latent response distributions at three values of the explanatory variable. The black dots represent predicted values, and the area above the threshold parameter (shaded in gray) in each distribution is the predicted probability of quitting (see Equation 2.67). Figure 6.4 shows that the likelihood of quitting decreases as relationship quality (leader–member exchange) increases along the horizontal axis (i.e., the β1 coefficient is negative).

Likelihood, Prior, and Posterior Distribution


Chapter 2 showed a probit likelihood function where each person’s contribution to esti-
mation is an area under the normal curve (see Equation 2.69). This chapter uses a dif-
ferent likelihood expression based on the estimated latent response scores, which are
a natural by-product of MCMC estimation. I previously adopted noninformative prior
distributions for means and coefficients that are flat across the parameter’s entire range,
and I do the same here. This prior is f(β) ∝ 1. The residual variance is a fixed constant
and does not require a prior distribution. The posterior distribution—the product of
the likelihood and the prior—is a multivariate function that describes the relative prob-
ability of different combinations of the coefficients and latent response scores, given the
data.
f(β, Y* | data) = f(β) × ∏(i=1 to N) [ exp(−½ (Yi* − (β0 + β1Xi))²)
                × ( I(Yi* ≤ τ)I(Yi = 0) + I(Yi* > τ)I(Yi = 1) ) ]   (6.5)
The likelihood expression to the right of the product operator mimics the one for linear
regression (see Equation 4.21) but features the latent response scores as the outcome
(the residual variance also vanishes, because it is fixed at 1). Visually, the kernel of the
normal curve represents the height of the latent variable distributions in Figure 6.4. The
I(⋅) terms on the right side of the expression are indicator functions that encode the
categorization scheme from Equation 6.1. The function works like a true–false state-
ment, such that each I(⋅) takes on a value of 1 if the condition in parentheses is true and
0 otherwise. The indicator functions are there to ensure that an observation contributes
to the likelihood only if its latent score falls in the region prescribed by the categorical
response.

MCMC Algorithm and Conditional Distributions


Revisiting concepts from previous chapters, the Gibbs sampler algorithm sequentially
estimates each unknown quantity in the posterior by drawing random numbers from
a probability distribution that treats all other parameters as known constants. MCMC
estimation for linear regression follows a two-step recipe: Estimate the coefficients in
β as a block given the current values of the latent data, then update the latent response
scores given the new coefficients (again, the order of the steps typically doesn’t matter).
The recipe below summarizes the algorithmic steps.

Assign starting values to all parameters, latent data, and missing values.
Do for t = 1 to T iterations.
> Estimate coefficients conditional on the latent data.
> Estimate latent response scores conditional on the updated coefficients.
Repeat.

Each estimation step draws synthetic parameter values at random from a probability
distribution. Mechanically, you get these full conditional distributions by multiplying
the prior and the likelihood, then doing some tedious algebra to express the product as
a function of a single unknown. I give these distributions below and point readers to
specialized Bayesian texts for additional details on their derivations (e.g., Hoff, 2009;
Lynch, 2007).
First, the MCMC algorithm estimates regression coefficients by drawing a vector
of random numbers from the multivariate normal conditional distribution that follows:

f(β | Y*, data) ∝ NK+1(β̂, Σβ̂)   (6.6)

β̂ = (X′X)^(–1) X′Y*
Σβ̂ = (X′X)^(–1)

where K is the number of predictors, NK+1 denotes a normal distribution with K + 1
dimensions or variables, Y* is the vector of N latent response scores, and X denotes the
N × (K + 1) matrix of explanatory variables that includes a column of ones for the intercept.
The distribution features latent response scores as the dependent variable but is
otherwise identical to linear regression (see Equation 4.23).
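For a single predictor, the quantities in Equation 6.6 reduce to closed-form sums. This sketch computes β̂ and (X′X)^(–1) from a tiny set of hypothetical latent scores and then draws one coefficient vector using a 2 × 2 Cholesky factor; all numeric inputs are illustrative:

```python
import math
import random

random.seed(4)

# Tiny hypothetical setup: one predictor and current latent response scores.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y_star = [1.1, 0.4, -0.2, -0.9, -1.6]

n = len(x)
sx, sy = sum(x), sum(y_star)
sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, y_star))

# beta_hat = (X'X)^-1 X'Y* for a design matrix with an intercept column.
det = n * sxx - sx * sx
b1_hat = (n * sxy - sx * sy) / det
b0_hat = (sy - b1_hat * sx) / n

# The covariance of the draw is (X'X)^-1 because the residual variance is fixed at 1.
cov = [[sxx / det, -sx / det], [-sx / det, n / det]]
l11 = math.sqrt(cov[0][0])
l21 = cov[0][1] / l11
l22 = math.sqrt(cov[1][1] - l21 ** 2)
z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
beta_draw = (b0_hat + l11 * z0, b1_hat + l21 * z0 + l22 * z1)
```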
The procedure that generates the latent response scores is essentially an imputa-
tion step for a variable with 100% missing data! Returning to Figure 6.4, employees who
intend to quit (i.e., Y = 1) must have latent scores in the shaded region above the thresh-
old, and employees with no quitting intentions (i.e., Y = 0) necessarily have latent scores
below the cutoff point. MCMC honors this restriction by drawing latent scores from a
particular region of the curve. More formally, the algorithm samples latent variable scores
from a truncated normal distribution, which is just a normal curve with one or both of
its tails cut off (e.g., in Figure 6.4, the part of each distribution above the threshold is a
normal distribution with its lower tail truncated at z = 0). The situation changes if the
categorical response is missing, because the algorithm cannot assign the latent imputa-
tion to a particular region of the curve. Instead, MCMC lifts the range restriction and
draws imputations from the entire range of the normal distribution. The full conditional
distribution of the latent scores formalizes these ideas in an equation:

                   { N1(E(Yi* | Xi), 1) × I(Yi* > τ)   if Yi = 1
f(Yi* | β, data) = { N1(E(Yi* | Xi), 1) × I(Yi* ≤ τ)   if Yi = 0   (6.7)
                   { N1(E(Yi* | Xi), 1)                if Yi is missing

Although the indicator functions make it look more complicated than it is, the equation
simply says to draw latent scores from one of two truncated normal curves if the discrete
response is observed and an unrestricted normal distribution otherwise. Specialized
algorithms are available that generate random numbers from truncated normal curves
with no trial and error (Robert, 1995).
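One convenient no-rejection strategy (a basic inverse-CDF sketch, not the Robert, 1995, algorithm itself) draws a uniform number between the normal CDF values at the truncation points and maps it back through the quantile function:

```python
import random
from statistics import NormalDist

random.seed(3)

def truncated_normal(mean, lower=None, upper=None):
    """Inverse-CDF draw from N(mean, 1) restricted to the interval (lower, upper)."""
    dist = NormalDist(mean, 1.0)
    lo = dist.cdf(lower) if lower is not None else 0.0
    hi = dist.cdf(upper) if upper is not None else 1.0
    return dist.inv_cdf(random.uniform(lo, hi))

# An employee who intends to quit (Y = 1) needs a latent score above tau = 0.
draw_above = truncated_normal(-0.8, lower=0.0)
# If the discrete response is missing, the range restriction is lifted.
draw_any = truncated_normal(-0.8)
```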
To illustrate imputation, Figure 6.5 shows the distribution of unrestricted latent
imputations at three values of leader–­member exchange. The contour rings convey the
perspective of a drone hovering over the peak of a bivariate normal distribution, with
smaller contours denoting higher elevation (and vice versa). Candidate imputations fall
exactly on the vertical hashmarks, but I added horizontal jitter to emphasize that more
scores are located at higher contours near the regression line. The MCMC algorithm
generates an imputation by randomly selecting a value from the candidate scores along
the vertical line; for cases with complete data, imputes are sampled from the areas above
or below the threshold, and they are unrestricted if the person’s discrete response is
missing. Figure 6.6 shows a complete set of latent imputations, with black crosshairs
230 APPLIED MISSING DATA ANALYSIS

[Figure 6.5 appears here: latent turnover intention scores (y-axis) plotted against leader–member exchange, or relationship quality (x-axis).]

FIGURE 6.5. Distributions of latent imputations at three values of employee– supervisor rela-
tionship quality. Candidate imputations fall exactly on vertical hashmarks, but I added horizon-
tal jitter to emphasize that more scores are located near the regression line.

and gray circles denoting the two discrete responses. Although we don’t need them right
now, Figure 6.6 highlights that the location of the latent imputations relative to the
threshold defines a corresponding set of discrete imputes; imputes above and below the
threshold are classified as 1’s and 0’s, respectively. As such, missing data imputation for
categorical variables can be viewed as drawing latent and discrete imputes in matched
pairs. The discrete imputes will come into play later with categorical regressors.

Analysis Example
Expanding on the employee turnover example, I fit a binary probit model that features
leader–member exchange, employee empowerment, and a gender dummy code (0 =
female, 1 = male) as predictors of binary turnover intention.

TURNOVER i* = β0 + β1 ( LMX i ) + β2 ( EMPOWER i ) + β3 ( MALEi ) + ε i (6.8)

The missing data rates were approximately 5.1% for the turnover intention indicator, 4.1% for the employee–supervisor relationship quality scale, and 16.2% for the empowerment scale.
Bayesian Estimation for Categorical Variables 231

[Figure 6.6 appears here: imputed latent turnover intention scores (y-axis) plotted against leader–member exchange, or relationship quality (x-axis).]

FIGURE 6.6. Scatterplot of imputed latent response data. The dots represent latent scores for
employees who plan to stay at their job (i.e., TURNOVER = 0), the crosshair symbols denote cases
who intend to quit (i.e., TURNOVER = 1), and the dashed horizontal line at 0 is the threshold
parameter.

Applying established ideas, I used a factored regression specification to assign distributions to the incomplete predictors. A fully sequential specification uses the following factorization:

f(TURNOVER* | LMX, EMPOWER, MALE) × f(LMX | EMPOWER, MALE)
   × f(EMPOWER | MALE) × f(MALE*)      (6.9)
whereas the partially factored model instead assigns a multivariate distribution to the
predictors as follows:

f(TURNOVER* | LMX, EMPOWER, MALE) × f(LMX, EMPOWER, MALE*)      (6.10)

The gender dummy code appears as a latent response variable in models where it functions as a dependent variable (the rightmost terms in both factorizations), but because the dummy code is complete, it does not require a distribution and can instead be treated as a fixed constant (i.e., drop the rightmost term in 6.9, and change the rightmost term in 6.10 to a bivariate distribution where leader–member exchange and empowerment condition on gender).

TABLE 6.1. Posterior Summary from the Binary Probit Regression


Parameter Mdn SD LCL UCL
β0 1.09 0.38 0.33 1.79
β1 (LMX) –0.07 0.02 –0.11 –0.03
β2 (EMPOWER) –0.03 0.02 –0.06 –0.001
β3 (MALE) –0.05 0.11 –0.27 0.17
R2   .09 .03   .04   .15

Note. LCL, lower credible limit; UCL, upper credible limit.

Categorical (especially ordinal) variables often require long burn-in periods, so performing a preliminary diagnostic run to monitor convergence is more important
than ever. The potential scale reduction factor (Gelman & Rubin, 1992) diagnostic indi-
cated that the MCMC algorithm converged in fewer than 300 iterations, so I used a con-
servative burn-in period of 1,000 iterations with 10,000 analysis cycles. The sequential
specification and partially factored regression model are theoretically equivalent in this
example and produce identical results, so I focus on the latter. Analysis scripts are avail-
able on the companion website.
Table 6.1 gives posterior summaries of the regression model parameters. In the inter-
est of space, I omit the covariate model parameters, because they are not the focus. The
intercept coefficient is the predicted z-score of quitting for a female employee with 0’s
on the numerical predictors (essentially, the lowest possible value of the leader–­member
exchange and empowerment scales). Each slope coefficient reflects the expected z-score
change in the latent response variable for a one-unit increase in the predictor, controlling
for other regressors. For example, the leader–­member exchange coefficient indicates that
a one-unit increase in relationship quality is expected to decrease the latent proclivity to quit by 0.07 z-score units (Mdnβ1 = –0.07, SD = 0.02), holding other predictors constant.
Finally, the R2 statistic for the overall model indicates that the set of predictors explained
9% of the variation in the latent response scores (McKelvey & Zavoina, 1975). As a com-
parison, Table 3.11 gives the maximum likelihood estimates for the same model. Follow-
ing earlier analysis examples, the two sets of results are numerically equivalent.

6.4 REGRESSION WITH AN ORDINAL OUTCOME

The probit model for ordered categorical variables incorporates additional threshold
parameters but is otherwise identical to the binary model. Continuing with the employee
data, consider a model that features leader–­member exchange predicting the 7-point
work satisfaction scale (see Figure 6.3). The latent variable regression model is as follows:

WORKSATi* = β0 + β1(LMXi) + εi      (6.11)

Yi* = β0 + β1Xi + εi = E(Yi* | Xi) + εi
Yi* ~ N1(E(Yi* | Xi), 1)

To illustrate the model, Figure 6.7 shows the latent response distribution at three values
of leader–member exchange. The black dots represent predicted scores or conditional
means, and the horizontal dashed lines are threshold parameters (z-score cutoff points)
that carve the continuous distribution into discrete segments. The fact that the thresh-
olds are approximately equidistant is a consequence of the discrete distribution’s sym-
metry and is not an inherent feature of the model. In general, distances between cutoff
points can be quite different, particularly if the discrete distribution is asymmetrical.
Finally, note that the first (lowest) threshold is fixed at z = 0 to anchor the latent mean
structure.
The regression model parameters provide a predicted probability for each categori-
cal response (or equivalently, a set of cumulative probabilities). Visually, the probability
of a categorical response c is the area under a normal curve between two adjacent thresh-
olds in Figure 6.7. More formally, the expression for a predicted probability is

Pr(Yi = c) = Φ(τc − E(Yi* | Xi)) − Φ(τc−1 − E(Yi* | Xi))      (6.12)

where E(Yi*|Xi) is the predicted z-score based on a set of regressors in X, and Φ(·) is the
[Figure 6.7 appears here: latent work satisfaction scores (y-axis) plotted against leader–member exchange (x-axis), with dashed horizontal thresholds separating the regions for categories Y = 1 through Y = 7.]

FIGURE 6.7. Latent variable distributions for three participants with different values of
leader–member exchange. The black dots represent the means of the latent distributions, and
the horizontal dashed lines are threshold parameters that represent the z-score cutoff points in
the continuous distribution where discrete scores switch from one category to the next.

cumulative distribution function of the standard normal curve. The subtraction inside
each function centers the threshold at an individual’s conditional mean, and the func-
tion Φ returns the area below that result in a standard normal distribution. Subtracting
two lower-­tailed probabilities gives the area between thresholds. Equation 2.67 gives the
comparable expression for a binary outcome.
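The differencing in Equation 6.12 is easy to sketch in code. The function below is illustrative (not from the companion website); the threshold values are the posterior medians reported later in Table 6.2:

```python
import numpy as np
from scipy.stats import norm

def category_probs(eta, thresholds):
    """Pr(Y = c) for c = 1..C, per Equation 6.12.

    thresholds : interior cutpoints tau_1 < ... < tau_{C-1}; the outer
    bounds tau_0 = -inf and tau_C = +inf are appended internally.
    """
    tau = np.concatenate(([-np.inf], np.asarray(thresholds, float), [np.inf]))
    cdf = norm.cdf(tau - eta)   # Phi(tau_c - E(Y*|X)) at every cutpoint
    return np.diff(cdf)         # area between adjacent thresholds

# eta is an arbitrary predicted z-score; thresholds are from Table 6.2.
probs = category_probs(eta=1.2, thresholds=[0.0, 0.98, 1.95, 2.92, 3.72, 4.54])
```

The seven probabilities sum to 1, and for eta = 1.2 the third category (the region between τ2 and τ3) carries the most mass.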

Likelihood, Prior, and Posterior Distribution


The posterior distribution is a multivariate function that describes the relative probabil-
ity of different combinations of the regression coefficients, threshold parameters, and
latent response scores, given the data.

f(β, τ, Y* | data) ∝ f(β) × f(τ)
   × ∏(i=1 to N) exp[ −(1/2)(Yi* − (β0 + β1Xi))² ] × I(τc−1 < Yi* ≤ τc)^I(Yi = c)      (6.13)

The expression is like the one for the binary regression model, but the true–false indi-
cator functions change to accommodate additional response options and threshold
parameters. I adopt noninformative prior distributions for the regression coefficients
and thresholds (i.e., f(β) ∝ 1 and f(τ) ∝ 1), and the residual variance does not require a
prior, because its value is fixed.

MCMC Algorithm and Full Conditional Distributions


MCMC estimation for the binary and ordinal probit models is similar, but the latter
requires an additional step that estimates the threshold parameters. The recipe below
summarizes the algorithmic steps.

Assign starting values to all parameters, latent data, and missing values.
Do for t = 1 to T iterations.
> Estimate coefficients conditional on the latent data.
> Estimate thresholds conditional on coefficients and latent data.
> Estimate latent response scores conditional on the updated coefficients and
thresholds.
Repeat.

The estimation step for the coefficients draws random numbers from the multivariate
normal conditional distribution from Equation 6.6, and the final imputation step for the
latent response scores also mimics the binary model; MCMC samples latent imputations
from a specific region of the normal curve if the categorical response is observed (i.e.,
latent response scores are sampled from a truncated normal distribution), and it draws

scores from the entire distribution if the discrete response is missing. The posterior
predictive distribution below formalizes this idea in an equation:

f(Yi* | β, τ, data) =
   N1(E(Yi* | Xi), 1) × I(τc−1 < Yi* ≤ τc)      if Yi = c       (6.14)
   N1(E(Yi* | Xi), 1)                           if Yi is missing

Figure 6.8 shows a complete set of latent imputations, and a unique symbol denotes each
discrete response. As noted earlier, missing data imputation for categorical variables can
be viewed as drawing imputes in matched pairs, as the location of each latent score rela-
tive to the threshold parameters implies a corresponding discrete value.
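The matched-pair mapping amounts to counting how many thresholds a latent score exceeds. An illustrative one-line sketch (thresholds again taken from Table 6.2):

```python
import numpy as np

def latent_to_discrete(latent, thresholds):
    """Category c satisfies tau_{c-1} < Y* <= tau_c (Equation 6.14)."""
    return np.searchsorted(thresholds, latent, side='left') + 1

thresholds = [0.0, 0.98, 1.95, 2.92, 3.72, 4.54]   # posterior medians, Table 6.2
cats = latent_to_discrete(np.array([-0.4, 1.1, 5.0]), thresholds)  # -> [1, 3, 7]
```

`side='left'` places a latent score that falls exactly on a threshold in the lower category, matching the τc−1 < Yi* ≤ τc convention.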
The estimation step for the threshold parameters is the main new detail. A seminal
work of Albert and Chib (1993) describes an algorithm that draws threshold parameters
from a uniform distribution bounded on the low end by the highest latent score from the
region below the threshold and bounded on the high end by the lowest latent score from
the region above. For example, returning to Figure 6.8, their procedure draws τ2 (the
second dashed line from the bottom) from a uniform distribution spanning the narrow
[Figure 6.8 appears here: imputed latent work satisfaction scores (y-axis) plotted against leader–member exchange (x-axis), with dashed horizontal thresholds separating the regions for categories Y = 1 through Y = 7.]

FIGURE 6.8. Scatterplot of a set of imputed latent scores for a 7-point rating scale. Latent
scores for a given discrete response fall between two threshold parameters, denoted as horizontal
dashed lines.

vertical interval between the highest circle in the Y = 2 region and the lowest crosshair
from the Y = 3 region. Albert and Chib’s procedure converges and mixes slowly, because
the widths of the uniform intervals tend to be very small, thus limiting the amount
that thresholds can change from one iteration to the next (Cowles, 1996; Johnson &
Albert, 1999; Nandram & Chen, 1996). Fortunately, other algorithms provide much
better performance (Cowles, 1996; Nandram & Chen, 1996). I describe the procedure
from Cowles (1996), because it is common in software packages. Readers who are not
interested in these technical details can skip to the analysis example without losing
important information.
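As a sketch of Albert and Chib's uniform draw described above (hypothetical helper names; the sketch assumes both adjacent categories are observed):

```python
import numpy as np

def draw_threshold_ac(latent, y, c, rng=None):
    """Draw tau_c uniformly between adjacent categories' latent scores."""
    rng = rng or np.random.default_rng()
    lower = latent[y == c].max()       # highest latent score in category c
    upper = latent[y == c + 1].min()   # lowest latent score in category c + 1
    return rng.uniform(lower, upper)
```

Because `lower` and `upper` are typically separated by a sliver, the threshold barely moves from one iteration to the next, which is precisely the mixing problem noted above.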
Cowles (1996) described an algorithm that combines Gibbs sampling and a
Metropolis–­Hastings step (Gilks et al., 1996; Hastings, 1970) like the one described in
Section 5.6. Her algorithm first partitions the posterior distribution into two blocks of
unknowns: Regression coefficients form one set, and threshold parameters and latent
scores form the second. She further factors the distribution of τ and Y * into the product
of two univariate distributions as follows:

f(τ, Y* | β, data) ∝ f(Y* | τ, β, data) × f(τ | β, data)      (6.15)

Adopting a flat prior distribution for the thresholds (i.e., f(τ) ∝ 1) gives the conditional
posterior distribution of τ:

f(τ | β, data) ∝ ∏(i=1 to N) [ Φ(τYi − E(Yi* | Xi)) − Φ(τYi−1 − E(Yi* | Xi)) ]      (6.16)
where τYi and τYi –1 are the upper and lower threshold boundaries for person i’s categorical
response (e.g., if Y = 2, then τYi = τ2 and τYi –1 = τ1). The terms to the right of the product
operator correspond to the predicted probability expression from Equation 6.12, and
their product is an alternative expression for the likelihood that doesn’t require latent
scores.
Chapter 5 used the Metropolis–­Hastings algorithm to draw imputations from a
complex distribution, and Cowles uses the same approach to draw threshold parameters
from the previous distribution. The algorithm performs the following steps: (1) draws
candidate threshold parameters from a normal proposal distribution, (2) computes an
importance ratio that captures the relative height of the target function (Equation 6.16)
evaluated at the candidate and current threshold values, and (3) uses a random number
to accept or reject the candidate parameter values.
The algorithm’s first step draws candidate thresholds one at a time, each from a nor-
mal proposal distribution centered at the current estimate. I refer to each pair of thresh-
olds as τc(new) and τc(old), respectively. The proposal distribution’s standard deviation is
fixed at (or adaptively tuned to) a value that accepts new estimates at a rate of 25–50%
(Gelman et al., 2014; Johnson & Albert, 1999; Lynch, 2007). Finally, to ensure that the
thresholds maintain the correct rank order, the lower tail of τc’s proposal distribution is
truncated at the next lowest threshold (i.e., τc–1(new)), and its upper tail cannot exceed the
next highest threshold (i.e., τc+1(old)).
After drawing a set of candidate thresholds one at a time, the algorithm computes
the importance ratio as follows (recall that τ0 = –∞ and τC = ∞):

IR = ∏(i=1 to N) [ Φ(τYi(new) − E(Yi* | Xi)) − Φ(τYi−1(new) − E(Yi* | Xi)) ]
                / [ Φ(τYi(old) − E(Yi* | Xi)) − Φ(τYi−1(old) − E(Yi* | Xi)) ]      (6.17)
   × ∏(c=2 to C−1) [ Φ((τc+1(old) − τc(old)) / σMH) − Φ((τc−1(new) − τc(old)) / σMH) ]
                 / [ Φ((τc+1(new) − τc(new)) / σMH) − Φ((τc−1(old) − τc(new)) / σMH) ]

The first term is computed by substituting the candidate and current thresholds into the
likelihood expression from Equation 6.16, and the second term (which is often equal
to or close to 1) adjusts for the proposal distribution’s truncation points. Visually, the
importance ratio is the relative height of the target distribution at two sets of threshold
values. A ratio greater than 1 implies that the candidate thresholds in τ(new) are located
at a higher elevation on the target distribution than those in τ(old) (i.e., a more populated
region of the curve), and a ratio less than unity indicates that the candidate values have
moved to a lower elevation. To decide whether to keep the trial parameters, the algo-
rithm generates a random number from a binomial distribution with success rate equal
to the importance ratio. If the random draw is a “success,” the candidate thresholds
become the current parameters. Otherwise, τ(old) is used for another iteration.
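The likelihood evaluation and accept/reject decision just described can be sketched as follows, computed on the log scale for numerical stability (illustrative code, not the companion-website implementation):

```python
import numpy as np
from scipy.stats import norm

def log_ordinal_lik(eta, y, thresholds):
    """Log of Equation 6.16: sum over cases of the category probabilities."""
    tau = np.concatenate(([-np.inf], np.asarray(thresholds, float), [np.inf]))
    y = np.asarray(y)
    upper = norm.cdf(tau[y] - eta)      # Phi(tau_Yi - E(Y*|X))
    lower = norm.cdf(tau[y - 1] - eta)  # Phi(tau_{Yi-1} - E(Y*|X))
    return np.sum(np.log(upper - lower))

def accept_thresholds(ll_new, ll_old, log_adj=0.0, rng=None):
    """Accept candidates with probability min(1, IR); log_adj carries the
    proposal-truncation correction (the second term of Equation 6.17)."""
    rng = rng or np.random.default_rng()
    return np.log(rng.uniform()) < min(0.0, ll_new - ll_old + log_adj)
```

In a full implementation, `ll_new` and `ll_old` are Equation 6.16 evaluated at the candidate and current threshold vectors, respectively.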

Analysis Example
Expanding on the employee data example, I fit an ordered probit model that features
leader–­member exchange, employee empowerment, and a gender dummy code (0 =
female, 1 = male) as predictors of work satisfaction ratings. The latent variable regression
model is as follows:

WORKSATi* = β0 + β1 ( LMX i ) + β2 ( EMPOWER i ) + β3 ( MALEi ) + ε i (6.18)

The missing data rates are approximately 4.8% for the work satisfaction ratings, 4.1% for
the employee–supervisor relationship quality scale, and 16.2% for the employee empowerment scale. Consistent with previous examples, I used a factored regression specification for the incomplete regressors (see Equations 6.9 and 6.10). Analysis scripts are
available on the companion website, including a custom R program for readers who are
interested in learning to code the algorithm by hand.
This example is useful, because it highlights the importance of checking conver-
gence and mixing. Cowles’s (1996) method for updating threshold parameters is a sub-
stantial improvement over Albert and Chib’s (1993) classic approach, but these param-
eters still require very long burn-in periods. For most of the models we’ve worked with
thus far, MCMC converged very quickly, usually in fewer than 500 iterations. That is not
the case here. To illustrate, Figure 6.9 shows a trace plot of τ4 from the first 1,000 itera-
tions of two MCMC chains. The solid horizontal line is the posterior median from the
final 10,000 iterations, and the dashed lines are the corresponding 95% credible interval
limits. Overall, the trace plot is a far cry from some of the ideal graphs in Chapter 4. For
one, estimates are constrained to a narrow range and are not yet oscillating around a
[Figure 6.9 appears here: trace plot of threshold 4 (y-axis) across the first 1,000 iterations (x-axis).]

FIGURE 6.9. Trace plot of threshold τ4 from two MCMC chains that comprise 1,000 itera-
tions each. The solid horizontal line is the median estimate from the final posterior distribution,
and the dashed lines are the corresponding 95% credible interval limits. Comparing the second
halves of each chain yields a potential scale reduction factor of PSRF = 1.36.

stable mean. You can also see that the first halves of the two chains exhibit flat plateaus
where the Metropolis–Hastings algorithm is not accepting new estimates at the desired
rate. These flat spots dissipate after about 500 iterations, suggesting that the tuning
mechanism is producing a better acceptance rate.
Using the split-chain method (Gelman et al., 2014, pp. 284–285) to compare the
second halves of each chain in Figure 6.9 gives a potential scale reduction factor (PSRF)
value of PSRF = 1.36 (Gelman & Rubin, 1992), well above recommended cutoffs. Diag-
nostic runs indicated that all PSRF values dropped below 1.05 somewhere between
15,000 and 20,000 iterations, so I specified 30,000 computational cycles with a burn-in
period of 20,000 iterations. As a practical aside, the slow convergence is almost certainly
exacerbated by the relatively small number of observations in the lowest and highest
categories (see Figure 6.3). In practice, it may be necessary to collapse adjacent catego-
ries to minimize the impact of sparse data. Furthermore, data screening is especially
important for models with multiple categorical variables, because sparse contingency
tables are a common cause of convergence failures.
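For readers who want to compute the diagnostic by hand, the split-chain PSRF can be sketched as follows (a simplified version of the Gelman et al., 2014, formulas; exact software implementations may differ):

```python
import numpy as np

def psrf(chains):
    """Split-chain potential scale reduction factor.

    chains : 2-D array with one row per chain of equal length.
    """
    half = chains.shape[1] // 2
    splits = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    m, n = splits.shape
    means = splits.mean(axis=1)
    B = n * means.var(ddof=1)               # between-chain variance
    W = splits.var(axis=1, ddof=1).mean()   # within-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Chains sampling the same stationary distribution give values near 1; chains stuck at different locations inflate the between-chain variance and push the statistic well above common cutoffs.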
Table 6.2 gives the posterior summaries for the regression model parameters. In
the interest of space, I omit the supporting regressor model parameters, because they
are not the substantive focus. The intercept coefficient is the predicted latent work sat-

TABLE 6.2. Posterior Summary from an Ordinal Probit Regression


Parameter Mdn SD LCL UCL
β0 –0.09 0.31 –0.68 0.52
β1 (LMX) 0.14 0.02 0.11 0.17
β2 (EMPOWER) 0.04 0.01 0.01 0.06
β3 (MALE) 0.18 0.09   0.003 0.35
R2   .22 .03   .16 .28
τ1 0 — — —
τ2 0.98 0.12 0.75 1.23
τ3 1.95 0.13 1.70 2.19
τ4 2.92 0.14 2.63 3.17
τ5 3.72 0.15 3.42 3.99
τ6 4.54 0.17 4.19 4.87

Note. LCL, lower credible limit; UCL, upper credible limit.

isfaction score for a female employee with 0’s on the numeric predictors (essentially, the
lowest possible value of the leader–­member exchange and empowerment scales). Each
slope coefficient reflects the expected z-score change in the latent response variable for
a one-unit increase in the predictor, controlling for other regressors. For example, the
leader–­member exchange coefficient indicates that a one-unit increase in relationship
quality is expected to increase latent work satisfaction by 0.14 z-score units (Mdnβ1 =
0.14, SD = 0.02), holding other predictors constant. The R2 statistic for the overall model
indicates that the set of predictors explained approximately 22% of the variation in the
latent work satisfaction scores (McKelvey & Zavoina, 1975). Finally, although they are
not necessarily of substantive interest, Table 6.2 also summarizes the threshold param-
eters or z-score cutoffs that divide the latent response distribution into seven segments.

6.5 BINARY AND ORDINAL PREDICTOR VARIABLES

Revisiting ideas from Chapter 5, the factored regression specification defines the distri-
bution of an incomplete predictor variable as a composite function of two or more sets of
model parameters. The same is true for binary and ordinal predictors, but probit regres-
sions replace linear models. Switching gears to a different substantive context, I use
the smoking data from the companion website to illustrate missing data handling for a
binary predictor. The data set includes several sociodemographic correlates of smoking
intensity from a survey of N = 2000 young adults (e.g., age, whether a parent smoked,
gender, income). The model uses a parental smoking indicator (0 = parents did not smoke,
1 = parent smoked), age, and income to predict smoking intensity (a function of the number of cigarettes smoked per day). The model and its generic counterpart are as follows:

INTENSITYi = β0 + β1 ( PARSMOKEi ) + β2 ( INCOMEi − μ 2 ) + β3 ( AGEi − μ 3 ) + ε i (6.19)

Yi = β0 + β1X1i + β2X2i + β3X3i + εi
εi ~ N1(0, σ²ε)
I centered the income and age variables at their grand means to define the intercept as
the expected smoking intensity score for a respondent whose parents did not smoke.
The smoking intensity variable has 21.2% missing data, 3.6% of the parental smoking
indicator scores are missing, and 11.4% of the income values are unknown.

Factored Regression Specification


The factored regression specification used throughout the book readily accommodates
mixtures of categorical and continuous variables. Applying a sequential specification to
the smoking intensity analysis gives the following factorization:

f(INTENSITY | PARSMOKE, INCOME, AGE)
   × f(PARSMOKE* | INCOME, AGE) × f(INCOME | AGE) × f(AGE)      (6.20)

A probit regression defines the conditional distribution of the parental smoking indica-
tor, and the income and age distributions are linear models, as follows:

PARSMOKEi* = γ01 + γ11(INCOMEi) + γ21(AGEi) + r1i      (6.21)
INCOMEi = γ02 + γ12(AGEi) + r2i
AGEi = γ03 + r3i

As before, the variance of r1 is fixed at 1 to establish a metric, and the parental smok-
ing model also requires a single, fixed threshold parameter. Figure 6.10a shows a path
diagram of this specification. Following diagramming conventions from Edwards et al.
(2012), I use an oval and a rectangle to differentiate the latent variable and its categori-
cal indicator, respectively, and the broken arrow connecting the two is the link function
that maps the unobserved continuum to the discrete responses (e.g., the broken arrow
reflects the idea that latent and discrete scores interact via threshold parameters). Notice
that the latent response variable links to other regressors, but the binary parental smok-
ing indicator predicts the outcome.
The partially factored model specification instead assigns a multivariate normal
distribution to the regressors.

f(INTENSITY | PARSMOKE, INCOME, AGE) × f(PARSMOKE*, INCOME | AGE)      (6.22)
When predictors have different metrics, the normality assumption applies to numerical
and latent response variables. For example, the trivariate normal distribution for the
smoking intensity analysis is as follows:

⎡ PARSMOKEi* ⎤   ⎡ X1i* ⎤   ⎡ μ1 ⎤   ⎡ r1i ⎤
⎢ INCOMEi    ⎥ = ⎢ X2i  ⎥ = ⎢ μ2 ⎥ + ⎢ r2i ⎥      (6.23)
⎣ AGEi       ⎦   ⎣ X3i  ⎦   ⎣ μ3 ⎦   ⎣ r3i ⎦

⎡ X1i* ⎤       ⎛ ⎡ μ1 ⎤   ⎡ 1.0  σ12  σ13 ⎤ ⎞
⎢ X2i  ⎥ ~ N3 ⎜ ⎢ μ2 ⎥ , ⎢ σ21  σ22  σ23 ⎥ ⎟
⎣ X3i  ⎦       ⎝ ⎣ μ3 ⎦   ⎣ σ31  σ32  σ33 ⎦ ⎠

Notice that the first element in the covariance matrix is fixed at 1 to establish the
latent variable’s metric (the model also incorporates a single, fixed threshold for the

[Figure 6.10 appears here. Panel (a) shows the factored regression (sequential) specification, and panel (b) shows the partially factored specification. In both panels, the latent PARSMOKE* variable (oval) maps to its discrete PARSMOKE indicator (rectangle), and PARSMOKE, INCOME, and AGE predict INTENSITY.]

FIGURE 6.10. Path diagram of a factored regression (sequential) specification and partially
factored model specification.

latent response variable). Figure 6.10b shows a path diagram of this specification, with
curved arrows (correlated residuals) connecting predictors instead of direct pathways.
The latent response variable again connects to the other predictors, whereas the binary
parental smoking indicator predicts the outcome.
Modeling the multivariate normal distribution is not so easy here, because one
element of the covariance matrix is a fixed constant. More generally, this matrix could
contain fixed parameters, variances and covariances, and correlations between pairs of
latent response variables. This mixture of odds and ends makes it difficult to specify a
prior distribution and update the covariance matrix in a single step. One option is to
use a Metropolis–­Hastings algorithm to update covariance matrix elements one at a
time or in blocks (Asparouhov & Muthén, 2010a; Browne, 2006; Carpenter & Kenward,
2013), and another is to leverage a well-known property that a multivariate normal dis-
tribution’s parameters can be expressed as an equivalent set of linear regression models
(Arnold et al., 2001; Liu et al., 2014). The set of round-robin regression equations below
is an alternative parameterization for the multivariate distribution:

PARSMOKEi* = μ1 + γ11(INCOMEi − μ2) + γ21(AGEi − μ3) + r1i      (6.24)
INCOMEi = μ2 + γ12(AGEi − μ3) + γ22(PARSMOKEi* − μ1) + r2i
AGEi = μ3 + γ13(PARSMOKEi* − μ1) + γ23(INCOMEi − μ2) + r3i

Finally, note that the age variable is complete and does not require a distribution in
either the sequential or partially factored specifications.
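The well-known property invoked above is easy to verify numerically: the slopes in each round-robin equation are recoverable from the joint covariance matrix by solving a small linear system. The covariance values below are made up purely for illustration:

```python
import numpy as np

# Illustrative covariance matrix for (X1*, X2, X3); the 1.0 in the first
# diagonal position is the fixed latent-response variance.
S = np.array([[1.0, 0.3, 0.2],
              [0.3, 2.0, 0.5],
              [0.2, 0.5, 1.5]])

def conditional_slopes(S, dv):
    """Regression slopes for variable dv on the remaining variables."""
    others = [j for j in range(S.shape[0]) if j != dv]
    # Solve Sigma_oo * slopes = Sigma_od (normal equations)
    return np.linalg.solve(S[np.ix_(others, others)], S[others, dv])

slopes = conditional_slopes(S, 0)   # gamma_11 and gamma_21 in Equation 6.24
```

Repeating the calculation with dv = 1 and dv = 2 reproduces the remaining round-robin equations, which is why the two parameterizations are interchangeable.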

MCMC Algorithm and Distribution of Missing Values


The MCMC algorithm has many moving parts, all of which we’ve seen before. Using
the filled-­in data from the previous iteration, each new computational cycle begins with
an estimation sequence that updates the focal model parameters. Depending on the
outcome’s metric, this model could be a linear or probit regression. Next, MCMC cycles
through the incomplete predictors one at a time, estimating the supporting regression
models for each covariate. The composition of these models depends on whether you
adopt a sequential or partially factored specification, but each incomplete predictor
always appears as the outcome in one of these models (see Equations 6.21 and 6.24).
After updating all model parameters, MCMC performs an imputation step for every
incomplete variable. The focal model alone defines the distribution of the missing out-
come variable, and a Metropolis algorithm samples missing regressor scores from com-
plex multipart posterior predictive distributions.
Following ideas established in Chapter 5, the distribution of an incomplete cat-
egorical predictor depends on every model in which it appears. To illustrate, consider
the posterior predictive distribution of the parental smoking indicator. Using generic
notation, the distribution of missing values that conditions on all other variables is pro-
portional to the product of two univariate normal distributions, each of which aligns
with a regression model from Equation 6.21 or 6.24.

f(X1 | Y, X2, X3) ∝ f(Y | X1, X2, X3) × f(X1* | X2, X3)
   = N1(E(Yi | X1i, X2i, X3i), σ²ε) × N1(E(X1i* | X2i, X3i), 1)      (6.25)
Dropping unnecessary scaling terms and substituting the normal curve’s kernels into
the right side of the expression gives the following:

f(Yi | X1i, X2i, X3i) × f(X1i* | X2i, X3i) ∝
   exp[ −(1/2)(Yi − (β0 + β1X1i + β2X2i + β3X3i))² / σ²ε ]      (6.26)
   × exp[ −(1/2)(X1i* − (γ01 + γ11X2i + γ21X3i))² ]

Notice that missing values are like quantum objects that simultaneously exist in two
different states—X1 is a dummy code in the focal analysis model and a latent response
variable in its own model. Each latent score is unconstrained and can fall anywhere in
the normal curve, but its location relative to the threshold parameter fully determines
a corresponding discrete impute. For example, a latent imputation above the threshold
induces a discrete response of X1(mis) = 1 (e.g., parent was a smoker), and a latent impu-
tation below the threshold creates X1(mis) = 0 (e.g., parents were nonsmokers). The dual
nature of a categorical predictor makes deriving an analytic expression for the missing
values even more arduous and intractable than it was with continuous predictors, but
this isn’t a problem for the Metropolis–­Hastings algorithm.
Applying ideas from Section 5.6, the algorithm draws imputations by manipulating
the simpler component distributions (e.g., two normal curves). For each missing obser-
vation, the algorithm performs four steps. First, it samples a candidate latent imputation
from a normal proposal distribution centered at a person’s current latent response score.
As before, the proposal distribution’s standard deviation is fixed at or adaptively tuned
to a value that accepts candidate imputations at an optimal rate of 25–50% (Gelman et
al., 2014; Johnson & Albert, 1999; Lynch, 2007). Second, the algorithm uses the current
threshold estimates to convert the candidate latent impute to a discrete value (this exam-
ple requires a single threshold fixed at 0). For example, a latent imputation of X1*(new) =
1.27 converts to X1(new) = 1 (i.e., parent was a smoker), because it is above the threshold,
whereas a candidate value of X1*(new) = –.43 converts to X1(new) = 0 (i.e., parents were non-
smokers). Next, the importance ratio is a fraction that features the target function from
Equation 6.26 in both the numerator and denominator. The algorithm substitutes the
matched pair of candidate imputations (along with the necessary parameter and data
values) into the numerator, and it substitutes the current imputations into the denomi-
nator. Visually, the resulting ratio reflects the relative height of the target distribution
when evaluated at the candidate and current imputes. Finally, the algorithm generates a
random number from a binomial distribution with success rate equal to the importance
ratio. If the random draw is a “success,” the candidate imputations become new data for
the next iteration; otherwise, a participant’s current data carry forward for another cycle.
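Putting the four steps together for a single case, a hedged sketch might look like this (parameter values and helper names are hypothetical; Equation 6.26 supplies the target density on the log scale):

```python
import numpy as np

def log_target(x1_latent, x1, y, x2, x3, beta, gamma, sigma2_e):
    """Log of Equation 6.26: focal-model kernel times latent-probit kernel."""
    y_hat = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3
    x1_hat = gamma[0] + gamma[1] * x2 + gamma[2] * x3
    return -0.5 * (y - y_hat) ** 2 / sigma2_e - 0.5 * (x1_latent - x1_hat) ** 2

def mh_impute(x1_latent, y, x2, x3, beta, gamma, sigma2_e, sd_mh=1.0, rng=None):
    rng = rng or np.random.default_rng()
    cand_latent = rng.normal(x1_latent, sd_mh)   # step 1: propose a latent score
    cand_discrete = int(cand_latent > 0.0)       # step 2: threshold fixed at 0
    log_ir = (log_target(cand_latent, cand_discrete, y, x2, x3, beta, gamma, sigma2_e)
              - log_target(x1_latent, int(x1_latent > 0.0), y, x2, x3,
                           beta, gamma, sigma2_e))
    if np.log(rng.uniform()) < min(0.0, log_ir):  # steps 3-4: accept or reject
        return cand_latent, cand_discrete
    return x1_latent, int(x1_latent > 0.0)
```

Either branch returns a matched latent-discrete pair, so the discrete impute always agrees with the latent score's position relative to the threshold.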

TABLE 6.3. Posterior Summary of Regression with an Incomplete Binary Predictor

Parameter          Mdn      SD      LCL      UCL
β0                 8.78     0.11    8.56     9.01
β1 (PARSMOKE)      2.65     0.17    2.31     2.98
β2 (INCOME)       –0.13     0.03   –0.18    –0.08
β3 (AGE)           0.58     0.04    0.51     0.66
σε²               11.25     0.41   10.52    12.10
R²                  .25      .02     .22      .29

Note. LCL, lower credible limit; UCL, upper credible limit.

Analysis Example
Continuing with the smoking data example, I used Bayesian estimation to fit the linear
regression model from Equation 6.19. As explained in Section 5.3, a partially factored
specification that assigns a multivariate distribution to the predictors (or equivalently,
round-robin regressions) is ideally suited for models with centered predictors, because
the grand means are estimated parameters (see Equations 6.23 and 6.24). Following
earlier examples, I did not estimate a supporting model for respondent age, because this
variable is complete and does not require a distribution. The potential scale reduction
factors (Gelman & Rubin, 1992) from a preliminary diagnostic run indicated that the
MCMC algorithm converged in fewer than 300 iterations, so I used 11,000 total itera-
tions with a conservative burn-in period of 1,000 iterations. Analysis scripts are avail-
able on the companion website.
Table 6.3 gives the posterior summaries for the regression model parameters. In
the interest of space, I omit the supporting regressor model parameters, because they
are not the substantive focus. For a comparison, Table 3.6 shows corresponding maxi-
mum likelihood estimates, which were numerically equivalent. Centering the income
and age variables at their grand means defined the intercept as the expected smoking
intensity score for a respondent whose parents did not smoke (Mdnβ0 = 8.78, SDβ0 = 0.11),
and the interpretation of the slopes is unaffected by the categorical regressor, which
functions as a dummy code in the focal model. For example, the β1 coefficient indicates
that respondents with parents who smoked have smoking intensity scores that are 2.65
points (cigarettes per day) higher on average, controlling for income and age (Mdnβ1 =
2.65, SDβ1 = 0.17).

6.6 LATENT RESPONSE FORMULATION


FOR NOMINAL VARIABLES

The multinomial probit model extends the latent response variable framework to mul-
ticategorical nominal variables. The seminal work on this topic traces to the maximum
indicant model of Aitchison and Bennett (1970), and a mature body of research has
investigated data augmentation procedures that parallel binary and ordinal models
Bayesian Estimation for Categorical Variables 245

(Albert & Chib, 1993; Dyklevych, 2014; Imai & van Dyk, 2005; McCulloch & Rossi,
1994; McCulloch, Polson, & Rossi, 2000; Zhang, Boscardin, & Belin, 2008). The mul-
tinomial model is a popular method for handling incomplete nominal variables in the
multiple imputation framework (Carpenter et al., 2011; Carpenter & Kenward, 2013;
Enders et al., 2020; Enders, Keller, et al., 2018; Goldstein et al., 2009; Quartagno & Car-
penter, 2019), and the Bayesian estimation procedure described in this section and the
next is the backbone of that application.
Switching substantive gears, I use the chronic pain data on the companion website
to illustrate the latent formulation for nominal variables. The data set includes psy-
chological correlates of pain severity (e.g., depression, pain interference with daily life,
perceived control) for a sample of N = 275 individuals with chronic pain. I focus on a
categorical pain severity rating with C = 3 groups, indexed c = 1, 2, and 3. The first cat-
egory comprises participants who reported none, very little, or little pain (20.8%), the sec-
ond group comprises individuals with moderate pain (47.5%), and the third bin includes
participants with severe or very severe pain (31.8%). Although the categories are ordered,
treating this variable as nominal makes sense, because the bins likely reflect quantita-
tive and qualitative differences.
The multinomial model specifies an underlying latent variable—­called indicants
or utilities—for each categorical response. The set of empty latent variable regression
models for this example is

$$
\begin{aligned}
U_{1i}^{*} &= \mu_{1} + \zeta_{1i} \\
U_{2i}^{*} &= \mu_{2} + \zeta_{2i} \\
U_{3i}^{*} &= \mu_{3} + \zeta_{3i}
\end{aligned} \qquad (6.27)
$$

where Uc*i is latent indicant c for participant i (c = 1, 2, . . . , C), μc is the latent grand
mean of utility variable c, and ζci is a residual. Applied to the chronic pain data, the U *’s
represent a participant’s latent propensity to endorse each pain severity rating. The indi-
cants are often uncorrelated by assumption with variances fixed at some arbitrary value
to establish a metric (Carpenter & Kenward, 2013). I adopt the following distribution
for the indicants:
$$
\begin{bmatrix} U_{1i}^{*} \\ U_{2i}^{*} \\ U_{3i}^{*} \end{bmatrix}
\sim N_{3}\!\left(
\begin{bmatrix} \mu_{1} \\ \mu_{2} \\ \mu_{3} \end{bmatrix},
\begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}
\right) \qquad (6.28)
$$
 
As you will see, setting the diagonal elements of the variance–­covariance matrix to 0.5
gives a convenient scaling result that links to the binary and ordinal models.
Unlike models for ordered (or binary) categories, the multinomial model does not
incorporate threshold parameters. Rather, the maximum utility score or maximum indi-
cant determines a participant’s categorical response. For example, a participant with no
or little pain (i.e., c = 1) must have a U1* score that exceeds U2* and U3*, an individual
with moderate pain must have U2* as the highest utility score, and U3* is the maximum
indicant for participants with severe pain. Formally, this rule is
$$
Y_{i} = c \quad \text{if} \quad \max\!\left( U_{1i}^{*}, \ldots, U_{Ci}^{*} \right) = U_{ci}^{*} \qquad (6.29)
$$

where Y is the discrete variable, and c indexes its response options.
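A minimal sketch of this rule, with made-up utility scores for a single participant:

```python
def category_from_utilities(utilities):
    """Return the 1-based category whose latent utility is the maximum (Equation 6.29)."""
    return max(range(len(utilities)), key=lambda c: utilities[c]) + 1

# Hypothetical utility scores: U1* = -0.3, U2* = 1.1, U3* = 0.4
print(category_from_utilities([-0.3, 1.1, 0.4]))  # -> 2 (moderate pain)
```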


Modeling the full set of indicant variables is unnecessary and redundant. Following
the logic of dummy and effect coding, the utilities can be cast as a set of C – 1 latent dif-
ference scores. Consistent with a discrete coding scheme, the difference scores require
a reference or base category. I arbitrarily use the first group as the reference (i.e., none,
very little, or little pain), which gives the following latent difference scores:

$$
\begin{aligned}
D_{2i}^{*} &= U_{2i}^{*} - U_{1i}^{*} \\
D_{3i}^{*} &= U_{3i}^{*} - U_{1i}^{*}
\end{aligned} \qquad (6.30)
$$

Substantively, the difference scores reflect the underlying propensity to endorse the
second or third category relative to the reference. For example, a positive value of
D2* implies that a participant is more likely to report moderate pain than no or little
pain, whereas a negative difference score indicates that the reference category is more
likely.
Because subtracting two normally distributed variables gives another normal vari-
able, the latent difference scores also follow a multivariate normal distribution. The fol-
lowing pair of empty regression models summarize the latent difference scores:

$$
\mathbf{D}_{i}^{*} = \begin{bmatrix} D_{2i}^{*} \\ D_{3i}^{*} \end{bmatrix}
= \begin{bmatrix} \beta_{02} \\ \beta_{03} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{2i} \\ \varepsilon_{3i} \end{bmatrix} \qquad (6.31)
$$
$$
\begin{bmatrix} D_{2i}^{*} \\ D_{3i}^{*} \end{bmatrix}
\sim N_{2}\!\left(
\begin{bmatrix} \beta_{02} \\ \beta_{03} \end{bmatrix},
\begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix}
\right)
$$

Notice that the average difference scores, β02 and β03, are unknown parameters, and
the entire covariance matrix is fixed to establish a metric. The values in the variance–
covariance matrix S_D* are a consequence of adopting the diagonal covariance matrix for
the utility variables in Equation 6.28. This result is convenient, because it mimics the
scaling of the binary and ordinal probit models.
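Specifically, the fixed values follow from the variance algebra of the latent difference scores, using the diagonal covariance matrix in Equation 6.28:

$$
\begin{aligned}
\mathrm{Var}\!\left( D_{2i}^{*} \right) &= \mathrm{Var}\!\left( U_{2i}^{*} \right) + \mathrm{Var}\!\left( U_{1i}^{*} \right) = 0.5 + 0.5 = 1.0 \\
\mathrm{Cov}\!\left( D_{2i}^{*}, D_{3i}^{*} \right) &= \mathrm{Cov}\!\left( U_{2i}^{*} - U_{1i}^{*},\; U_{3i}^{*} - U_{1i}^{*} \right) = \mathrm{Var}\!\left( U_{1i}^{*} \right) = 0.5
\end{aligned}
$$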
The categorization rule in Equation 6.29—the maximum utility determines one’s
discrete response—­readily translates to the difference score metric, although the rule
is slightly more complicated and depends on whether a respondent belongs to the ref-
erence group. To refresh, I assigned the lowest category as the reference. Returning to
Equation 6.30, you can see that all latent difference scores must be negative for this
group, because U1* is the highest utility score. In contrast, members of the other groups
must have at least one positive difference score, and the maximum of the difference
scores determines the categorical response. For individuals with moderate pain, D2*
must be positive, because U2* > U1* and it must exceed D3*, because U3* < U2* (i.e., U2*
– U1* > U3* – U1*). Similarly, for respondents with severe pain, D3* must be positive and
greater than D2*. The linkage between the latent and discrete responses is summarized
as follows:

$$
Y_{i} =
\begin{cases}
1 & \text{if } \max\!\left( D_{2i}^{*}, \ldots, D_{Ci}^{*} \right) < 0 \\
2 & \text{if } D_{2i}^{*} = \max\!\left( D_{2i}^{*}, \ldots, D_{Ci}^{*} \right) \text{ and } D_{2i}^{*} > 0 \\
\;\vdots \\
C & \text{if } D_{Ci}^{*} = \max\!\left( D_{2i}^{*}, \ldots, D_{Ci}^{*} \right) \text{ and } D_{Ci}^{*} > 0
\end{cases} \qquad (6.32)
$$

As an aside, these rules highlight that the multinomial specification is equivalent to the
binary probit model when there are only two groups, in which case a single latent differ-
ence variable is negative for the reference group and positive for the comparison group
(i.e., latent scores below and above the threshold, respectively).
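The classification rule translates into a few lines of code. This sketch assumes the difference scores are supplied as a list ordered D2*, ..., DC*:

```python
def category_from_differences(d):
    """Map latent difference scores (D2*, ..., DC*) to a discrete response (Equation 6.32).

    d: list of C - 1 latent difference scores, relative to the reference group.
    Returns 1 (the reference category) when every difference is negative;
    otherwise returns the category whose difference score is the positive maximum.
    """
    d_max = max(d)
    if d_max < 0:
        return 1                       # reference category
    return d.index(d_max) + 2          # categories 2, ..., C

# Latent scores D2* = 0.24 and D3* = 1.26 imply the third category,
# because D3* is positive and larger than D2*
print(category_from_differences([0.24, 1.26]))  # -> 3
```

With a single difference score (two groups), the function reduces to the binary rule: a negative score returns the reference category and a positive score returns the comparison category.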
The binary and ordinal probit models defined category proportions as an area under
the normal curve between two z-score cutoffs. The nominal framework is more compli-
cated, because areas under a multivariate normal distribution define the group propor-
tions (i.e., multidimensional integrals). To illustrate, consider a hypothetical analysis
with three groups and mean difference scores equal to 0 (i.e., β02 = β03 = 0). Figure
6.11 shows the contour plot of the bivariate normal difference score distribution. The
[Figure 6.11 graphic: Latent Difference Score D2* on the horizontal axis by Latent Difference Score D3* on the vertical axis, both spanning –4 to 4.]

FIGURE 6.11. Contour plot of the bivariate normal difference score distribution. The vertical
and horizontal dashed lines show the location of the grand means, and the distribution’s peak is
located at their intersection. The reference group has latent scores in the dark-shaded region, the
second group has latent scores in the lightly shaded area, and the third group has latent scores in
the unshaded region of the distribution.
graph conveys the perspective of a drone hovering over the peak of a three-­dimensional
bell curve, with smaller contours denoting higher elevation. The vertical and horizontal
dashed lines show the location of the grand means, and the distribution’s peak is located
at their intersection. The group proportions correspond to different areas under the
three-dimensional surface. Following the rules in Equation 6.32, the reference group's
probability corresponds to the area under the dark-shaded region of the surface where
both difference scores are negative. This probability is Pr(D2* < 0 & D3* < 0) = .33. The
second category's probability, Pr(D2* > 0 & D2* > D3*) = .33, is the area under the lightly
shaded region of the surface, and the third group's proportion corresponds to the unshaded
area, Pr(D3* > 0 & D3* > D2*) = .33. The equal three-way split is exactly what symmetry
demands, because setting both latent means to 0 makes the three underlying utilities
exchangeable. A number of algorithms are available for computing areas under the
multivariate normal distribution (Genz, 1993; Genz et al., 2019; Mi, Miwa, & Hothorn,
2009), and I used an R function to determine these values.
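These areas can also be approximated by brute force with simulated difference scores. The sketch below is illustrative rather than the routine used for the reported values; a hand-coded Cholesky factor reproduces the fixed correlation of .5, and the default means of 0 correspond to the hypothetical example:

```python
import math
import random

def simulate_proportions(mu2=0.0, mu3=0.0, n=100_000):
    """Monte Carlo approximation of the three group proportions implied by
    bivariate normal difference scores with unit variances and correlation .5.
    mu2 and mu3 are the latent difference score means (beta_02 and beta_03)."""
    counts = [0, 0, 0]
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        d2 = mu2 + z1
        d3 = mu3 + 0.5 * z1 + math.sqrt(0.75) * z2  # induces corr(D2*, D3*) = .5
        if d2 < 0 and d3 < 0:
            counts[0] += 1   # reference group (both differences negative)
        elif d2 > d3:
            counts[1] += 1   # category 2 (D2* is the positive maximum)
        else:
            counts[2] += 1   # category 3 (D3* is the positive maximum)
    return [c / n for c in counts]

random.seed(1)
print(simulate_proportions())  # equal means split the surface into equal thirds
```

Passing nonzero means (e.g., the estimated grand means from an analysis) approximates the corresponding model-predicted proportions.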

6.7 REGRESSION WITH A NOMINAL OUTCOME

Having established the key ideas behind the latent formulation for multicategorical
nominal variables, I extend the previous example to include exercise frequency, per-
ceived control over pain, and a gender dummy code (0 = female, 1 = male) as predictors
of the categorical pain ratings. The multivariate regression model and its generic coun-
terpart are shown below:

$$
\begin{bmatrix} \text{MODERATE}_{i}^{*} \\ \text{SEVERE}_{i}^{*} \end{bmatrix}
= \begin{bmatrix} \beta_{02} \\ \beta_{03} \end{bmatrix}
+ \begin{bmatrix} \beta_{12} \\ \beta_{13} \end{bmatrix} \left( \text{EXERCISE}_{i} \right)
+ \begin{bmatrix} \beta_{22} \\ \beta_{23} \end{bmatrix} \left( \text{CONTROL}_{i} \right)
+ \begin{bmatrix} \beta_{32} \\ \beta_{33} \end{bmatrix} \left( \text{MALE}_{i} \right)
+ \begin{bmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \end{bmatrix} \qquad (6.33)
$$
$$
\begin{bmatrix} D_{2i}^{*} \\ D_{3i}^{*} \end{bmatrix}
= \begin{bmatrix} \beta_{02} \\ \beta_{03} \end{bmatrix}
+ \begin{bmatrix} \beta_{12} \\ \beta_{13} \end{bmatrix} X_{1i}
+ \begin{bmatrix} \beta_{22} \\ \beta_{23} \end{bmatrix} X_{2i}
+ \begin{bmatrix} \beta_{32} \\ \beta_{33} \end{bmatrix} X_{3i}
+ \begin{bmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \end{bmatrix}
= E\!\left( \mathbf{D}_{i}^{*} \mid \mathbf{X}_{i} \right) + \boldsymbol{\varepsilon}_{i}
$$
$$
\begin{bmatrix} D_{2i}^{*} \\ D_{3i}^{*} \end{bmatrix}
\sim N_{2}\!\left( E\!\left( \mathbf{D}_{i}^{*} \mid \mathbf{X}_{i} \right),
\begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix} \right)
$$

As before, MODERATE * and SEVERE * are latent difference scores contrasting a continu-
ous proclivity for moderate and severe pain ratings relative to a no or little pain rating.
Notice that the regressors exert a unique influence on each latent difference score, and
the residual covariance matrix is now fixed at deterministic values to scale the latent
response variables. The missing data rates for the pain severity and exercise frequency
variables are 7.3 and 1.8%, respectively, and the remaining predictors are complete.

MCMC Algorithm and Full Conditional Distributions


The posterior distribution is a multivariate function that describes the relative prob-
ability of different combinations of the coefficients and latent response scores given the
data and fixed covariance matrix.
$$
\begin{aligned}
f\!\left( \mathbf{D}^{*}, \boldsymbol{\beta} \mid \mathbf{S}_{D^{*}}, \text{data} \right) \propto\;
& f\!\left( \boldsymbol{\beta} \right) \times \prod_{i=1}^{N}
\exp\!\left[ -\frac{1}{2} \left( \mathbf{D}_{i}^{*} - E\!\left( \mathbf{D}_{i}^{*} \mid \mathbf{X}_{i} \right) \right)'
\mathbf{S}_{D^{*}}^{-1}
\left( \mathbf{D}_{i}^{*} - E\!\left( \mathbf{D}_{i}^{*} \mid \mathbf{X}_{i} \right) \right) \right] \\
& \times \left[ I\!\left( \max\!\left( \mathbf{D}_{i}^{*} \right) < 0 \right) I\!\left( Y_{i} = 1 \right)
+ I\!\left( \max\!\left( \mathbf{D}_{i}^{*} \right) = D_{c}^{*} \right) I\!\left( D_{c}^{*} > 0 \right) I\!\left( Y_{i} > 1 \right) \right]
\end{aligned} \qquad (6.34)
$$
The term to the right of the product indicator is the kernel of the multivariate normal
distribution (i.e., the likelihood sans unnecessary scaling constants), and the collection
of true–false indicator functions enforce the categorization rule from Equation 6.32 and
ensure that an observation contributes to the likelihood only if its latent scores possess
the correct magnitude and rank order. As always, I adopt a flat prior for the coefficients
(i.e., f(β) ∝ 1).
MCMC estimation follows a predictable two-step recipe that mimics the one from
Section 6.3: Estimate the regression coefficients given the current latent data, then
update the latent scores given the discrete responses and new coefficients. For complete-
ness, I give the full conditional distributions below, and readers who are not interested
in these technical details can skip to the analysis example without losing important
information.
First, the MCMC algorithm estimates regression coefficients by drawing a vector of
random numbers from the multivariate normal conditional distribution:

$$
\begin{aligned}
f\!\left( \boldsymbol{\beta} \mid \mathbf{D}^{*}, \mathbf{X} \right)
&\propto N_{(K+1)(C-1)}\!\left( \operatorname{vec}\!\left( \hat{\boldsymbol{\beta}} \right), \mathbf{S}_{\hat{\beta}} \right) \\
\hat{\boldsymbol{\beta}} &= \left( \mathbf{X}'\mathbf{X} \right)^{-1} \mathbf{X}' \mathbf{D}^{*} \\
\mathbf{S}_{\hat{\beta}} &= \mathbf{S}_{D^{*}} \otimes \left( \mathbf{X}'\mathbf{X} \right)^{-1}
\end{aligned} \qquad (6.35)
$$

where K is the number of predictors, N(K+1)(C–1) denotes a normal distribution with
(K + 1) × (C – 1) dimensions or variables (i.e., the total number of coefficients in the
multivariate regression), D* is the N × (C – 1) matrix of latent scores, and X denotes the
N × (K + 1) matrix of explanatory variables that includes a column of ones for the intercept.
The equations work as follows: First, the expression for β̂ yields a matrix of coef-
ficients with one column per latent difference score. The normal distribution’s dimen-
sions, (K + 1) × (C – 1), arise from applying a “vec” operation that stacks the columns of
β̂ into a single vector (e.g., for this example, a column vector with eight elements). The
covariance matrix of the coefficients is usually computed by multiplying the residual
variance by the inverse of the sum of squares and cross-­products matrix. Instead, the ⊗
symbol in the bottom equation is a Kronecker product that multiplies each element of
the residual covariance matrix by this inverse matrix. The result is a covariance matrix
with (K + 1) × (C – 1) rows and columns. Finally, after drawing a vector of coefficients,
the algorithm unpacks the updated estimates into a new β matrix.
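To make the dimension bookkeeping concrete, here is a pure-Python sketch of the vec and Kronecker operations with this example's dimensions (K = 3 predictors and C = 3 categories); the numeric values are arbitrary placeholders, and the identity matrix stands in for (X'X)^-1:

```python
def vec(B):
    """Stack the columns of matrix B (a list of rows) into a single vector."""
    rows, cols = len(B), len(B[0])
    return [B[i][j] for j in range(cols) for i in range(rows)]

def kron(A, B):
    """Kronecker product: each element of A multiplies the entire matrix B."""
    out = []
    for arow in A:
        for brow in B:
            out.append([a * b for a in arow for b in brow])
    return out

# Placeholder coefficient matrix: (K + 1) x (C - 1) = 4 x 2
beta_hat = [[0.50, 0.17],
            [-0.09, -0.28],
            [-0.03, -0.06],
            [-0.13, 0.48]]
sigma_d = [[1.0, 0.5],
           [0.5, 1.0]]     # fixed residual covariance of the difference scores
xtx_inv = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

v = vec(beta_hat)               # 8-element vector of stacked coefficients
cov = kron(sigma_d, xtx_inv)    # 8 x 8 covariance matrix of the coefficients
print(len(v), len(cov), len(cov[0]))  # -> 8 8 8
```

Sampling the eight coefficients then amounts to one draw from an eight-dimensional normal distribution with mean `v` and covariance `cov`.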
If the discrete response is observed, new latent difference scores are drawn from a
bivariate normal distribution shown in the bottom row of Equation 6.33, subject to the
ordering constraints from Equation 6.32 (i.e., each pair of latent imputations must be
logically consistent with the categorical response). Consistent with binary and ordinal
models where latent imputes are restricted to a particular region of the normal curve,
the imputation step for a multinomial model samples pairs of latent difference scores
from a particular region under a multivariate surface like that in Figure 6.11. The MCMC
algorithm can’t specify the magnitude and rank ordering of the latent difference scores
if the discrete response is missing, so it instead draws pairs of latent difference scores
with no restrictions. The relative magnitude and configuration of the latent imputations
again induces a corresponding discrete impute. For example, a participant with imputed
values of D2*(mis) = 0.24 and D3*(mis) = 1.26 would be classified in the severe pain group
(i.e., Y(mis) = 3), because D3*(mis) is positive and larger than D2*(mis).

Analysis Example
Continuing with the chronic pain data example, I used Bayesian estimation to fit the
multinomial regression model in Equation 6.33. Applying established ideas, I used a
factored regression specification to assign a distribution to the incomplete predictor. A
fully sequential specification uses the following factorization:

$$
\begin{aligned}
& f\!\left( \text{PAIN}^{*} \mid \text{EXERCISE}, \text{CONTROL}, \text{MALE} \right) \times f\!\left( \text{EXERCISE} \mid \text{CONTROL}, \text{MALE} \right) \\
& \times f\!\left( \text{CONTROL} \mid \text{MALE} \right) \times f\!\left( \text{MALE} \right)
\end{aligned} \qquad (6.36)
$$
The exercise distribution translates into the following linear regression model:

$$
\text{EXERCISE}_{i} = \gamma_{01} + \gamma_{11}\!\left( \text{CONTROL}_{i} \right) + \gamma_{21}\!\left( \text{MALE}_{i} \right) + r_{1i} \qquad (6.37)
$$

The perceived control over pain and gender dummy codes (the rightmost pair of terms)
are complete and do not require a distribution. The potential scale reduction factor
(­Gelman & Rubin, 1992) diagnostic indicated that the MCMC algorithm converged in
fewer than 500 iterations, so I used 11,000 total iterations with a conservative burn-in
period of 1,000 iterations. Analysis scripts are available on the companion website.
Table 6.4 gives the posterior summaries for the focal analysis model. In the interest
of space, I omit the supporting regressor model parameters, because they are not the
substantive focus. To facilitate graphing, I centered the predictors such that the latent
difference score means are marginal or overall effects. Figure 6.12 shows the contour
plot of the bivariate normal distribution with the estimated latent differences scores
from one iteration overlaid on the surface. The graph conveys the perspective of a drone
hovering over the peak of a three-­dimensional bell curve, with smaller contours denot-
ing higher elevation. The vertical and horizontal dashed lines show the location of the
grand means, and the distribution’s peak is located at their intersection. Latent scores
for the reference group are located in the dark-­shaded region of the surface where both
difference scores are negative. The group with moderate pain severity has latent differ-
ence scores in the lightly shaded region of the graph (i.e., the area where D2* > 0 & D2* >
D3*), and the high-­severity group’s latent scores are in the unshaded area of the surface
(i.e., the region where D3* > 0 & D3* > D2*).
The posterior medians of the D2* and D3* grand means were Mdnβ02 = 0.50 (SD =
0.09) and Mdnβ03 = 0.17 (SDβ03 = 0.11), respectively. Because the latent response variables
are scaled as z-scores, the positive mean values indicate that moderate and severe pain
ratings are more likely than mild pain ratings (e.g., the latent propensity for indicating a
moderate pain rating is approximately 0.50 z-score units higher than that of the reference
group). As mentioned previously, the model-­predicted group proportions correspond to
TABLE 6.4. Posterior Summary from a Nominal Probit Regression

Parameter          Mdn      SD      LCL      UCL
Moderate versus no/little pain
β02                0.50     0.09    0.32     0.69
β12 (EXERCISE)    –0.09     0.05   –0.19     0.02
β22 (CONTROL)     –0.03     0.02   –0.06     0.01
β32 (MALE)        –0.13     0.19   –0.50     0.25
Severe versus no/little pain
β03                0.17     0.11   –0.05     0.39
β13 (EXERCISE)    –0.28     0.07   –0.41    –0.15
β23 (CONTROL)     –0.06     0.02   –0.10    –0.02
β33 (MALE)         0.48     0.21    0.07     0.88

Note. LCL, lower credible limit; UCL, upper credible limit.


[Figure 6.12 graphic: Latent Difference Score D2* (Moderate vs. Little) on the horizontal axis by Latent Difference Score D3* (Severe vs. Little) on the vertical axis, both spanning –4 to 4.]

FIGURE 6.12. Contour plot of the bivariate normal distribution of difference scores with the
estimated latent variable scores from one iteration overlaid on the surface. The vertical and hori-
zontal dashed lines show the location of the grand means, and the distribution’s peak is located
at their intersection. The shaded regions partition the distribution into segments that contain the
latent scores for the three pain severity groups.
different areas under a bivariate normal distribution with these means. Following the
rules in Equation 6.32, the reference group’s probability corresponds to the area under
the dark-­shaded region of the surface where both difference scores are negative, Pr(D2* <
0 & D3* < 0) = .21. The second category’s probability, Pr(D2* > 0 & D2* > D3*) = .51, is the
area under the lightly shaded region of the surface, and the third group’s proportion cor-
responds to the unshaded area, Pr(D3* > 0 & D3* > D2*) = .29. A number of algorithms are
available for computing areas under the multivariate normal distribution (Genz, 1993;
Genz et al., 2019; Mi et al., 2009), and I used an R function to determine these values.
Each slope coefficient reflects the expected z-score change in the latent difference
variable for a one-unit increase in the predictor, controlling for other regressors. The
largest effects were associated with the comparison of severe pain versus little or no
pain. For example, male respondents had an average latent difference score that was
0.48 z-score units higher than that of females (Mdnβ33 = 0.48, SDβ33 = 0.21), meaning that
men were more likely to report severe pain than women. The negative slope coefficients
for exercise frequency and perceived control over pain mean that an increase in either
variable is associated with a lower probability of a severe pain rating.

6.8 NOMINAL PREDICTOR VARIABLES

The procedure for imputing multicategorical predictor variables parallels that for binary
and ordinal covariates (and continuous explanatory variables, for that matter). Return-
ing to the smoking intensity analysis from Section 6.5, I modify the linear regression
model by replacing the respondent’s household income with a three-­category education
variable (1 = less than high school, 2 = high school or some college, and 3 = bachelor’s degree
or higher). The model and its generic counterpart are shown below:

$$
\begin{aligned}
\text{INTENSITY}_{i} &= \beta_{0} + \beta_{1}\!\left( \text{PARSMOKE}_{i} \right) + \beta_{2}\!\left( \text{HS}_{i} \right) + \beta_{3}\!\left( \text{BACH}_{i} \right) + \beta_{4}\!\left( \text{AGE}_{i} \right) + \varepsilon_{i} \\
Y_{i} &= \beta_{0} + \beta_{1} X_{1i} + \beta_{2} D_{2i} + \beta_{3} D_{3i} + \beta_{4} X_{4i} + \varepsilon_{i} \\
\varepsilon_{i} &\sim N_{1}\!\left( 0, \sigma_{\varepsilon}^{2} \right)
\end{aligned} \qquad (6.38)
$$
where HS and BACH (or D2 and D3) are dummy codes contrasting the high school and
bachelor’s degree groups with the less-than-high-­school comparison group. The smok-
ing intensity variable has 21.2% missing data, 3.6% of the parental smoking indicator
scores are missing, and 5.4% of the education values are unknown.

Factored Regression Specification


The factored regression specification in use throughout the book readily accommodates
nominal predictors. Applying a sequential specification to the smoking intensity analy-
sis gives the following factorization:
$$
\begin{aligned}
& f\!\left( \text{INTENSITY} \mid \text{PARSMOKE}, \text{HS}, \text{BACH}, \text{AGE} \right) \times f\!\left( \text{PARSMOKE}^{*} \mid \text{HS}, \text{BACH}, \text{AGE} \right) \\
& \times f\!\left( \text{HS}^{*}, \text{BACH}^{*} \mid \text{AGE} \right) \times f\!\left( \text{AGE} \right)
\end{aligned} \qquad (6.39)
$$
These generic expressions translate into a binary probit model for the parental smok-
ing indicator, a multinomial regression for the educational attainment categories, and a
linear model for age.

$$
\begin{aligned}
\text{PARSMOKE}_{i}^{*} &= \gamma_{01} + \gamma_{11}\!\left( \text{HS}_{i} \right) + \gamma_{21}\!\left( \text{BACH}_{i} \right) + \gamma_{31}\!\left( \text{AGE}_{i} \right) + r_{1i} \\
\begin{bmatrix} \text{HS}_{i}^{*} \\ \text{BACH}_{i}^{*} \end{bmatrix}
&= \begin{bmatrix} \gamma_{02} \\ \gamma_{03} \end{bmatrix}
+ \begin{bmatrix} \gamma_{12} \\ \gamma_{13} \end{bmatrix} \left( \text{AGE}_{i} \right)
+ \begin{bmatrix} r_{2i} \\ r_{3i} \end{bmatrix} \\
\text{AGE}_{i} &= \gamma_{04} + r_{4i}
\end{aligned} \qquad (6.40)
$$

Paralleling the dummy variable coding scheme, HS * and BACH * are latent difference
scores contrasting the two higher categories with the less-than-high-­school comparison
group.
The partially factored model specification instead uses a multivariate normal dis-
tribution for the predictors.

$$
f\!\left( \text{INTENSITY} \mid \text{PARSMOKE}, \text{HS}, \text{BACH}, \text{AGE} \right)
\times f\!\left( \text{PARSMOKE}^{*}, \text{HS}^{*}, \text{BACH}^{*}, \text{AGE} \right) \qquad (6.41)
$$
Following an earlier example, the multivariate normality assumption applies to numer-
ical and latent response variables. The four-­dimensional normal distribution for the
regressors is as follows:

$$
\begin{bmatrix} \text{PARSMOKE}_{i}^{*} \\ \text{HS}_{i}^{*} \\ \text{BACH}_{i}^{*} \\ \text{AGE}_{i} \end{bmatrix}
= \begin{bmatrix} X_{1i}^{*} \\ D_{2i}^{*} \\ D_{3i}^{*} \\ X_{4i} \end{bmatrix}
= \begin{bmatrix} \mu_{1} \\ \mu_{2} \\ \mu_{3} \\ \mu_{4} \end{bmatrix}
+ \begin{bmatrix} r_{1i} \\ r_{2i} \\ r_{3i} \\ r_{4i} \end{bmatrix} \qquad (6.42)
$$
$$
\begin{bmatrix} X_{1i}^{*} \\ D_{2i}^{*} \\ D_{3i}^{*} \\ X_{4i} \end{bmatrix}
\sim N_{4}\!\left(
\begin{bmatrix} \mu_{1} \\ \mu_{2} \\ \mu_{3} \\ \mu_{4} \end{bmatrix},
\begin{bmatrix}
1.0 & \rho_{12} & \rho_{13} & \sigma_{14} \\
\rho_{21} & 1.0 & 0.5 & \sigma_{24} \\
\rho_{31} & 0.5 & 1.0 & \sigma_{34} \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_{4}^{2}
\end{bmatrix}
\right)
$$

Notice that the three latent response variables have their variances fixed at 1 to establish
a scale, and the latent difference score correlation is fixed to 0.5 like before.
As mentioned previously, modeling the distribution’s covariance matrix is not
straightforward, because it contains a mixture of fixed parameters, variances and
covariances, and correlations between pairs of latent response variables. One option
is to use a Metropolis–­Hastings algorithm to update covariance matrix elements one
at a time or in blocks (Asparouhov & Muthén, 2010a; Browne, 2006; Carpenter &
Kenward, 2013), and another is to parameterize the multivariate normal distribution
as a set of round-robin regression equations (Bartlett et al., 2015; Enders et al., 2020;
Goldstein et al., 2014).
$$
\begin{aligned}
X_{1i}^{*} &= \mu_{1} + \gamma_{11}\!\left( D_{2i}^{*} - \mu_{2} \right) + \gamma_{21}\!\left( D_{3i}^{*} - \mu_{3} \right) + \gamma_{31}\!\left( X_{4i} - \mu_{4} \right) + r_{1i} \\
\begin{bmatrix} D_{2i}^{*} \\ D_{3i}^{*} \end{bmatrix}
&= \begin{bmatrix} \mu_{2} \\ \mu_{3} \end{bmatrix}
+ \begin{bmatrix} \gamma_{12} \\ \gamma_{13} \end{bmatrix} \left( X_{1i}^{*} - \mu_{1} \right)
+ \begin{bmatrix} \gamma_{22} \\ \gamma_{23} \end{bmatrix} \left( X_{4i} - \mu_{4} \right)
+ \begin{bmatrix} r_{2i} \\ r_{3i} \end{bmatrix} \\
X_{4i} &= \mu_{4} + \gamma_{14}\!\left( X_{1i}^{*} - \mu_{1} \right) + \gamma_{24}\!\left( D_{2i}^{*} - \mu_{2} \right) + \gamma_{34}\!\left( D_{3i}^{*} - \mu_{3} \right) + r_{4i}
\end{aligned} \qquad (6.43)
$$

Following Sections 6.3 and 6.7, residual variances and covariances for the latent response
variables are fixed quantities, and the binary probit model additionally requires a single,
fixed threshold.

MCMC Algorithm and Distribution of Missing Values


Although the specific models change, the MCMC algorithm for imputing multicategori-
cal predictors is the same as that for binary and ordinal covariates. Using the filled-­in
data from the previous iteration, each new computational cycle begins with an estima-
tion sequence that updates the focal model parameters. Next, MCMC cycles through the
incomplete predictors one at a time, estimating the supporting regression models for each
covariate. The composition of these models depends on whether you adopt a sequential
or partially factored specification, but each incomplete predictor always appears as the
outcome in one of these models. After updating all model parameters, MCMC performs
an imputation step for every incomplete variable. The focal model alone defines the dis-
tribution of the missing outcome variable, and a Metropolis algorithm samples missing
regressor scores from complex multipart posterior predictive distributions.
Following established ideas, the distribution of an incomplete nominal predictor
depends on every model in which it appears. To illustrate, consider the posterior pre-
dictive distribution of the educational attainment scores. To keep the notation simple,
I focus on the partially factored specification, as each predictor’s distribution depends
on just two models. Using generic notation, the distribution of missing values that con-
ditions on all other variables is proportional to the product of the univariate normal
distribution induced by the analysis model from Equation 6.38 and the bivariate normal
distribution spawned by the multinomial probit model in Equation 6.43.

$$
\begin{aligned}
f\!\left( D_{2}, D_{3} \mid Y, X_{1}, X_{4} \right)
&\propto f\!\left( Y \mid X_{1}, D_{2}, D_{3}, X_{4} \right) \times f\!\left( D_{2}^{*}, D_{3}^{*} \mid X_{1}^{*}, X_{4} \right) \\
&= N_{1}\!\left( E\!\left( Y_{i} \mid X_{1i}, D_{2i}, D_{3i}, X_{4i} \right), \sigma_{\varepsilon}^{2} \right)
\times N_{2}\!\left( E\!\left( D_{2i}^{*}, D_{3i}^{*} \mid X_{1i}^{*}, X_{4i} \right), \mathbf{S}_{D^{*}} \right)
\end{aligned} \qquad (6.44)
$$
Dropping unnecessary scaling terms and substituting the appropriate kernels into the
right side of the expression gives the following:

$$
\begin{aligned}
f\!\left( Y_{i} \mid X_{1i}, D_{2i}, D_{3i}, X_{4i} \right) \times f\!\left( D_{2i}^{*}, D_{3i}^{*} \mid X_{1i}^{*}, X_{4i} \right) \propto\;
& \exp\!\left[ -\frac{ \left( Y_{i} - \left( \beta_{0} + \beta_{1} X_{1i} + \beta_{2} D_{2i} + \beta_{3} D_{3i} + \beta_{4} X_{4i} \right) \right)^{2} }{ 2 \sigma_{\varepsilon}^{2} } \right] \\
\times\; & \exp\!\left[ -\frac{1}{2} \left( \mathbf{D}_{i}^{*} - E\!\left( \mathbf{D}_{i}^{*} \mid X_{1i}^{*}, X_{4i} \right) \right)' \mathbf{S}_{D^{*}}^{-1} \left( \mathbf{D}_{i}^{*} - E\!\left( \mathbf{D}_{i}^{*} \mid X_{1i}^{*}, X_{4i} \right) \right) \right]
\end{aligned} \qquad (6.45)
$$
where E(D*_i | X*_1i, X_4i) is the vector of predicted latent variable difference scores from the
probit model in Equation 6.43:

$$
E\!\left( \mathbf{D}_{i}^{*} \mid X_{1i}^{*}, X_{4i} \right)
= \begin{bmatrix} \mu_{2} \\ \mu_{3} \end{bmatrix}
+ \begin{bmatrix} \gamma_{12} \\ \gamma_{13} \end{bmatrix} \left( X_{1i}^{*} - \mu_{1} \right)
+ \begin{bmatrix} \gamma_{22} \\ \gamma_{23} \end{bmatrix} \left( X_{4i} - \mu_{4} \right) \qquad (6.46)
$$

The missing values appear as dummy codes in the focal regression and latent dif-
ference scores in the probit model. The process of sampling imputations is like that for
binary and ordinal predictors. For each missing observation, the Metropolis algorithm
performs four steps. First, it samples a candidate pair of latent imputations from a mul-
tivariate normal proposal distribution centered at a person’s current latent response
scores. As before, the proposal distribution’s covariance matrix is fixed at or adaptively
tuned to accept candidate imputations at an optimal rate of 25–50% (Gelman et al., 2014;
Johnson & Albert, 1999; Lynch, 2007). Second, the algorithm uses the classification
rule from Equation 6.32 to convert the candidate imputes to a discrete value (or equiva-
lently, a pair of dummy codes). For example, a participant with trial values of D2*(new) =
–0.15 and D3*(new) = –0.86 would be classified as a member of the comparison group (i.e.,
less than high school; D2(new) = 0 and D3(new) = 0), because both latent difference scores
are negative. Next, the importance ratio is a fraction that features the target function
from Equation 6.45 in both the numerator and the denominator. The algorithm substi-
tutes the matched pair of candidate imputations (along with the necessary parameter
and data values) into the numerator, and it substitutes the current imputations into the
denominator. Visually, the resulting ratio reflects the relative height of the target dis-
tribution when evaluated at the candidate and current imputes. Finally, the algorithm
generates a random number from a binomial distribution with success rate equal to the
importance ratio. If the random draw is a “success,” the candidate imputations become
new data for the next iteration; otherwise, a participant’s current data carry forward for
another cycle.
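The moving parts of this Metropolis step can be sketched as follows. The helper names are hypothetical, the hard-coded matrix is the inverse of the fixed latent difference covariance matrix [[1, .5], [.5, 1]], and `log_target` evaluates the log of the two-part target in Equation 6.45:

```python
import math
import random

# Inverse of the fixed latent difference covariance matrix [[1, .5], [.5, 1]]
S_INV = [[4 / 3, -2 / 3], [-2 / 3, 4 / 3]]

def log_target(y, x1, dummies, x4, beta, sigma2_eps, d_star, d_mean):
    """Log of the target in Equation 6.45: the focal model's normal kernel
    times the bivariate normal kernel for the latent difference scores."""
    d2, d3 = dummies
    mu_y = beta[0] + beta[1] * x1 + beta[2] * d2 + beta[3] * d3 + beta[4] * x4
    log_focal = -((y - mu_y) ** 2) / (2 * sigma2_eps)
    r2, r3 = d_star[0] - d_mean[0], d_star[1] - d_mean[1]
    quad = (r2 * (S_INV[0][0] * r2 + S_INV[0][1] * r3)
            + r3 * (S_INV[1][0] * r2 + S_INV[1][1] * r3))
    return log_focal - 0.5 * quad

def dummies_from(d_star):
    """Convert candidate latent difference scores to HS and BACH dummy codes."""
    if max(d_star) < 0:
        return (0, 0)                                   # less than high school
    return (1, 0) if d_star[0] > d_star[1] else (0, 1)  # HS vs. bachelor's

def accept(log_new, log_cur):
    """Metropolis decision: success with probability min(importance ratio, 1)."""
    return log_new >= log_cur or random.random() < math.exp(log_new - log_cur)
```

For each missing education score, the algorithm would evaluate `log_target` at the candidate and current latent pairs (with their induced dummy codes) and keep the candidate whenever `accept` returns True.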

Analysis Example
Continuing with the smoking data example, I used Bayesian estimation to fit the regres-
sion model in Equation 6.38. To facilitate interpretation of the intercept, I centered
respondent age at the grand mean, and I did not estimate a supporting model for this
variable, because it is complete and does not require a distribution. There is no reason
to prefer a sequential or partially factored specification in this analysis, so I used the lat-
ter. After inspecting the potential scale reduction factors (Gelman & Rubin, 1992) from
a preliminary diagnostic run, I specified an MCMC process with 11,000 total iterations
and a burn-in period of 1,000 iterations. Analysis scripts are available on the companion
website.
Table 6.5 gives the posterior summaries of the focal model parameters. In the inter-
est of space, I omit the supporting regressor models, because they are not the substan-
tive focus. Because of centering, the intercept reflects the expected smoking intensity
score (cigarettes smoked per day) for a respondent whose parents did not smoke and
attained less than a high school education (Mdnβ0 = 9.49, SDβ0 = 0.33). The interpretation
of the dummy code slope coefficients is the same as any linear regression model. For
TABLE 6.5. Posterior Summary of Regression with an Incomplete Nominal Predictor

Parameter          Mdn      SD      LCL      UCL
Focal analysis model
β0                 9.49     0.33    8.85    10.16
β1 (PARSMOKE)      2.65     0.17    2.30     2.99
β2 (HS)           –0.56     0.34   –1.25     0.10
β3 (BACH)         –1.30     0.36   –2.01    –0.61
β4 (AGE)           0.59     0.04    0.51     0.66
σε²               11.29     0.40   10.56    12.13
R²                  .25      .02     .22      .29

Note. LCL, lower credible limit; UCL, upper credible limit.

example, the β3 coefficient indicates that respondents who received a bachelor’s (or more
advanced) degree smoked 1.30 fewer cigarettes per day, on average, than the comparison
group (Mdnβ3 = –1.30, SD = 0.36), controlling for other predictors.

6.9 LOGISTIC REGRESSION

My nearly exclusive emphasis on probit regression in this chapter largely reflects the
state of the literature, where much of the methodological work on Bayesian estimation
for categorical variables has focused on normally distributed latent response variables.
The appeal of this modeling approach is that MCMC can apply standard estimation steps
for linear regression models. Recent work has extended Albert and Chib’s (1993) data
augmentation strategy to logistic regression (Asparouhov & Muthén, 2021b; Frühwirth-
Schnatter & Frühwirth, 2010; Holmes & Held, 2006; O'Brien & Dunson, 2004; Polson
et al., 2013), and these routines are beginning to appear in statistical software packages
(Keller & Enders, 2021; Muthén & Muthén, 1998–2017). This section summarizes this
approach and provides a data analysis example.
Like the probit model, logistic regression can be viewed through a latent variable
lens where binary scores originate from an underlying continuous dimension (Agresti,
2012; Johnson & Albert, 1999). For example, the simple logistic regression model

ln[ Pr(Yi = 1) / (1 − Pr(Yi = 1)) ] = β0 + β1(Xi)    (6.47)

can also be expressed as the following latent variable regression:

Yi* = β0 + β1(Xi) + εi    (6.48)

εi ~ Logistic(0, 1)
Bayesian Estimation for Categorical Variables 257

The key difference between logistic and probit regression is the distribution of the
residual term—the probit model defines this as a standard normal variable, whereas
logistic regression defines the residual as a standard logistic variable (the function’s
inputs, 0 and 1, are the location and scale parameters).
The logistic error distribution is challenging, because it does not lead to a simple
expression for the conditional distribution of the regression coefficients. Polson et al.
(2013) described an exact approach that weights each person’s data to rescale the logis-
tic regression as a probit-­like model with normally distributed errors. Asparouhov and
Muthén (2021b) extended this procedure to a broad range of structural equation models
with normally distributed predictors. Integrating Polson’s method with factored regres-
sion models is somewhat more flexible, because it accommodates categorical regressors
and interactive or nonlinear effects with missing data (Keller & Enders, 2021).
The MCMC algorithm for Polson’s procedure cycles between two steps: Estimate
person-­specific weights that determine the latent response variable scores, then esti-
mate the regression coefficients given the current latent scores and weighted data. To
begin, the algorithm samples person-­specific weights from a so-­called Pólya-gamma
distribution (a variable with a value determined by an infinite sum of gamma random
variables).

Wi ~ PG (1, X iβ ) (6.49)

The function’s first argument represents the number of binomial trials (e.g., the 1 indi-
cates that each person has a single score), and the second argument is a predicted value
from the logistic regression equation. Visually, the Pólya-gamma function looks like
the right-skewed inverse gamma distribution from Chapter 4 (see Figure 4.6), with the
predicted value determining spread and peakedness (as Xiβ gets larger in magnitude, the
spread decreases and the peakedness increases, and the distribution looks more like a
point mass). After sampling the weights, MCMC deterministically computes the latent
response variable as follows:

Yi* = Yi – 0.5 (6.50)

The latent response scores do not have the clear interpretation that they do in the probit
framework but are simply a mathematical device that simplifies estimation of the β’s.
The key feature of Polson et al.’s (2013) approach is that the weights rescale the
logistic regression as a probit-­like model with normally distributed but heteroscedastic
errors. This reparameterization leads to a multivariate normal posterior distribution for
the regression coefficients

 1 
(  2
) ′
( )
f ( β | W,data ) ∝ f ( β ) = f ( β ) × exp  − Y * − Xβ W −1 Y * − Xβ 

(6.51)

where W is an N × N diagonal matrix containing the person-specific weights, Y* is the
vector of N latent response scores, and X denotes the N × (K + 1) matrix of explanatory
variables that includes a column of ones for the intercept. Adopting a flat prior for the
coefficients, MCMC updates the elements in β by drawing a vector of random numbers
from a multivariate normal conditional distribution.

f(β | W, Y, X) ∝ NK+1(β̂, Σβ̂)    (6.52)

β̂ = (X′WX)⁻¹X′WY*
Σβ̂ = (X′WX)⁻¹

where K is the number of predictors, and NK+1 denotes a normal distribution with K +
1 dimensions or variables. This expression is just a weighted version of Equation 6.6.
Importantly, the resulting parameter estimates are logistic regression coefficients, and
the latent scores and weights are essentially a rescaling trick that simplifies estimation.
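To make the two-step cycle concrete, here is a small sketch of a Pólya-gamma Gibbs sampler for logistic regression under a flat prior. The truncated-series Pólya-gamma generator is an approximation standing in for an exact sampler, the coefficient update follows the weighted least squares form in Polson et al. (2013), and all data and settings are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def rpolya_gamma(c, trunc=200):
    """Approximate PG(1, c) draws by truncating the infinite sum of gamma
    random variables that defines the Polya-gamma distribution."""
    c = np.atleast_1d(c)
    k = np.arange(1, trunc + 1)[:, None]
    g = rng.gamma(1.0, 1.0, size=(trunc, c.size))            # Gamma(1, 1) draws
    denom = (k - 0.5) ** 2 + (c[None, :] ** 2) / (4 * np.pi ** 2)
    return (g / denom).sum(axis=0) / (2 * np.pi ** 2)

def gibbs_logistic(y, X, n_iter=300):
    """Two-step cycle: (1) sample person-specific weights, (2) sample the
    coefficients from their multivariate normal conditional distribution."""
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5                      # latent response scores, Y* = Y - .5
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        omega = rpolya_gamma(X @ beta)                   # step 1: weights
        V = np.linalg.inv(X.T @ (omega[:, None] * X))    # posterior covariance
        m = V @ (X.T @ kappa)                            # posterior mean
        beta = rng.multivariate_normal(m, V)             # step 2: coefficients
        draws[t] = beta
    return draws

# Hypothetical illustration: simulate data, then recover the coefficients.
X = np.column_stack([np.ones(400), rng.normal(size=400)])
beta_true = np.array([0.5, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)
draws = gibbs_logistic(y, X)
beta_hat = np.median(draws[100:], axis=0)
```

Consistent with the text, the saved draws are logistic (not probit) regression coefficients; the weights and latent scores never appear in the output.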

Missing Data Imputation


Most aspects of missing data imputation for logistic regression carry forward from the
probit model. For example, imputing the dependent variable requires only the logistic
model parameters, and the distribution of an incomplete regressor is a composite func-
tion that depends on one or more supporting models. MCMC samples binary outcome
scores from a binomial distribution with success rate equal to the predicted probability
from the logistic model.

Pr(Yi = 1 | β, data) = exp(Xiβ) / [1 + exp(Xiβ)] = πi    (6.53)

Yi(mis) ~ Binomial(1, πi)

The 1 in the function’s first argument indicates that everyone has a single score, and πi is
the predicted probability that Y = 1. Conceptually, drawing binary random numbers is
akin to tossing a biased coin where the probability of a head (e.g., Y(mis) = 1) equals πi and
the probability of a tail (e.g., Y(mis) = 0) equals 1 – πi. Following established procedures,
the Metropolis algorithm imputes missing predictor variables by pairing the binomial
distribution with one or more supporting models.
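The biased-coin analogy translates directly into code. In this hypothetical sketch, the coefficients and design rows are placeholders for the current MCMC state:

```python
import numpy as np

rng = np.random.default_rng(3)

def impute_binary_outcome(X_mis, beta):
    """Compute the logistic predicted probability for each incomplete case,
    then draw one binomial trial per case (a biased coin flip)."""
    pi = 1.0 / (1.0 + np.exp(-(X_mis @ beta)))   # Pr(Yi = 1 | beta, data)
    return rng.binomial(1, pi)                   # Yi(mis) ~ Binomial(1, pi)

# Hypothetical check: an intercept-only model with beta0 = 1.386 implies
# pi of about .80, so roughly 80% of the imputations should equal 1.
X_mis = np.ones((2000, 1))
imputes = impute_binary_outcome(X_mis, np.array([1.386]))
```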

Analysis Example
Returning to the employee turnover analysis from Section 6.3, I use Bayesian estima-
tion to fit a binary logistic regression model that features leader–­member exchange,
employee empowerment, and a gender dummy code (0 = female, 1 = male) as predictors
of turnover intention (TURNOVER = 0 if an employee has no plan to leave her or his
position, and TURNOVER = 1 if the employee has intentions of quitting). The logistic
regression model is as follows:

ln[ Pr(TURNOVERi = 1) / (1 − Pr(TURNOVERi = 1)) ] = β0 + β1(LMXi) + β2(EMPOWERi) + β3(MALEi)    (6.54)

The missing data rates were approximately 5.1% for the turnover intention indicator,
4.1% for the employee–supervisor relationship quality scale, and 16.2% for the
empowerment scale.
A fully sequential specification uses the following factorization (dropping the dis-
tribution of gender, which is complete):

f(TURNOVER | LMX, EMPOWER, MALE) × f(LMX | EMPOWER, MALE) × f(EMPOWER | MALE)    (6.55)

whereas the partially factored model instead assigns a conditional bivariate distribution
to the incomplete predictors, as follows:

f (TURNOVER | LMX, EMPOWER, MALE ) × f ( LMX, EMPOWER | MALE ) (6.56)

There is no compelling reason to prefer one specification to the other, and both param-
eterizations produced the same results. The potential scale reduction factors (Gelman
& Rubin, 1992) from a preliminary diagnostic run indicated that the MCMC algorithm
converged in fewer than 200 iterations, so I used 11,000 total iterations with a conserva-
tive burn-in period of 1,000 iterations. Analysis scripts are available on the companion
website.
Table 6.6 gives posterior summaries of the regression model parameters. In the
interest of space, I omit the covariate model parameters, because they are not the sub-
stantive focus. As a comparison, the bottom panel of Table 3.11 gives the maximum like-
lihood estimates for the same model. The slope coefficients reflect the expected change
in the log odds of quitting for a one-unit increase in the predictor, holding other covari-
ates constant. For example, the leader–­member exchange slope indicates that a one-
unit increase in relationship quality decreases the log odds of quitting by 0.12 (Mdnβ1 =
–0.12, SD = 0.04), controlling for employee empowerment and gender. Consistent with
complete-­data analyses, exponentiating each slope gives an odds ratio that reflects the
multiplicative change in the odds for a one-unit increase in a predictor. For example, a
one-point increase on the leader–­member exchange scale multiplies the odds of quitting
by 0.89.
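The exponentiation step is a one-line computation; using the leader–member exchange slope from the table:

```python
import math

slope = -0.12                  # LMX coefficient on the log-odds metric
odds_ratio = math.exp(slope)   # multiplicative change in the odds, ~0.89
```

Exponentiating –0.12 gives roughly 0.89, an 11% reduction in the odds of quitting per scale point.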

TABLE 6.6. Posterior Summary from the Binary Logistic Regression

Parameter         Mdn     SD     LCL     UCL     OR
β0                1.69    0.62    0.49    2.93     —
β1 (LMX)         –0.12    0.04   –0.19   –0.05   0.89
β2 (EMPOWER)     –0.05    0.03   –0.09    0.00   0.96
β3 (MALE)        –0.10    0.19   –0.46    0.27   0.91
R²                 .07     .03     .03     .13     —

Note. LCL, lower credible limit; UCL, upper credible limit; OR, odds ratio.

6.10 SUMMARY AND RECOMMENDED READINGS

This chapter has described Bayesian estimation for binary, ordinal, and multicategori-
cal nominal variables. The chapter focused primarily on a probit regression framework
that envisions discrete scores originating from one or more normally distributed latent
response variables. The data augmentation approach pioneered by Albert and Chib
(1993) supplements discrete scores with estimates of these latent variables. The appeal
of their approach is that, given a full sample of latent scores, MCMC can apply stan-
dard estimation steps for linear regression models. As such, the main new wrinkle for
this chapter was learning how to impute the latent response variables, which are 100%
missing. Discrete imputes are simple functions of the filled-­in latent data. Recent exten-
sions of this data augmentation strategy accommodate logistic regression and negative
binomial regression for count variables (Asparouhov & Muthén, 2021b; Polson et al.,
2013). The last section of the chapter has described logistic regression, and Chapter 10
illustrates imputation for an incomplete count variable.
Looking back on the last two chapters, Bayesian analyses are like maximum likeli-
hood in the sense that model parameters are the focus; missing data handling happens
behind the scenes, and the goal is to construct temporary imputations that service a
particular analysis model. The focus shifts in Chapter 7, where the Bayesian machinery
is a mathematical device that creates suitable imputations for reanalysis in the frequen-
tist framework. As you will see, multiple imputation co-opts the MCMC algorithms from
the last two chapters, so much of the new content in Chapter 7 focuses on saving and
analyzing multiply imputed data sets and summarizing the results. Finally, I recom-
mend the following articles for readers who want additional details on topics from this
chapter:

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data.
Journal of the American Statistical Association, 88, 669–679.

Asparouhov, T., & Muthén, B. (2021). Expanding the Bayesian structural equation, multilevel
and mixture models to logit, negative-­binomial and nominal variables. Structural Equation
Modeling: A Multidisciplinary Journal, 28, 622–637.

Cowles, M. K. (1996). Accelerating Monte Carlo Markov chain convergence for cumulative-­link
generalized linear models. Statistics and Computing, 6, 101–111.

Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer.
7

Multiple Imputation

7.1 CHAPTER OVERVIEW

Reflecting on the procedures we have covered so far, the primary goal of a maximum
likelihood or Bayesian analysis is to fit a model to the observed data and use the result-
ing estimates to inform one’s substantive research questions. When confronted with
missing values, maximum likelihood uses the normal curve to deduce the missing parts
of the data as it iterates to a solution, and Bayesian estimation imputes the missing
values en route to getting the parameters. In both cases, missing data handling hap-
pens behind the scenes, and imputation—­implicit or explicit—­is just a means to a more
important end, which is to learn something from the estimates. In contrast, multiple
imputation puts the filled-­in data front and center, and the goal is to create suitable
imputations for later analysis.
A typical application of multiple imputation comprises three major steps. The first
step is to specify an imputation model and deploy an MCMC algorithm that creates sev-
eral copies of the data, each containing different estimates of the missing values. As you
will see, this step co-opts the MCMC algorithms from Chapters 5 and 6, usually algo-
rithms for regression models and covariance matrices. The next step is to perform one or
more analyses on the M complete data sets and get point estimates and standard errors
from each set of imputations. The multiply imputed data sets are compatible with the
frequentist statistical paradigm (Rubin, 1987), so this stage leverages a familiar inferen-
tial framework. The final step uses “Rubin’s rules” (Little & Rubin, 2020; Rubin, 1987)
to combine estimates and standard errors into a single package of results.
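As a preview of that final step, the combining rules reduce to a few lines of arithmetic. This sketch pools a single parameter; the estimates and standard errors below are hypothetical:

```python
import numpy as np

def rubins_rules(estimates, std_errors):
    """Pool M point estimates and standard errors: average the estimates,
    then combine within- and between-imputation variance."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    M = est.size
    pooled = est.mean()                      # pooled point estimate
    within = (se ** 2).mean()                # average sampling variance
    between = est.var(ddof=1)                # variance across the M estimates
    total = within + (1 + 1 / M) * between   # total sampling variance
    return pooled, np.sqrt(total)

# Hypothetical slope estimates and standard errors from M = 5 data sets.
pooled_est, pooled_se = rubins_rules([1.9, 2.1, 2.0, 2.2, 1.8],
                                     [0.30, 0.31, 0.29, 0.30, 0.30])
```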
Multiple imputation has a long history that began in 1977, when Donald Rubin pro-
posed the procedure to the Social Security Administration and Census Bureau as a solu-
tion for missing survey data. Scheuren (2005) and van Buuren (2012; pp. 25–28) give
interesting historical accounts of the procedure’s development and subsequent growth,
and Rubin’s 1977 report is available in the American Statistician (Rubin, 2004). Rubin
published his seminal multiple imputation text (Rubin, 1987) a decade after his original
report, and a number of excellent imputation books have followed in the years since

(Carpenter & Kenward, 2013; Schafer, 1997; van Buuren, 2012). Not surprisingly, the
number of published applications of multiple imputation has exploded in recent years,
and software options abound.
In the 40 years or so since its inception, multiple imputation has developed into a
big tent that includes a diverse collection of procedures. I restrict the focus to strategies
that piggyback on the Bayesian estimation routines for regression models and multivari-
ate normal data (including latent data for categorical variables). This focus includes the
two predominant imputation frameworks, joint model imputation (Schafer, 1997, 1999;
Schafer & Olsen, 1998) and fully conditional specification (Raghunathan, Lepkowski,
Van Hoewyk, & Solenberger, 2001; van Buuren, 2012; van Buuren, Brand, Groothuis-­
Oudshoorn, & Rubin, 2006), and it also includes newer model-based imputation strate-
gies for analyses with interactive or curvilinear effects (Bartlett et al., 2015; Enders et al.,
2020; Goldstein et al., 2014; Zhang & Wang, 2017). Because the initial imputation stage
recycles familiar MCMC algorithms, much of the new content in this chapter focuses on
analyzing multiply imputed data sets and summarizing the results.

7.2 AGNOSTIC VERSUS MODEL‑BASED MULTIPLE IMPUTATION

Multiple imputation involves two rounds of estimation and modeling. The first step
fits an imputation model (often a regression model or a multivariate regression model)
to the observed data and uses the resulting estimates to create imputed data sets. As
mentioned, this step co-opts MCMC algorithms from earlier chapters. The second step
fits the focal analysis model (or models) to each complete data set, after which the esti-
mates and standard errors are aggregated into a single package of results. Importantly,
the imputation and analysis steps need not apply the same statistical models; in some
applications, the imputation and analysis models could be identical, and in others they
might be unrecognizably different. As an organizational tool, I classify the procedures
in this chapter into two buckets according to the degree of similarity between the
imputation and analysis models: An agnostic imputation strategy deploys a model that
differs from the substantive analysis, and a model-based imputation procedure invokes
the same focal model as the secondary analysis (perhaps with additional auxiliary
variables).
Note that my definition of agnostic does not imply that the imputation process is
somehow blind to or ignores how the data will be analyzed. Quite the contrary, the
imputation model should be flexible enough to preserve important features of the sec-
ondary analyses, and it should not impose restrictions that conflict with the focal anal-
yses. Rather, my definition conveys that agnostic imputation applies a nonrestrictive
model that is not dedicated to one specific analysis, and the resulting data sets could be
suitable for several purposes. In contrast, a model-based approach tailors imputations
around one and only one analysis model.
Van Buuren (2012, p. 40) offers a similar taxonomy that classifies imputation
schemes according to their scope—broad, intermediate, or narrow. A broad scope pro-
cedure creates imputes that support all analyses that could be performed on the data.
Public-­use data sets are a prototypical example, where the same imputations could serve
many different researchers with diverse substantive goals. Multiple imputation applica-
tions in the social and behavioral sciences are usually intermediate or narrow in scope.
An intermediate scope is one in which a researcher creates a set of imputations for a
family of analyses contained within a single project. A research manuscript is a good
example, where it is often possible to create one set of imputations for all descriptive
summaries and inferential procedures contained within a paper. Finally, a narrow scope
is one in which each analysis for a project requires different imputations.
My categories overlap with van Buuren’s to some degree. For example, model-based
imputation is necessarily narrow in scope, because imputations are tailored to one anal-
ysis, and agnostic imputation schemes are often (but not necessarily) intermediate in
scope. By classifying imputation problems according to the match or mismatch between
the imputation and analysis models, my goal is to emphasize that an analysis model’s
composition—­in particular, whether it includes nonlinear effects such as interactions,
polynomial terms, or random effects—­determines the type of imputation strategy that
works best. Model-based imputation is usually ideal for these types of nonlinearities,
whereas agnostic imputation is well suited for analyses that do not include these special
features. It is perfectly acceptable to use both procedures within the same project or
paper.

7.3 JOINT MODEL IMPUTATION

Joint model imputation derives from applying a multivariate distribution to a set of


incomplete variables. Joe Schafer is arguably responsible for popularizing the approach
(Schafer, 1997, 1999; Schafer & Olsen, 1998; Schafer & Yucel, 2002), and his seminal
text described procedures based on the multivariate normal distribution and the multi-
nomial distribution for incomplete categorical variables. At the time, methods for mix-
tures of categorical and continuous variables were mostly limited to a general location
model that allows the means of incomplete normally distributed variables to vary within
subgroups of complete categorical variables (Belin, Hu, Young, & Grusky, 1999; Olkin
& Tate, 1961). Fast-­forward to today, and contemporary variants of joint model imputa-
tion use the latent response formulation to accommodate incomplete binary, ordinal,
and nominal variables (Asparouhov & Muthén, 2010c; Carpenter & Kenward, 2013;
Quartagno & Carpenter, 2019). Recent innovations have extended the latent response
framework to count and other types of outcomes (Asparouhov & Muthén, 2021b; Polson
et al., 2013).
I use the math achievement data from the companion website to illustrate joint
model imputation. The data set includes pretest and posttest math and academic-­related
variables (e.g., math self-­efficacy, anxiety, standardized test scores, sociodemographic
variables) for a sample of N = 250 students. I use the two waves of achievement data
to illustrate imputation for a within-­subjects analysis that examines whether scores
improve over time. A change or difference score that captures the increase or decrease
between the two test administrations is the analytic focus.

CHANGEi = Y2i − Y1i = MATHPOSTi − MATHPREi (7.1)



Pretest scores are complete, but 16.8% of the posttest scores (and thus the difference
scores) are missing. Casting the analysis as a regression model is useful when using
general-­use statistical software to analyze multiply imputed data sets, because most pro-
grams offer this functionality. The analysis can be expressed as an empty regression
model for the difference scores:

CHANGEi = β0 + εi    (7.2)

CHANGEi ~ N1(β0, σε²)
where β0 is the average change score, N1 denotes the univariate normal distribution,
and σε2 captures variation among the difference scores. A familiar paired-­samples t-test
evaluates the null hypothesis that β0 equals 0.
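To illustrate the equivalence, this hypothetical sketch computes the mean change (the estimate of β0) and the paired-samples t statistic from a handful of made-up pretest and posttest scores:

```python
import numpy as np

def paired_t_from_change(pre, post):
    """Cast the paired t-test as an empty regression on difference scores:
    the intercept estimate is the mean change, and t tests beta0 = 0."""
    change = np.asarray(post, dtype=float) - np.asarray(pre, dtype=float)
    beta0_hat = change.mean()                         # average change score
    se = change.std(ddof=1) / np.sqrt(change.size)    # standard error
    return beta0_hat, beta0_hat / se

# Hypothetical pretest and posttest scores for six students.
pre = [48, 52, 50, 55, 47, 53]
post = [51, 55, 50, 58, 49, 56]
mean_change, t_stat = paired_t_from_change(pre, post)
```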
One compelling reason to use multiple imputation is that it readily accommodates
an inclusive analysis strategy that leverages information from a set of auxiliary variables
(Collins et al., 2001). To this end, the imputation model also includes standardized
reading scores and a binary indicator that measures whether a student is eligible for free
or reduced-­priced lunch (0 = no assistance, 1 = eligible for free or reduced-­price lunch).
Although neither variable predicts whether math posttest scores are missing, including
these variables could improve power, because they increment the explained variance in
the posttest scores by about 16% above and beyond the pretest measure (Collins et al.
refer to these as “Type B” auxiliary variables). Note that the auxiliary variables are also
incomplete; 10.4% of the standardized reading scores are missing (e.g., because a student
is new to the district), and 5.2% of the lunch assistance indicator codes are missing.
Incomplete auxiliary variables are still useful, but their utility diminishes if they are
concurrently missing with the analysis variables (Enders, 2008).
The joint imputation model includes pretest and posttest math scores, standardized
reading test scores, and the lunch assistance indicator. To simplify the ensuing notation,
I refer to these variables as Y1 to Y4. Applying ideas from Chapter 6, the dummy code
appears as a latent response variable, which I denote as Y4*. The joint imputation model
invokes an empty multivariate regression model for the continuous variables and latent
scores, and a mean vector and variance–­covariance matrix are the imputation model
parameters.

$$
\begin{bmatrix} \mathrm{MATHPRE}_i \\ \mathrm{MATHPOST}_i \\ \mathrm{STANREAD}_i \\ \mathrm{FRLUNCH}^*_i \end{bmatrix}
= \begin{bmatrix} Y_{1i} \\ Y_{2i} \\ Y_{3i} \\ Y^*_{4i} \end{bmatrix}
= \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{bmatrix}
+ \begin{bmatrix} r_{1i} \\ r_{2i} \\ r_{3i} \\ r_{4i} \end{bmatrix}
\quad (7.3)
$$

$$
\begin{bmatrix} Y_{1i} \\ Y_{2i} \\ Y_{3i} \\ Y^*_{4i} \end{bmatrix}
\sim N_4\!\left(
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{bmatrix},
\begin{bmatrix}
\sigma^2_1 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\
\sigma_{21} & \sigma^2_2 & \sigma_{23} & \sigma_{24} \\
\sigma_{31} & \sigma_{32} & \sigma^2_3 & \sigma_{34} \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & 1.0
\end{bmatrix}
\right)
$$
FIGURE 7.1. Path diagram of a joint imputation model with four variables. The oval and rect-
angle differentiate the latent variable and its binary indicator, respectively, and the broken arrow
connecting the two is the link function that maps the unobserved continuum to the discrete
responses.

As a reminder, N4 denotes a four-­dimensional normal distribution, and the first and sec-
ond terms inside the normal distribution function are the mean vector and covariance
matrix, μ and Σ. The fixed value on the diagonal of the covariance matrix establishes a
metric for the latent normal variable, and the model also incorporates fixed threshold
parameters (some variants of the joint model instead estimate the threshold and fix the
latent mean to 0).
Figure 7.1 shows a path diagram of the imputation model. Following diagramming
conventions from Edwards et al. (2012), I use an oval and rectangle to differentiate the
latent variable and its binary indicator, respectively, and the broken arrow connecting the
two is the link function that maps the unobserved continuum to the discrete responses
(e.g., the broken arrow reflects the idea that scores above and below the threshold equal
1 and 0, respectively). The residual terms pointing to the rectangles indicate that all
variables have a distribution, and the curved arrows illustrate that the variables link
via covariances. To reduce visual clutter, I omit triangle symbols that researchers some-
times use to denote grand means or intercepts. I characterize this imputation model as
agnostic, because it looks nothing like the analysis model in Equation 7.2. This differ-
ence is not a problem, because the multivariate normal data structure does not conflict
with the analytic model.

Missing Data Imputation


Revisiting ideas from Section 5.9, imputing multivariate normal data involves convert-
ing the mean vector and variance–­covariance matrix (the estimated parameters) into a
series of regression models, one for each missing data pattern. I paint the broad brush-
strokes here for review. First, consider the subgroup of participants with missing values
only on Y2 (e.g., posttest math achievement). The distribution of missing values requires
the regression of Y2 on the other three variables. Equations from Section 5.9 show how to
convert the elements of μ and Σ into regression model parameters, and MCMC uses the
resulting quantities to construct a normal distribution of missing values, from which it
samples imputations.

Y2i(mis) ~ N1( E(Y2 | Y1, Y3, Y4*), σ²2|134 )    (7.4)

To refresh notation, N1 denotes a univariate normal distribution, the expected value is a
predicted score that defines the mean of the missing values, and the residual variance in
the second term defines their spread.
As a second example, consider the subgroup of participants missing Y2 and Y4*
(e.g., math posttest scores and lunch assistance indicator). Imputation for this pattern
requires the multivariate regression of the incomplete variables on Y1 and Y3. After con-
verting the mean vector and covariance matrix to a multivariate regression equation,
MCMC samples imputations from the following bivariate normal distribution:

$$
\begin{bmatrix} Y_{2i(mis)} \\ Y^*_{4i(mis)} \end{bmatrix}
\sim N_2\!\left(
\begin{bmatrix} E(Y_2 \mid Y_1, Y_3) \\ E(Y^*_4 \mid Y_1, Y_3) \end{bmatrix},\;
\Sigma_{24 \mid 13}
\right) \quad (7.5)
$$


It’s still correct to view imputations as predicted values plus noise, but the noise terms
correlate via the off-­diagonal elements in the residual covariance matrix. Notice that
the binary variable’s imputations are on the latent metric. You may recall from Chapter 6
that the location of the latent scores relative to the threshold parameter induces a
corresponding set of discrete imputes (e.g., drawing a latent imputation above the threshold is
consistent with Y4(mis) = 1, and sampling a latent score below the threshold implies that
Y4(mis) = 0). The dichotomous imputations play no role in estimation and are only needed
when saving a data set for later analysis. Importantly, the categorical values result from
a functional link between the latent and discrete scores and not an ad hoc rounding
scheme from the earlier days of multiple imputation (Ake, 2005; Allison, 2002, 2005;
Demirtas & Hedeker, 2008a, 2008b; Horton et al., 2003).

Saving Filled‑In Data Sets for Later Analysis


Joint model imputation is essentially a special application of Bayesian estimation for
multivariate normal data where the main goal is to save filled-­in data sets rather than
interpret parameter estimates. Following procedures from earlier chapters, the MCMC
sequence for joint model imputation consists of three major operations: (1) Estimate the
mean vector conditional on the current covariance matrix and filled-­in data set, (2) esti-
mate the covariance matrix conditional on the new mean vector and the filled-­in data,
and (3) update the missing values (including the latent response scores) conditional on
the current estimates of μ and Σ. Ordinal variables require an additional step that esti-
mates threshold parameters.
The only new procedural wrinkle is that we need to save a relatively small number,
M, of complete data sets (20 is a common recommendation) from a much longer MCMC
process. It is widely known that MCMC estimation yields highly correlated results from
one iteration to the next. We often don’t need to worry about this autocorrelation in a
Bayesian analysis that summarizes estimates over many thousands of computational
cycles, but serial dependencies among a small number of imputed data sets can attenu-
ate multiple imputation standard errors. For this reason, you can’t simply save the M
data sets from successive iterations following the burn-in period. There are two ways
to set up the algorithm to avoid this problem: Save imputed data sets at prespecified
intervals within a single MCMC process, or save the filled-­in data from the last iteration
of separate MCMC processes. The latter strategy naturally avoids autocorrelation. Like
other Bayesian estimation problems, it is important to evaluate whether MCMC has
converged and is mixing well prior to saving the first data set, and trace plots (Schafer &
Olsen, 1998) and potential scale reduction factor diagnostics (Gelman & Rubin, 1992)
are familiar tools for this purpose.
A sequential imputation chain invokes a single MCMC process consisting of M ×
T iterations, and it saves data sets in intervals of T iterations; that is, the first data set is
saved after an initial burn-in period of T iterations, and the remaining M – 1 data sets
are saved every T iterations thereafter. The literature refers to the interval separating the
save operations as the thinning interval or between-­imputation interval. The recipe for
this approach is as follows:

Assign starting values to all parameters, latent scores, and missing data.
Do for m = 1 to M imputations.
Do for t = 1 to T iterations.
> Estimate model parameters conditional on all imputations.
> Estimate missing values conditional on the model parameters.
Repeat.
> Save the filled-­in data for later analysis.
Repeat.

Notice that the recipe features two repetitive loops. The inner block consists of estima-
tion and imputation steps that repeat for T iterations, and the outer loop repeats the
estimation block M times, saving a complete data set with categorized imputes after
each block of T iterations. The recipe reflects a simple strategy that sets the thinning
interval equal to the burn-in period, but the two intervals need not be the same. Schafer
(1997; Schafer & Olsen, 1998) describes graphical tools for assessing the magnitude of
the autocorrelations across iterations, and these plots provide an alternative method for
choosing a thinning interval.
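The sequential recipe translates into a pair of nested loops. In this skeleton, the estimation and imputation steps are crude placeholders (a running mean and standard deviation with normal draws) standing in for the real MCMC updates; only the looping-and-saving structure is the point:

```python
import numpy as np

rng = np.random.default_rng(5)

def sequential_imputation(data, M=20, T=100):
    """Single MCMC run of M * T iterations that saves a filled-in copy of
    the data every T iterations (the thinning interval)."""
    y = np.array(data, dtype=float)
    miss = np.isnan(y)
    y[miss] = np.nanmean(y)                       # crude starting imputations
    saved = []
    for m in range(M):                            # outer loop: M data sets
        for t in range(T):                        # inner loop: T iterations
            mu, sd = y.mean(), y.std(ddof=1)      # placeholder "estimate" step
            y[miss] = rng.normal(mu, sd, miss.sum())  # placeholder imputation
        saved.append(y.copy())                    # save after each T-block
    return saved

data = [3.1, np.nan, 2.7, 4.0, np.nan, 3.5]
imputed_sets = sequential_imputation(data, M=5, T=50)
```

Only the missing positions change across the saved data sets; the observed scores are copied through untouched.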
Specifying parallel imputation chains initiates a unique MCMC process (usually
with random starting values) for each of the M data sets, with each chain producing a
single filled-­in data set at the conclusion of a burn-in period consisting of T iterations.
The recipe for this approach is as follows:

Do for m = 1 to M imputations.
Assign starting values to all parameters, latent scores, and missing data.
Do for t = 1 to T iterations.
> Estimate model parameters conditional on all imputations.
> Estimate missing values conditional on the model parameters.


Repeat.
> Save the filled-­in data for later analysis.
Repeat.

Notice that the initialization step that assigns preliminary values to the unknowns
moves inside the outer loop, forcing MCMC to start from the beginning with new start-
ing values after completing T iterations. This hard reset negates any autocorrelation,
because the M sets of imputations arise from independent chains with unique starting
values.
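The parallel-chains recipe differs only in where the initialization happens. In the sketch below (same simplified stand-ins as the sequential sketch, with an illustrative function name), each data set gets its own chain and fresh random starting values:

```python
import random, statistics

def parallel_imputation(data, M, T, seed=1):
    """Toy parallel chains: each of the M data sets comes from its own MCMC
    run with fresh random starting values, so the saved imputations arise
    from independent chains."""
    observed = [y for y in data if y is not None]
    mu0, sd0 = statistics.mean(observed), statistics.stdev(observed)
    saved = []
    for m in range(M):
        rng = random.Random(seed + m)        # a unique chain per data set
        # hard reset: new random starting values for the missing cells
        filled = [y if y is not None else rng.gauss(mu0, sd0) for y in data]
        for t in range(T):                   # burn-in period
            mu, sd = statistics.mean(filled), statistics.stdev(filled)
            for i, y in enumerate(data):
                if y is None:
                    filled[i] = rng.gauss(mu, sd)
        saved.append(list(filled))           # one data set per chain
    return saved
```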

How Many Imputations Are Needed?


The goal of a Bayesian analysis is to summarize the posterior distribution of the param-
eter estimates over thousands of realizations of the missing values (e.g., a distribution of
estimates from 10,000 MCMC iterations), whereas the goal of multiple imputation is to
perform frequentist data analyses on a much smaller collection of filled-­in data sets. This
begs an important question: How many imputed data sets are needed for this secondary
analysis phase? Early recommendations from seminal resources suggest that M = 3 to 5
data sets are sufficient (Rubin, 1987; Schafer, 1997; Schafer & Olsen, 1998). This rule of
thumb stems from the fact that the statistical precision (the average squared error of an
estimate around its true value) obtained from three to five imputations is only slightly
worse than the precision from an infinite number of imputations (Rubin, 1987, p. 114).
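Rubin's (1987, p. 114) efficiency result is easy to verify numerically. The variance of a point estimate based on M imputations, relative to one based on infinitely many, is (1 + γ/M)⁻¹, where γ is the fraction of missing information. A few lines of Python show why 3 to 5 imputations once seemed adequate:

```python
def relative_efficiency(gamma, M):
    """Relative efficiency of M imputations versus an infinite number
    (Rubin, 1987): (1 + gamma/M)**-1, where gamma is the fraction of
    missing information."""
    return 1.0 / (1.0 + gamma / M)

# with 30% missing information, even M = 5 retains most of the precision
for M in (3, 5, 20, 100):
    print(M, round(relative_efficiency(0.30, M), 3))
```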
Graham, Olchowski, and Gilreath (2007) used computer simulations to examine
how the number of data sets impacts the sensitivity to detect a small effect size, and
they found that a relative efficiency-­based rule of thumb doesn’t necessarily maximize
statistical power. Rather, nontrivial power gains can be achieved by analyzing 20 to 100
data sets, with the optimal M increasing as the proportion of missing information (a
quantity that tracks closely with the missing data rates) increases. Power considerations
aside, other studies suggest that 100 or more imputations may be necessary to reduce
the impact of Monte Carlo simulation noise on standard errors and get good estimates
of confidence interval half-­w idths, probability values, and the proportion of missing
information (Bodner, 2008; Harel, 2007; von Hippel, 2020). Graham et al.’s (2007) study
is now quite old, and Moore’s Law allows us to be less discriminating; from a practical
perspective, there is usually no reason not to use many data sets, because the additional
computations may require just a few more seconds. For this reason, I use M = 100 impu-
tations for most data analysis examples.

Analysis Example
Continuing with the math achievement example, I used the joint imputation model from
Equation 7.3 to create M = 100 data sets. The potential scale reduction factors (Gelman
& Rubin, 1992) from a preliminary diagnostic run indicated that MCMC converged in
fewer than 200 iterations, so I created imputations by saving the filled-­in data from the
final iteration of 100 parallel MCMC chains, each with 1,000 iterations. As explained
previously, autocorrelated imputations are not a concern with this approach.
Table 7.1 gives the Bayesian point estimates (posterior medians) of the imputa-
tion model parameters. To emphasize, these are not the parameters of substantive inter-
est; they are average estimates from the MCMC process that created multiple imputa-
tions for the next round of analysis. As described in Section 5.5, it’s good practice to
inspect imputations to make sure they are plausible and reasonably like the observed
data. Graphing imputations next to the observed scores can provide a window into an
estimator’s inner machinery, as misspecifications such as applying a normal imputation
model to a highly skewed variable can produce out of range or implausible values (e.g.,
negative imputes for a strictly positive variable). I use simple bivariate scatterplots and
histograms to examine the filled-­in data, and Su, Gelman, Hill, and Yajima (2011) and
van Buuren (2012) illustrate other useful graphical displays for this purpose.
Figure 7.2 shows scatterplots for six joint model data sets, with light gray circles
representing observed data and black crosshair symbols indicating imputed data. A
practical consequence of the MAR assumption is that observed and imputed scores
share common model parameters. Although their distributions need not be the same,
the observed and imputed data should combine to form a reasonable looking distri-
bution. They do in this case, as the imputations blend in around the regression line
with relatively few outliers. Figure 7.3 shows overlaid histograms of the math posttest
scores, with the observed data as gray bars and the missing values as white bars with a
kernel density function (the graph reflects a stacked data set with all imputations in the
same file). As you can see, the observed data are relatively normal, and the distribution
of imputations is a close match. Figure 7.4 shows the corresponding plot of the stan-
dardized reading test scores. As you can see, the observed data are negatively skewed, whereas the imputations follow a symmetric distribution; the filled-in data are thus a weighted mixture of the two.
A mismatch between the observed data distribution and the imputations isn’t nec-
essarily a problem, especially when the goal is to estimate means or regression coef-
ficients (Demirtas et al., 2008; Lee & Carlin, 2017; von Hippel, 2013; Yuan et al., 2012).
It is even less of a problem here, because the skewed variable is auxiliary to the main

TABLE 7.1. Joint Model Means, Covariances, and Correlations

Variable         1        2        3        4
1. MATHPRE     74.58     .51      .28     –.13
2. MATHPOST    39.94    83.78     .51     –.32
3. STANREAD    23.99    46.79   103.13    –.41
4. FRLUNCH*    –1.10    –2.91    –4.17    1.00
Means          50.10    56.65    52.57   –0.24

Note. Correlations are in the upper diagonal.



FIGURE 7.2. Scatterplots of pretest and posttest math achievement scores from six joint model
data sets. Gray circles are observed scores, and black crosshair symbols denote imputations.
FIGURE 7.3. Histogram of observed and imputed math posttest scores. The observed data are
the gray bars, and the missing values are the white bars with a kernel density function.
FIGURE 7.4. Histogram of observed and imputed standardized reading scores. The observed
data are the gray bars, and the missing values are the white bars with a kernel density function.
analysis. Nevertheless, observing numerous out-of-bounds or implausible imputes is
often a symptom that the normal distribution is not producing good results. Predictive
mean matching (discussed later) is one option, and generating skewed imputes with the
Yeo–­Johnson transformation (Lüdtke et al., 2020b; Yeo & Johnson, 2000) described in
Chapter 10 is another.

7.4 FULLY CONDITIONAL SPECIFICATION

Rather than working from a multivariate distribution, fully conditional specification (also known as chained equations imputation) casts associations among a set of variables
as a sequence of regression models. The framework was popularized by Stef van Buuren
(2007, 2012; van Buuren et al., 2006) and his R package MICE (Multiple Imputation
by Chained Equations; van Buuren et al., 2021; van Buuren & Groothuis-­Oudshoorn,
2011), although others contributed to the method’s early development (­R aghunathan
et al., 2001). Fully conditional specification provides a solution for missing data prob-
lems where incomplete variables do not originate from a common multivariate distribu-
tion. Mixtures of categorical and continuous variables are one such example, and survey
applications with skip patterns or logical constraints are another.
As the chained equations moniker suggests, fully conditional specification imputes
variables one at a time by stringing together a series of univariate regression models, each
of which features an incomplete variable regressed on all other variables (complete or
imputed). To illustrate, consider a typical round-robin imputation scheme involving a
set of V incomplete variables, Y = (Y1, . . ., Yv, . . ., Y V), and a set of K complete variables,
X = (X1, . . ., XK). For now, assume that the variables are normally distributed, although
they need not be. Importantly, a variable’s designation as a Y or an X makes no reference
to its role in the subsequent analysis; a given variable could be a predictor, an outcome,
or an auxiliary variable. Using shorthand notation from earlier chapters, the imputation
model for variable Yv is a linear regression:

$$Y_{vi}^{(t)} = E\left(Y_{vi} \mid Y_{1i}^{(t)}, \ldots, Y_{(v-1)i}^{(t)}, Y_{(v+1)i}^{(t-1)}, \ldots, Y_{Vi}^{(t-1)}, X_i\right) + r_{vi} \tag{7.6}$$

$$Y_{vi}^{(t)} \sim N_1\left(E\left(Y_{vi} \mid Y_{1i}^{(t)}, \ldots, Y_{(v-1)i}^{(t)}, Y_{(v+1)i}^{(t-1)}, \ldots, Y_{Vi}^{(t-1)}, X_i\right), \sigma_{r_v}^2\right)$$
where Yvi is participant i’s score on Yv, the expectation E(Yvi| . . .) is a predicted score from
the regression of Yv on all other variables, and r vi is a normally distributed residual.
Notice that the predictors in the regression equation condition on previously imputed
variables, some taken from the current iteration, others taken from the previous itera-
tion. The bottom row conveys the familiar idea that imputations are sampled from a
normal distribution with a predicted value and residual variance defining its center and
spread. As with joint model imputation, MCMC provides the mathematical machinery for estimating the regression model parameters.
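As a rough illustration of Equation 7.6 with a single predictor, the sketch below fits a least squares regression to the filled-in cases and draws missing values from a normal distribution centered at the predicted scores. A full MCMC implementation would also sample the coefficients and residual variance from their posteriors; that step is omitted here for brevity, and the function name is hypothetical:

```python
import random

def impute_one_variable(y, x, rng):
    """One simplified imputation pass for an incomplete variable y given a
    single filled-in predictor x: fit least squares to the observed cases,
    then draw each missing value from N(predicted score, residual variance)."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    mx = sum(xi for xi, _ in obs) / n
    my = sum(yi for _, yi in obs) / n
    sxx = sum((xi - mx) ** 2 for xi, _ in obs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in obs)
    b1 = sxy / sxx                       # slope
    b0 = my - b1 * mx                    # intercept
    res_var = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in obs) / (n - 2)
    out = list(y)
    for i, yi in enumerate(y):
        if yi is None:                   # draw around the predicted score
            out[i] = rng.gauss(b0 + b1 * x[i], res_var ** 0.5)
    return out
```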
Fully conditional specification accommodates mixed response types by tailoring
each regression model in the sequence to the incomplete variable’s metric. For example,
the first step could employ a linear regression, the second, a logistic regression, the third,
Multiple Imputation 273

a probit regression, and so on. To connect with material from Chapter 6, I use a latent
response formulation and probit models to impute categorical variables. To illustrate,
reconsider the math achievement analysis. This problem requires a probit regression
model for the binary lunch assistance indicator and linear regressions for the posttest
math and standardized reading test scores. Using generic notation, the fully conditional
specification imputation models at MCMC iteration t are as follows, and each model
requires supporting steps that estimate the parameters:

$$\begin{aligned}
\text{MATHPOST}_i^{(t)} = Y_{2i}^{(t)} &= \gamma_{02} + \gamma_{12} Y_{1i} + \gamma_{22} Y_{3i}^{(t-1)} + \gamma_{32} Y_{4i}^{(t-1)} + r_{2i} \\
Y_{2i(\text{mis})}^{(t)} &\sim N_1\left(E\left(Y_{2i} \mid Y_{1i}, Y_{3i}, Y_{4i}\right), \sigma_{r_2}^2\right) \\
\text{STANREAD}_i^{(t)} = Y_{3i}^{(t)} &= \gamma_{03} + \gamma_{13} Y_{1i} + \gamma_{23} Y_{4i}^{(t-1)} + \gamma_{33} Y_{2i}^{(t)} + r_{3i} \\
Y_{3i(\text{mis})}^{(t)} &\sim N_1\left(E\left(Y_{3i} \mid Y_{1i}, Y_{2i}, Y_{4i}\right), \sigma_{r_3}^2\right) \\
\text{FRLUNCH}_i^{*(t)} = Y_{4i}^{*(t)} &= \gamma_{04} + \gamma_{14} Y_{1i} + \gamma_{24} Y_{2i}^{(t)} + \gamma_{34} Y_{3i}^{(t)} + r_{4i} \\
Y_{4i(\text{mis})}^{*(t)} &\sim N_1\left(E\left(Y_{4i}^{*} \mid Y_{1i}, Y_{2i}, Y_{3i}\right), 1\right)
\end{aligned} \tag{7.7}$$
The probit model also incorporates a fixed threshold parameter, and latent imputa-
tions above and below this cutoff convert to 1 and 0 values, respectively. Notice that
the binary imputes appear on the right side of each equation. Figure 7.5 shows a path
diagram of the imputation models. As before, I use an oval and rectangle to differen-
tiate the latent variable and its binary indicator, respectively, and the broken arrow
connecting the two is the link function that maps the unobserved continuum to the
discrete responses.
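The categorization step is just a comparison against the fixed threshold. The short sketch below assumes a threshold of zero and, purely for illustration, centers the latent draws at the FRLUNCH* mean of –0.24 from Table 7.1:

```python
import random

rng = random.Random(7)
threshold = 0.0  # fixed probit threshold (an assumption for this sketch)

# latent imputations drawn on the underlying normal continuum
latent_imputes = [rng.gauss(-0.24, 1.0) for _ in range(10)]

# discrete imputes: 1 above the threshold, 0 at or below it
binary_imputes = [1 if y_star > threshold else 0 for y_star in latent_imputes]
```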

Fully Conditional Specification with Latent Response Variables
Except for using a probit rather than logit model, the previous regressions are consis-
tent with the classic MICE formulation (van Buuren, 2012; van Buuren et al., 2021; van
Buuren & Groothuis-­Oudshoorn, 2011). Keller and Enders (2021) describe a variation of
fully conditional specification based on latent response scores. Using generic notation,
the fully conditional specification imputation models are as follows:

$$\begin{aligned}
\text{MATHPOST}_i^{(t)} = Y_{2i}^{(t)} &= \gamma_{02} + \gamma_{12} Y_{1i} + \gamma_{22} Y_{3i}^{(t-1)} + \gamma_{32} Y_{4i}^{*(t-1)} + r_{2i} \\
Y_{2i(\text{mis})}^{(t)} &\sim N_1\left(E\left(Y_{2i} \mid Y_{1i}, Y_{3i}, Y_{4i}^{*}\right), \sigma_{r_2}^2\right) \\
\text{STANREAD}_i^{(t)} = Y_{3i}^{(t)} &= \gamma_{03} + \gamma_{13} Y_{1i} + \gamma_{23} Y_{4i}^{*(t-1)} + \gamma_{33} Y_{2i}^{(t)} + r_{3i} \\
Y_{3i(\text{mis})}^{(t)} &\sim N_1\left(E\left(Y_{3i} \mid Y_{1i}, Y_{2i}, Y_{4i}^{*}\right), \sigma_{r_3}^2\right) \\
\text{FRLUNCH}_i^{*(t)} = Y_{4i}^{*(t)} &= \gamma_{04} + \gamma_{14} Y_{1i} + \gamma_{24} Y_{2i}^{(t)} + \gamma_{34} Y_{3i}^{(t)} + r_{4i} \\
Y_{4i(\text{mis})}^{*(t)} &\sim N_1\left(E\left(Y_{4i}^{*} \mid Y_{1i}, Y_{2i}, Y_{3i}\right), 1\right)
\end{aligned} \tag{7.8}$$
FIGURE 7.5. Path diagram of fully conditional specification imputation models for four variables. Incomplete variables link via regressions in a round-robin scheme where each variable is regressed on all others.

As before, the location of the latent variable imputations relative to the threshold parameter induces a corresponding set of dichotomous imputes; however, the categorical imputations play no role in estimation and are only needed when saving a data set for later analysis.
Figure 7.6 shows a path diagram of the previous imputation models, which is like
Figure 7.5 but features ovals (i.e., latent response variables) in every model. The justi-
fication for the latent variable approach is somewhat different from van Buuren’s MICE
algorithm, because it derives directly from (and is compatible with) the multivariate nor-
mal distribution in Equation 7.3. Simulation evidence favors fully latent imputation in
certain limited circumstances (Quartagno & Carpenter, 2019; Wu, Jia, & Enders, 2015),
but MCMC can encounter convergence difficulties if the number of latent response vari-
ables is large. In practice, the two specifications usually give the same results, but the
formulation in Equation 7.8 offers the interesting possibility of saving and analyzing the
latent scores in lieu of the categorical variables. The example in Section 10.6 applies this
strategy to item-level factor analysis.
FIGURE 7.6. Path diagram of fully conditional specification imputation with latent response
variables. Incomplete variables link via regressions in a round-robin scheme where each variable
is regressed on all others, and latent variables replace manifest categorical variables.

Compatibility
An important issue with fully conditional specification is whether the imputation regres-
sion models are mutually compatible (Raghunathan et al., 2001; van Buuren, 2012; van
Buuren et al., 2006). Compatibility has a complex and precise mathematical definition
that can be found in work by Arnold and colleagues (Arnold, Castillo, & Sarabia, 1999;
Arnold et al., 2001; Arnold & Press, 1989) and more recently in Liu et al. (2014) and
Bartlett et al. (2015). I paint the broad brushstrokes here. Returning to the imputation
models in Equation 7.7, each regression induces a distribution for each incomplete vari-
able that conditions on all other variables. The essence of compatibility is whether these
conditional distributions are mutually valid in the sense that their parameters relate to
one another in a coherent way.
Conditional distributions such as those in the previous equations are compatible
if they are spawned by the same joint distribution. To illustrate, suppose Y1 and Y2 are
bivariate normal, as follows:
$$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim N_2\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}\right] \tag{7.9}$$
The multivariate normal distribution is a good exemplar, because it is known to induce
a set of compatible linear regression models (Arnold et al., 2001; Liu et al., 2014).

$$\begin{aligned}
Y_{1i} &= \beta_0 + \beta_1 Y_{2i} + \varepsilon_i = E\left(Y_{1i} \mid Y_{2i}\right) + \varepsilon_i \\
Y_{1i} &\sim N_1\left(E\left(Y_{1i} \mid Y_{2i}\right), \sigma_\varepsilon^2\right) \\
Y_{2i} &= \gamma_0 + \gamma_1 Y_{1i} + r_i = E\left(Y_{2i} \mid Y_{1i}\right) + r_i \\
Y_{2i} &\sim N_1\left(E\left(Y_{2i} \mid Y_{1i}\right), \sigma_r^2\right)
\end{aligned} \tag{7.10}$$
These models and their conditional distributions are compatible, because the param-
eters are functionally related to those of the bivariate distribution, as shown below:

$$\begin{aligned}
\beta_1 &= \sigma_{12} / \sigma_2^2 & \beta_0 &= \mu_1 - \beta_1 \mu_2 & \sigma_\varepsilon^2 &= \sigma_1^2 - \sigma_{12}^2 / \sigma_2^2 \\
\gamma_1 &= \sigma_{12} / \sigma_1^2 & \gamma_0 &= \mu_2 - \gamma_1 \mu_1 & \sigma_r^2 &= \sigma_2^2 - \sigma_{12}^2 / \sigma_1^2
\end{aligned} \tag{7.11}$$

Because both regressions link to a common joint distribution, it follows that their param-
eters also link to one another (e.g., Y1’s regression model parameters are a function of Y2’s
regression parameters and vice versa).
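A quick numeric check of Equation 7.11 (using arbitrary illustrative values for the bivariate normal parameters) confirms that the two regressions are mutually constrained; for example, the product of the two slopes equals the squared correlation:

```python
# illustrative bivariate normal parameters (not taken from the book's data)
mu1, mu2 = 10.0, 20.0
var1, var2, cov12 = 4.0, 9.0, 3.0

b1 = cov12 / var2                 # slope of Y1 on Y2
b0 = mu1 - b1 * mu2               # intercept of Y1 on Y2
g1 = cov12 / var1                 # slope of Y2 on Y1
g0 = mu2 - g1 * mu1               # intercept of Y2 on Y1
res1 = var1 - cov12 ** 2 / var2   # residual variance of Y1 on Y2
res2 = var2 - cov12 ** 2 / var1   # residual variance of Y2 on Y1

# both regressions trace back to the same joint distribution,
# so the product of the slopes equals the squared correlation
rho_sq = cov12 ** 2 / (var1 * var2)
print(abs(b1 * g1 - rho_sq) < 1e-12)  # prints True
```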
From a practical perspective, deploying a set of compatible regression models is
optimal, because the resulting imputations are logically consistent with one another
(assuming the models are correctly specified). However, van Buuren et al. (2006) point
out that “compatibility is not an all-or-­nothing phenomenon” (p. 1053), as certain types
of incompatibilities have little or no impact on multiple imputation parameter estimates
(Gelman & Raghunathan, 2001; Raghunathan et al., 2001). As an example, using a logis-
tic rather than probit imputation model in Equation 7.7 would not satisfy compatibility,
because no common multivariate distribution could simultaneously spawn binomial and
normal conditional distributions. Yet, in practice, the difference between that approach
and using the compatible latent variable imputation models from Equation 7.8 is usu-
ally nil. On the other hand, incompatibilities that result from applying fully conditional
specification to models with interactive or nonlinear terms can lead to substantial biases
(Bartlett et al., 2015; Enders, Keller, et al., 2018; Grund, Lüdtke, & Robitzsch, 2016a; Kim
et al., 2018; Seaman et al., 2012; Zhang & Wang, 2017). Model-based multiple imputa-
tion is usually a much better option.

Saving Filled‑In Data Sets


Fully conditional specification is essentially a special application of Bayesian estimation
for regression models where the main goal is to save filled-­in data sets rather than inter-
pret parameter estimates. Following procedures from Chapters 5 and 6, the MCMC algo-
rithm applies three major steps to each incomplete variable in the sequence: (1) Estimate
regression coefficients given the current residual variance and filled-­in data, (2) esti-
mate the residual variance conditional on the new coefficients and the current data, and
(3) update the missing values (including the latent scores) conditional on the regression
model parameters. Again, ordinal variables require an additional step that estimates
threshold parameters.
Consistent with joint model imputation, you can either save imputed data sets at
prespecified intervals within a single MCMC process (sequential imputation chains) or
save the imputed data set from the last iteration of separate MCMC processes (parallel
imputation chains). The fully conditional specification recipe with parallel imputation
chains is as follows:

Do for m = 1 to M imputations.
Assign starting values to all parameters, latent scores, and missing data.
Do for t = 1 to T iterations.
Do for incomplete variable v = 1 to V.
> Estimate regression coefficients conditional on the residual vari-
ance and all imputations.
> Estimate the residual variance conditional on the coefficients and
all imputations.
Repeat.
Do for incomplete variable v = 1 to V.
> Estimate missing scores (including latent response variables) con-
ditional on the regression model parameters and all other data.
Repeat.
Repeat.
> Save the filled-­in data for later analysis.
Repeat.

Each MCMC cycle now consists of V estimation blocks (one for each incomplete
variable), and there are similarly V imputation steps. The recipe has a Gibbs sampler-­
esque construction, as each estimation sequence conditions on imputations. As a small
technical point, van Buuren’s (2007, 2012; van Buuren et al., 2006) MICE algorithm is
not a true Gibbs sampler, because each estimation step uses just the cases with data
on the target (dependent) variable. However, fully conditional specification with latent
variables does use a Gibbs sampler that conditions on the entire data set.
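The control flow of the recipe can be sketched as follows. To keep the skeleton short, the estimation and imputation steps are reduced to a draw around each variable's current mean rather than the full Bayesian regressions, and the function name is hypothetical:

```python
import random, statistics

def fcs_parallel(data, M, T, seed=1):
    """Skeleton of the fully conditional specification recipe with parallel
    chains. Each incomplete variable is visited in turn within an iteration;
    the per-variable 'estimate' and 'impute' steps are simplified stand-ins."""
    var_names = list(data)
    n = len(next(iter(data.values())))
    saved = []
    for m in range(M):
        rng = random.Random(seed + m)
        # starting values: observed mean for each variable's missing cells
        filled = {v: [y if y is not None
                      else statistics.mean([z for z in data[v] if z is not None])
                      for y in data[v]]
                  for v in var_names}
        for t in range(T):
            for v in var_names:                   # round-robin over variables
                mu = statistics.mean(filled[v])   # stand-in estimation step
                sd = statistics.stdev(filled[v])
                for i in range(n):                # stand-in imputation step
                    if data[v][i] is None:
                        filled[v][i] = rng.gauss(mu, sd)
        saved.append({v: list(filled[v]) for v in var_names})
    return saved
```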

Predictive Mean Matching


The Bayesian estimation routines we are familiar with use Monte Carlo simulation to
draw synthetic missing values from normal distributions. A variant of fully conditional
specification known as predictive mean matching instead draws imputations from
the observed data (Little, 1988a; van Buuren, 2012; Vink, Frank, Pannekoek, & van
Buuren, 2014). After estimating the variable’s regression model, the procedure identifies
a donor pool of individuals with similar predicted values of the missing variable (i.e.,
similar regressor score profiles). Instead of simulating a synthetic imputation from a
normal curve, the procedure instead imputes each missing value with an observed score
drawn at random from the donor pool. Van Buuren (2012) provides a detailed discus-
sion of predictive mean matching, and the procedure is available in his R package MICE
(van Buuren et al., 2021; van Buuren & Groothuis-­Oudshoorn, 2011). Predictive mean
matching is broadly applicable to missing data problems, and it is often recommended
for use with non-­normal data, because it doesn’t impose a normal distribution on the
imputations (Lee & Carlin, 2017).
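A minimal sketch of the donor-selection logic follows; the pool size k, the function name, and the argument names are assumptions for illustration, not MICE's actual interface:

```python
import random

def pmm_impute(pred_obs, y_obs, pred_mis, k=5, rng=None):
    """Sketch of predictive mean matching: for each missing case, find the k
    observed cases whose predicted values are closest to the missing case's
    predicted value, then borrow one of their observed scores at random.
    pred_obs/pred_mis are predicted values for the complete and incomplete
    cases from the imputation regression."""
    rng = rng or random.Random(0)
    imputes = []
    for p in pred_mis:
        # donor pool: the k observed cases with the nearest predicted values
        donors = sorted(zip(pred_obs, y_obs), key=lambda t: abs(t[0] - p))[:k]
        imputes.append(rng.choice(donors)[1])   # borrow an observed score
    return imputes
```

Because every impute is a real observed score, the filled-in values automatically respect the variable's range and shape.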

Analysis Example
Continuing with the math achievement example, I use the fully conditional specification models from Equation 7.8 to create M = 100 data sets. Again, 100 data sets are likely enough to serve a range of purposes such as maximizing power, minimizing the impact of Monte Carlo error on standard errors, and obtaining precise inferences (Bodner, 2008; Graham et al., 2007; Harel, 2007; von Hippel, 2020). The potential scale reduction factors (Gelman & Rubin, 1992) from a preliminary diagnostic run indicated that MCMC converged in fewer than 200 iterations, so I created imputations by saving the filled-in data from the final iteration of 100 parallel MCMC chains, each with 1,000 iterations. As explained previously, autocorrelated imputations are not a concern with this approach.

TABLE 7.2. Posterior Summary of the FCS Imputation Models

Parameters               Mdn      SD      LCL      UCL
Math posttest model
  γ02                   56.64    0.56    55.58    57.75
  γ12 (MATHPRE)          0.42    0.06     0.30     0.54
  γ22 (STANREAD)         0.31    0.06     0.19     0.43
  γ32 (FRLUNCH*)        –1.04    0.65    –2.31     0.24
  σ²r2                  49.77    5.14    41.02    61.22
Standardized reading model
  γ03                   56.64    0.56    55.58    57.75
  γ13 (MATHPRE)          0.42    0.06     0.30     0.54
  γ23 (FRLUNCH*)         0.31    0.06     0.19     0.43
  γ33 (MATHPOST)        –1.04    0.65    –2.31     0.24
  σ²r3                  49.77    5.14    41.02    61.22
Lunch assistance latent response model
  γ04                   –0.27    0.09    –0.45    –0.09
  γ14 (MATHPRE)          0.01    0.01    –0.02     0.03
  γ24 (MATHPOST)        –0.02    0.01    –0.05     0.00
  γ34 (STANREAD)        –0.04    0.01    –0.06    –0.02
  σ²r4                   1.00     —       —        —

Note. Predictors are centered; intercepts are grand means. LCL, lower credible limit; UCL, upper credible limit.
Table 7.2 shows the posterior summaries of the imputation regression models. The
latent imputation approach centers all predictors, such that the intercept coefficients
equal the target variable’s grand mean (Keller & Enders, 2021). To emphasize, these are
not the parameters of substantive interest; they are average estimates from the MCMC
process that created multiple imputations for the next round of analysis. As mentioned
previously, it’s a good practice to inspect imputations to make sure they are plausible
and reasonably similar to the observed data. Figure 7.7 shows bivariate scatterplots
for six imputed data sets, with light gray circles representing observed data and black
crosshair symbols indicating imputed data. Consistent with the corresponding joint
model imputation graphs, the filled-­in values blend in around the regression line with
relatively few outliers. The histograms of the observed and imputed data were also like
those in Figures 7.3 and 7.4, so I omit these graphs in the interest of space.

7.5 ANALYZING MULTIPLY IMPUTED DATA SETS

The product of the initial imputation phase is a set of M complete data sets (e.g., M =
100 for the achievement example). Although it might seem reasonable to do so, averag-
ing the imputations into a single data set is inappropriate, as is stacking the individual
files and analyzing a single aggregated data set. Although the latter strategy could give
unbiased point estimates in some cases, standard errors and confidence intervals would be wrong
(van Buuren, 2012). Rather, the correct way forward is to perform one or more secondary
analyses on each data set and combine multiple sets of estimates and standard errors
into one package of results. Repeating an analysis 100 times sounds incredibly tedious,
but most major software packages have built-in routines that automate this process.
Returning to the math achievement example, the primary analysis involves a test
of the within-subject mean difference. A table of descriptive statistics would be standard fare with such an analysis, and obtaining those summaries requires the means
and standard deviations from each of the 100 data sets. Table 7.3 shows the descriptive
statistics from a few of the 100 fully conditional specification data sets. Similarly, a
test of the within-­subjects mean difference requires the mean change score from each
data set. Table 7.4 gives the estimates and their standard errors for both imputation
approaches. The next few sections describe how to use Rubin’s rules (Little & Rubin,
2020; Rubin, 1987) to combine the various quantities from the tables into a single
package of results. Note that the same data can be used for any number of analyses
involving the set of imputation model variables. However, variables omitted from the
imputation model should not be analyzed, as those scores are uncorrelated with the
filled-­in data values.
FIGURE 7.7. Scatterplots of pretest and posttest math achievement scores from six fully con-
ditional specification data sets. Gray circles are observed scores, and black crosshair symbols
denote imputations.
TABLE 7.3. Imputation-Specific Means and Standard Deviations

                  Pretest            Posttest
Imputation      M       SD        M        SD
  1          50.096   8.659    56.122    9.423
  2          50.096   8.659    56.839    9.317
  3          50.096   8.659    57.087    9.183
  4          50.096   8.659    56.920    9.691
  5          50.096   8.659    56.875    9.324
...             ...     ...       ...      ...
 96          50.096   8.659    56.453    9.440
 97          50.096   8.659    56.322    9.484
 98          50.096   8.659    56.835    9.422
 99          50.096   8.659    56.830    9.382
100          50.096   8.659    56.378    9.097
Pooled       50.096   8.659    56.670    9.226

TABLE 7.4. Change Score Mean and Standard Error Estimates

               Joint model           FCS
Imputation    Est.      SE       Est.      SE
  1          6.524    0.565     6.454    0.567
  2          6.393    0.564     6.502    0.562
  3          6.521    0.564     6.867    0.563
  4          6.677    0.577     6.322    0.561
  5          6.322    0.553     6.683    0.578
...            ...      ...       ...      ...
 96          6.494    0.545     6.324    0.562
 97          6.374    0.561     6.702    0.558
 98          6.822    0.575     6.600    0.574
 99          6.603    0.541     6.772    0.571
100          6.397    0.564     6.535    0.548
Pooled       6.532    0.593     6.551    0.595

Note. FCS, fully conditional specification.



7.6 POOLING PARAMETER ESTIMATES

Analyzing multiply imputed data sets gives M estimates of each model parameter. Rubin
(1987) defined the multiple imputation point estimate as the arithmetic average of the
M estimates:
$$\hat{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m \tag{7.12}$$

where θ̂m is the estimate from data set m, and θ̂ is the pooled point estimate. To illustrate,
the bottom row of Table 7.3 shows the pooled means and standard deviations of the
pretest and posttest scores, and the bottom row of Table 7.4 gives the average within-­
subjects mean difference for each imputation method. Because the imputation proce-
dures use the same variables and make identical assumptions, we would expect them
to produce equivalent estimates, which they do (β̂0 = 6.53 vs. 6.55 for the joint model
and fully conditional specification, respectively). From a substantive perspective, the
estimate’s interpretation is no different than that of a complete-­data analysis—­on aver-
age, math scores improved by about six and one-half points between pretest and post-
test. Despite their Bayesian origins, the imputations are compatible with the frequentist
paradigm; β̂0 is an estimate of the true population parameter.
The arithmetic average is an intuitive way to combine estimates, but the formal statistical rationale for the procedure requires the estimand of interest to have a
normal sampling distribution (Rubin, 1987, Ch. 3; 1996). The normality assumption is
approximately true for common estimates such as means, regression coefficients, and
proportions (van Buuren, 2012), but not all estimates possess this property. Examples
include correlation coefficients, standard deviations and variances, variance explained
statistics, and odds ratios, to name a few. For these estimands, a common recom-
mendation is to (1) transform estimates to a metric that better approximates a nor-
mal curve, (2) apply the pooling rule from Equation 7.12 to the transformed estimates,
then (3) back-­transform the pooled estimate to its original metric. For example, Schafer
(1997) recommends pooling correlation coefficients following a Fisher’s z-transforma-
tion, then back-­transforming the average z-statistic to the correlation metric. Computer
simulations suggest that the impact of such transformations is usually only noticeable
in very small samples (e.g., less than 50; Hubbard & Enders, 2022). Finally, note that
test statistics and p-values should not be averaged, because they do not estimate a fixed
population parameter.
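Schafer's recommendation for correlations translates into a few lines of code: transform each estimate to the z metric, average per Equation 7.12, and back-transform; the function name below is illustrative:

```python
import math

def pool_correlations(rs):
    """Pool M correlation estimates via Fisher's z: transform, average on
    the z metric (Equation 7.12), and back-transform to the r metric."""
    zs = [math.atanh(r) for r in rs]   # Fisher z-transformation
    z_bar = sum(zs) / len(zs)          # arithmetic average on the z metric
    return math.tanh(z_bar)            # back to the correlation metric
```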

7.7 POOLING STANDARD ERRORS

Taking the arithmetic average of the standard errors is inappropriate, because each value
is based on complete data and therefore ignores the additional uncertainty that accrues
from imputation. Rather, multiple imputation standard errors combine two sources of
variation: the sampling error that would have resulted had there been no missing val-
ues (within-­imputation variance), and additional error or noise due to the missing data
(between-­imputation variance). As you will see, estimating the second source of varia-
tion is the sole reason why we need more than one data set.
Returning to the math achievement analysis, Table 7.4 gives the within-­subject
mean difference and standard error estimates from a handful of imputed data sets, with
pooled values shown in bold typeface in the bottom row. Because they are derived from
complete data sets, each of the M standard errors estimates the sampling error that
would have resulted had there been no missing values. Averaging the squared standard
errors (i.e., sampling variances) gives a more stable estimate of sampling error known as
the within-­imputation variance.

$$V_W = \frac{1}{M} \sum_{m=1}^{M} SE_m^2 \tag{7.13}$$

Averaging the squared standard errors from the achievement analysis gives a within-­
imputation variance estimate of VW = 0.32. The square root of this value is the expected
standard error from a complete-­data analysis. Intuitively, this estimate is too small,
because each standard error treats the filled-­in values as real data. A second source of
variation corrects this problem.
A key feature of Table 7.4 is that the estimates vary across data sets; β̂0 equals 6.45 in
the first data set, 6.50 in the second, 6.87 in the third, and so on. This variation is due to
only one source—­differences among the imputations, or missing data uncertainty. The
variance of the M estimates around their pooled value captures the influence of miss-
ing data on precision. This between-­imputation variance is computed by applying the
familiar formula for the sample variance to the estimates.
V_B = \frac{1}{M - 1} \sum_{m=1}^{M} \left( \hat{\theta}_m - \hat{\theta} \right)^2 \qquad (7.14)

The variance of the fully conditional specification estimates from Table 7.4 is VB = 0.04.
This term effectively functions as a correction factor, inflating the standard error to
compensate for missing information. Note that estimating between-­imputation varia-
tion is only possible when analyzing more than one data set. Single imputation pro-
cedures (e.g., mean imputation, regression imputation, hot deck imputation) produce
flawed inferential tests, because they lack this important source of variation.
Complete-­data sampling variation and the noise due to missing data combine to
form the total variance of an estimate, VT, the square root of which is the multiple impu-
tation standard error.

SE = \sqrt{V_W + V_B + \frac{V_B}{M}} = \sqrt{V_T} \qquad (7.15)

The first two terms under the radical should come as no surprise, but you may be won-
dering about the rightmost term. Numerically, the fraction of VB over M is the squared
standard error of the pooled estimate from Equation 7.12. Because it captures the
expected difference between an estimate computed from M data sets and a hypothetical
estimate based on an infinite number of imputations, this term essentially functions as
a correction factor for using a finite number of imputations.
Returning to the within-­subjects mean difference, the multiple imputation stan-
dard error (shown in the bottom row of Table 7.4) is computed as follows:

SE = \sqrt{0.316 + 0.037 + \frac{0.037}{100}} = 0.595 \qquad (7.16)

Notice that the pooled standard error is about 10% larger than the imputation-­specific
standard errors, the average of which is approximately 0.56. This difference owes to the
between-­imputation variance terms that inflate the standard error to compensate for the
missing data. A number of papers point out that multiple imputation standard errors
may be too large (Kim et al., 2006; Nielsen, 2003; Reiter & Raghunathan, 2007; Robins
& Wang, 2000; Wang & Robins, 1998), but Rubin (2003) argues that this conservative
tendency is unimportant, because confidence intervals and inferences (the things we
really care about) are often close to optimal.
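The pooling steps in Equations 7.13 through 7.15 are easy to script. Below is a minimal sketch in Python; the function name is my own, and the demo values are randomly generated stand-ins for the achievement analysis, not the actual imputed-data results.

```python
import numpy as np

def pool_rubin(estimates, standard_errors):
    """Pool M point estimates and standard errors with Rubin's rules.

    Returns the pooled estimate, the within- and between-imputation
    variances (Eqs. 7.13 and 7.14), and the pooled standard error
    (square root of the total variance under the radical in Eq. 7.15).
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    M = est.size
    pooled = est.mean()                 # arithmetic average of the M estimates
    v_within = np.mean(se ** 2)         # Eq. 7.13: average squared standard error
    v_between = est.var(ddof=1)         # Eq. 7.14: variance of the estimates
    v_total = v_within + v_between + v_between / M
    return pooled, v_within, v_between, np.sqrt(v_total)

# Hypothetical stand-ins for the M = 100 achievement analyses
rng = np.random.default_rng(7)
ests = 6.55 + np.sqrt(0.037) * rng.standard_normal(100)
ses = np.full(100, np.sqrt(0.316))
pooled, v_w, v_b, se_pooled = pool_rubin(ests, ses)
```

Feeding in per-imputation results whose squared standard errors average 0.316 and whose between-imputation variance is 0.037 reproduces the 0.595 standard error in Equation 7.16.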

Relative Increase in Variance and Fraction of Missing Information
The standard error expression in Equation 7.15 shows that missing values decrease pre-
cision by increasing the standard error, as they should. Two variance ratios—­the rela-
tive increase in variance and fraction of missing information—­repackage the compo-
nents under the radical to express the influence of missing data in proportional terms.
Software packages that analyze multiply imputed data routinely report these quantities
alongside the pooled estimates and standard errors.
The relative increase in variance is a fraction that expresses missing data uncer-
tainty as a proportion of the complete-­data sampling variation. This ratio of between-­
imputation variance to within-­imputation variance is as follows:
RIV = \frac{V_B + \frac{V_B}{M}}{V_W} \qquad (7.17)
As you can see, the relative increase in variance’s minimum value of 0 occurs when the
between-­imputation variance equals 0, which would only happen if there were no miss-
ing values. Such is the case with the pretest means and standard deviations in Table 7.3,
where the M estimates are identical and both parameters have VB and relative increase
in variance equal to 0. The within- and between-­imputation variance estimates for the
within-­subjects mean difference give RIV = 0.12. This indicates that the additional
uncertainty due to missing data is about 12% as large as the expected sampling error
from a complete-­data analysis (i.e., because of the missing data, the squared standard
error incurs a 12% increase).
The fraction of missing information is a ratio comparing between-­imputation vari-
ation to the total sampling variation. This ratio is akin to an R2 statistic that quantifies
imputation noise as a proportion of the total squared standard error. The fraction is
FMI = \frac{V_B + \frac{V_B}{M}}{V_T} \times \frac{df_R + 1}{df_R + 3} + \frac{2}{df_R + 3} \qquad (7.18)

df_R = (M - 1) \left( 1 + \frac{1}{RIV} \right)^2 \qquad (7.19)

where dfR is the classic degrees of freedom expression from Rubin (1987, Equation 3.1.6).
Equation 7.18 comprises two parts. The first term to the right of the equals sign—the
proportion of between-­imputation variance relative to the total variation—­is an approx-
imation that assumes an infinite number of imputations, and the terms involving the
degrees of freedom adjust the proportion to compensate for using a finite number of
imputations (Pan & Wei, 2016). Returning to the within-­subjects mean difference, the
fraction of missing information based on dfR = 8896.61 is approximately .11, meaning
that 11% of the squared standard error is due to missing data (i.e., the missing data
“explain” about 11% of the estimate’s variation). Although it is not a natural by-­product
of estimation, Savalei and Rhemtulla (2012) show how to obtain the fraction of missing
information from a maximum likelihood analysis.
Rubin (1987, p. 114) states that the fraction of missing information will equal the
missing data proportion in some limited situations, and he suggests it will often be
lower than the missing data rate, because some of the missing information is recouped
via the correlations among the variables in the imputation model. Such is the case in this
example, where 16.8% of the change scores are missing but 11% of the total variation in
the mean difference is attributable to missing data. Pan and Wei (2016) caution against
linking the fractions of missing information to the missing data rates, as they observed
that fraction of missing information values can be much smaller and sometimes even
larger than the missingness proportions. As an aside, getting good estimates of the frac-
tion of missing information usually requires a very large number of imputations, typi-
cally more than what might be required to maximize statistical power (Bodner, 2008;
Harel, 2007).
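Both ratios are simple functions of the pooled variance components. The sketch below is my own helper, not software output, and it takes the rounded values reported above as inputs:

```python
def riv_fmi(v_within, v_between, M):
    """Relative increase in variance (Eq. 7.17), Rubin's degrees of
    freedom (Eq. 7.19), and fraction of missing information (Eq. 7.18)."""
    noise = v_between + v_between / M       # missing-data noise component
    v_total = v_within + noise
    riv = noise / v_within                  # Eq. 7.17
    df_r = (M - 1) * (1 + 1 / riv) ** 2     # Eq. 7.19
    fmi = (noise / v_total) * (df_r + 1) / (df_r + 3) + 2 / (df_r + 3)  # Eq. 7.18
    return riv, df_r, fmi

# Variance components from the within-subjects mean difference
riv, df_r, fmi = riv_fmi(0.316, 0.037, 100)   # riv ≈ 0.12, fmi ≈ .11
```

Because the inputs are rounded, the computed degrees of freedom differ slightly from the df_R = 8,896.61 reported in the text, but the RIV and FMI match to two decimals.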

7.8 TEST STATISTICS AND CONFIDENCE INTERVALS

The familiar t-statistic provides a straightforward way to conduct significance tests with
multiply imputed data. The test statistic for a pooled point estimate is
t = \frac{\hat{\theta} - \theta_0}{SE} \qquad (7.20)
where θ̂ and SE are a pooled estimate and standard error, respectively, and θ0 is the
hypothesized parameter value, typically 0. Returning to the math achievement analysis,
the test statistic for the within-­subjects mean difference is t = 11.02.
Rubin (1987) suggested a t reference distribution with the degrees of freedom
expression from Equation 7.19. An undesirable feature of this classic expression is
that dfR can exceed the sample size, as it does in the math achievement analysis (dfR
= 8896.61). With such a large degrees of freedom value, the t-statistic is effectively a
z-score with a standard normal probability value. Barnard and Rubin (1999) proposed
an alternate expression that addresses this issue. Their degrees of freedom equation is
df_{BR} = \frac{df_R \, df_{obs}}{df_R + df_{obs}} \qquad (7.21)

df_{obs} = df_{com} \times \frac{df_{com} + 1}{df_{com} + 3} \times \left( 1 - \frac{V_B + \frac{V_B}{M}}{V_T} \right) \qquad (7.22)

where dfcom is the degrees of freedom value from an analysis with no missing data (e.g.,
a test of the within-­subjects mean difference has dfcom = N – 1), and dfobs is an observed-­
data degrees of freedom value that downweights the complete-­data degrees of freedom
commensurate with the missing information (Reiter & Raghunathan, 2007). The value
of df BR decreases as missing information increases (as it should, because the data contain
less information about the estimate), and it cannot exceed the complete-­data degrees of
freedom. Returning to the math achievement example, the degrees of freedom for the within-­
subjects mean difference is df BR = 215.61. At face value, this result seems far more rea-
sonable given that dfcom = 249 and the fraction of missing information is about 11%.
Barnard and Rubin’s (1999) computer simulations suggest that df BR improves the accu-
racy of confidence intervals in small samples, and this correction is widely available in
software.
Some software programs that analyze multiply imputed data also define the ratio in
Equation 7.20 as a z-statistic and use a standard normal distribution to get a probability
(e.g., structural equation modeling programs; Asparouhov & Muthén, 2010b). In prac-
tice, the choice of reference distribution and degrees of freedom isn’t critical unless the
sample size is very small, in which case there may be an advantage to using a t-test with
Barnard and Rubin’s (1999) adjustment. The reference distribution made no difference
in this example, as z- and t-statistics were both significant at p < .001.
You might have already intuited that averaging confidence interval limits is inap-
propriate, because the standard errors that give rise to those limits are too small. Rather,
confidence interval limits are computed by multiplying the pooled standard error by the
appropriate t critical value, then adding and subtracting that product (i.e., the margin of
error or half-width) to the pooled estimate, as follows:

CI = \hat{\theta} \pm t_{CV} \times SE \qquad (7.23)

The appropriate t critical value for the desired alpha level, tCV, requires one of the
previous degrees of freedom expressions. Using an alpha level of .05, Rubin’s (1987)
original expression gives tCV = 1.96 (essentially the same as the critical value from a
normal distribution), whereas Barnard and Rubin’s (1999) degrees of freedom formula
gives tCV = 1.97. The 95% confidence interval for the latter approach spans from a mean
difference of 5.38 to 7.72. Consistent with the test statistic, we can conclude that the
positive change is significant, because the null value of 0 falls well outside the confi-
dence interval.
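Putting the pieces together, the sketch below computes the test statistic, the Barnard–Rubin degrees of freedom, and the confidence interval. The function is my own illustration; the critical value is passed in (e.g., from a t table or a statistics library) so the example stays dependency-free.

```python
import math

def mi_t_test(pooled, v_within, v_between, M, df_com, t_cv):
    """t-statistic (Eq. 7.20 with theta_0 = 0), Barnard-Rubin degrees
    of freedom (Eqs. 7.21-7.22), and confidence interval (Eq. 7.23)."""
    noise = v_between + v_between / M
    v_total = v_within + noise
    se = math.sqrt(v_total)
    t_stat = pooled / se                                    # Eq. 7.20
    riv = noise / v_within
    df_r = (M - 1) * (1 + 1 / riv) ** 2                     # Eq. 7.19
    df_obs = df_com * (df_com + 1) / (df_com + 3) * (1 - noise / v_total)  # Eq. 7.22
    df_br = df_r * df_obs / (df_r + df_obs)                 # Eq. 7.21
    ci = (pooled - t_cv * se, pooled + t_cv * se)           # Eq. 7.23
    return t_stat, df_br, ci

# Within-subjects mean difference: pooled estimate 6.55, df_com = N - 1 = 249
t_stat, df_br, ci = mi_t_test(6.55, 0.316, 0.037, 100, df_com=249, t_cv=1.97)
```

With the rounded inputs shown, this reproduces the values in the text up to rounding: t ≈ 11.02, df_BR ≈ 215.6, and a 95% interval of roughly 5.38 to 7.72.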

7.9 WHEN MIGHT MULTIPLE IMPUTATION GIVE DIFFERENT ANSWERS?

Having completed a multiple imputation analysis from start to finish, let’s take a step
back and contrast the approach with maximum likelihood and Bayesian estimation.
When will the methods give similar results, and when might they differ? Maximum
likelihood and Bayesian estimation are single-­stage procedures that extract parameter
estimates directly from the observed data. Although they attack the missing data prob-
lem differently, numerous examples from earlier chapters show that the two procedures
are generally equivalent (at least numerically). In contrast, multiple imputation is a two-
stage approach that separates missing data handling from the analyses; the first stage
deploys an imputation model, the sole purpose of which is to fill in the data, and the sec-
ond stage involves analyzing the completed data sets. I previously emphasized that the
imputation and analysis stages need not employ the same statistical model. Returning
to the math achievement example, the analysis was a within-­subjects mean comparison,
but joint model imputation and fully conditional specification deployed different regres-
sion models with two extra auxiliary variables. It ends up that the type and degree of
mismatch between the imputation and analysis models dictate whether multiple impu-
tation differs from direct estimators like maximum likelihood and Bayesian estimation.
Collins et al. (2001) and Schafer (2003) outline three propositions or rules of thumb
that help us anticipate when multiple imputation will agree or disagree with direct esti-
mators such as maximum likelihood and Bayesian estimation. The first proposition
states that direct estimation and multiple imputation will be equivalent if they use the
same set of variables and invoke models that apply equivalent assumptions and struc-
ture to the data. Such models are said to be congenial (Meng, 1994). Returning to the
math achievement example, the analysis model in Equation 7.2 assumes that difference
scores are normally distributed. Had I omitted the auxiliary variables, the bivariate nor-
mal distribution from Equation 7.9 would have served as the joint imputation model.
The imputation and analysis models would be congenial, because they use the same
variables and apply the same assumptions. The fact that the difference score parameters
are functions of the bivariate normal distribution’s parameters demonstrates this point
(i.e., β_0 = μ_2 – μ_1 and σ_ε^2 = σ_1^2 + σ_2^2 – 2σ_{21}). The same conclusion holds for fully condi-­
tional specification.
The second proposition applies to situations where direct estimation and multiple
imputation use the same set of variables, but the imputation model uses more param-
eters. An example of this proposition occurs in confirmatory factor analysis applications
where a researcher could use direct estimation to fit a factor analysis model that imposes
a particular pattern on the associations in the data, or first employ multiple imputation
with an unrestricted mean vector and covariance matrix (i.e., a saturated imputation
model). In this situation, Collins et al. (2001) predict that the two approaches will give
equivalent parameter estimates, but multiple imputation standard errors could be some-
what larger, because the first-stage imputation model uses more parameters than neces-
sary to represent the data.
The third proposition applies to uncongenial scenarios (Meng, 1994) where the
imputation stage uses different variables than direct estimation. Collins et al. (2001)
suggest that multiple imputation can give different estimates and standard errors, even
when the second-­stage analysis matches that of the direct estimator. It is widely known
that excluding an analysis variable from imputation is detrimental, because the residual
correlation between the imputations and omitted variable is fixed to 0 (Rubin, 1996).
The resulting parameter estimates will be biased toward zero unless the association
is truly nil in the population. In contrast, uncongeniality that results from including
extra auxiliary variables is usually considered beneficial, because doing so can reduce
nonresponse bias and improve power (Collins et al., 2001; Enders, 2008; Howard et al.,
2015; Raykov & West, 2015; Rubin, 1996; Schafer, 2003), with a few exceptions (e.g., an
excessively large number of auxiliary variables, a peculiar pattern of associations; Hardt
et al., 2012; Thoemmes & Rose, 2014).
Returning to the math achievement example, a comparable maximum likelihood
analysis that ignored the auxiliary variables produced a mean difference of β̂0 = 6.45 (SE
= 0.62). Although the maximum likelihood estimate isn't identical to imputation, it is close and dif-­
fers by about 15% of a standard error unit. The multiple imputation standard errors are
somewhat smaller, which is what we might expect when using auxiliary variables that
contain a substantial proportion of unique variance, as they do here (the two additional
variables increment the explained variance in the math posttest scores by about 16%
above and beyond the pretest assessment). In general, differences that result from using
auxiliary variables should be most apparent when the missing data rates are very high
(e.g., greater than 25%; Collins et al., 2001).

7.10 INTERACTION AND CURVILINEAR EFFECTS REVISITED

The emergence of missing data-­handling methods for interactive and nonlinear effects
is an important recent development (Bartlett et al., 2015; Enders et al., 2020; Erler et
al., 2016; Kim et al., 2015, 2018; Lüdtke et al., 2020a, 2020b; Zhang & Wang, 2017),
and we’ve so far encountered these methods in the maximum likelihood and Bayesian
frameworks. A prototypical model features a focal predictor X, a moderator variable M,
and the product of the two:

Y_i = \beta_0 + \beta_1 X_i + \beta_2 M_i + \beta_3 X_i M_i + \varepsilon_i \qquad (7.24)

\varepsilon_i \sim N_1 \left( 0, \sigma_\varepsilon^2 \right)
In this model, β1 is a conditional effect that reflects the influence of X when M equals
zero, and β2 is the corresponding conditional effect of M when X equals zero. The β3
coefficient is usually of particular interest, since it captures the change in the β1 slope
for a one-unit increase in M (i.e., the amount by which X’s influence on Y is moderated
by M).
Two well-known imputation approaches for interactive and curvilinear effects—­
passive imputation and just-­another-­variable imputation—­warrant a brief discussion,
because they are easy to implement but known to introduce bias. A passive imputation
model includes only the lower-order terms (e.g., Y, X, and M) in the first stage, and any
product or squared term is computed from the filled-­in predictors prior to analysis. Von
Hippel (2009) refers to this strategy as impute-­then-­transform, because transformations
such as product and squared terms are computed postimputation. Iterative variants of
passive imputation compute the transformed variable at the conclusion of each MCMC
iteration and include it as a predictor in the next round of imputation (Royston, 2005;
van Buuren, 2012). Several computer simulation studies report that passive imputation
estimates are biased (Kim et al., 2015, 2018; Seaman et al., 2012; von Hippel, 2009),
which makes intuitive sense given that the imputation model effectively ignores the
nonlinear effect.
In contrast, just-­another-­variable imputation treats a nonlinear term like any other
variable to be imputed. Von Hippel (2009) refers to this strategy as transform-­then-­
impute, because transformations such as products or squares are computed prior to
imputation. To apply this method to the moderated regression analysis in Equation
7.24, you would first compute an explicit product variable Z = X × M that is missing
when one of its components is missing. The product would then function like any other
variable in the imputation model. For example, the joint imputation framework would
employ a multivariate normal model with an unrestricted mean vector and covariance
matrix.

\begin{pmatrix} X_i \\ M_i \\ Z_i \\ Y_i \end{pmatrix} \sim N_4 \left( \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{34} \\ \sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2 \end{pmatrix} \right) \qquad (7.25)

The normality assumption for Z is problematic, because the product of two random
variables isn’t normal (Craig, 1936; Lomnicki, 1967; Springer & Thompson, 1966), and
the mean and variance are deterministic functions of the component variables (Aiken
& West, 1991; Bohrnstedt & Goldberger, 1969). Another undesirable consequence of
just-­another-­variable imputation is that the filled-­in Z values do not equal the product of
the imputed X and M scores. Von Hippel (2009) investigated a variant of the procedure
that resolves this inconsistency by deleting the imputed product variable at the end of
each MCMC iteration and recomputing it by multiplying the filled-­in X and M scores.
However, this transform, impute, then transform again approach appears to exacerbate
rather than mitigate bias. Seaman et al. (2012) give analytic arguments showing that
just-­another-­variable imputation is approximately unbiased when missingness is com-
pletely at random (i.e., does not depend on the data) but is biased if scores are condition-
ally MAR. Several simulation studies support this conclusion (Enders et al., 2014; Kim
et al., 2015, 2018; Lüdtke et al., 2020b; Seaman et al., 2012; von Hippel, 2009; Zhang &
Wang, 2017).
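As a concrete illustration of the transform-then-impute setup, the product variable can be built so that missingness propagates automatically. This is a generic sketch with made-up values, not output from any imputation package:

```python
import numpy as np

# Transform-then-impute: compute Z = X * M before imputation, coding
# missing values as NaN so the product is missing whenever either
# component is missing (NaN propagates through multiplication).
x = np.array([1.5, np.nan, 2.0, 0.5])
m = np.array([1.0, 0.0, np.nan, 1.0])
z = x * m   # -> [1.5, nan, nan, 0.5]
```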
There are only two situations where we can safely apply joint model or fully condi-
tional specification imputation to interactive (or nonlinear) effects. First, when missing
values are restricted to the outcome variable, we can simply include the complete product term in the imputation model, as we would any other complete variable. Second, if
either X or M is complete and categorical, applying joint model imputation or fully con-
ditional specification within each subgroup preserves all possible two-way associations
with the grouping variable (Enders & Gottschall, 2011; Graham, 2009; van Buuren,
2012). Beyond these scenarios, model-based imputation is usually a much better option
for interactive and nonlinear effects (Bartlett et al., 2015; Enders et al., 2020; Erler et al.,
2016; Kim et al., 2015, 2018; Zhang & Wang, 2017).

7.11 MODEL‑BASED IMPUTATION

A defining feature of agnostic imputation is its reliance on a generic model that differs
from the primary analysis but preserves its main features. The appeal of this strategy is
that the resulting imputations are potentially appropriate for a variety of different analy-
ses (e.g., descriptive summaries, analyses involving different combinations or subsets
of variables from the imputation model). In contrast, model-based imputation tailors
imputes around one particular analysis. The literature also refers to this strategy as a
sequential specification, fully Bayesian imputation, and substantive model-­compatible
imputation (Bartlett et al., 2015; Erler et al., 2016, 2019; Lüdtke et al., 2020b; Zhang
& Wang, 2017). In fact, we already know how to implement model-based imputation,
because it piggybacks on a Bayesian analysis. The only new wrinkle is that we save
imputations and refit the analysis model to the filled-­in data. To reiterate, a major reason
for choosing model-based imputation is that it readily handles interactive and nonlinear
effects.
Switching gears to a different substantive context, I use the chronic pain data
to illustrate a moderated regression analysis with an interaction effect. The data set
includes psychological correlates of pain severity (e.g., depression, pain interference
with daily life, perceived control) for a sample of N = 275 individuals with chronic pain.
This example piggybacks on the maximum likelihood analysis from Section 3.8 and the
Bayesian analysis from Section 5.4. The motivating question is whether gender moder-
ates the influence of depression on psychosocial disability, a construct capturing pain’s
impact on emotional behaviors such as psychological autonomy and communication,
emotional stability, and so forth. The moderated regression model is

DISABILITY_i = \beta_0 + \beta_1 (DEPRESS_i - \mu_1) + \beta_2 (MALE_i) \qquad (7.26)
\quad\quad + \beta_3 (DEPRESS_i - \mu_1)(MALE_i) + \beta_4 (PAIN_i) + \varepsilon_i

where DISABILITY and DEPRESS are scale scores measuring psychosocial disability and
depression, MALE is a gender dummy code (0 = female, 1 = male), and PAIN is a binary
severe pain indicator (0 = no, little, or moderate pain, 1 = severe pain). I centered depres-
sion scores at their grand mean to facilitate interpretation. The disability and depression
scores have 9.1 and 13.5% of their scores missing, respectively, and approximately 7.3%
of the binary pain ratings are missing. By extension, 13.5% of the sample is also missing
the product term.

To refresh, Bayesian estimation uses a factored regression or sequential specification that expresses a complicated joint distribution as the product of two or more uni-­
variate distributions, each of which corresponds to a regression model. The sequential
specification for the moderated regression analysis applies the factorization below:

f(DISABILITY, DEPRESS, MALE, PAIN) =
\quad f(DISABILITY \mid DEPRESS, MALE, DEPRESS \times MALE, PAIN) \times \qquad (7.27)
\quad f(DEPRESS \mid PAIN, MALE) \times f(PAIN^* \mid MALE) \times f(MALE^*)
The first term to the right of the equals sign corresponds to the analysis model from
Equation 7.26. Importantly, the product is not a variable with its own distribution, but
rather a deterministic function of depression and gender, either of which could be miss-
ing. The regressor models in the next three terms translate into a linear regression for
depression, a probit (or logistic) model for the severe pain indicator, and an empty probit
(or logistic) model for the marginal distribution of gender (which I ultimately ignore,
because the variable is complete).

DEPRESS_i = \gamma_{01} + \gamma_{11} (PAIN_i) + \gamma_{21} (MALE_i) + r_{1i} \qquad (7.28)

PAIN_i^* = \gamma_{02} + \gamma_{12} (MALE_i) + r_{2i}

MALE_i^* = \gamma_{03} + r_{3i}

The asterisk superscripts reflect a latent response variable formulation for the binary
variables (see Section 6.3).
Following ideas established in Chapters 5 and 6, the distribution of an incomplete
predictor depends on every model in which it appears (e.g., see Equation 5.22 for the
conditional distribution of depression in this example). Deconstructing a variable’s dis-
tribution into multiple components ensures that the conditional models in Equations
7.26 and 7.28 are formally compatible. Importantly, the product term is not an imputed
variable. Rather, computing the product of the filled-­in covariate scores prior to analysis
appropriately preserves the interaction effect, because the imputation model effectively
anticipates that depression scores will be multiplied by gender. Procedurally, the algo-
rithmic recipe for model-based imputation is identical to the Bayesian analysis from Sec-
tion 5.4. The only modification is that we save the filled-­in data from the final iteration
of parallel MCMC chains.

Analysis Example
Continuing with the chronic pain example, I used model-based multiple imputation to
create filled-­in data sets for the moderated regression from Equation 7.26. To reiterate,
the imputation and analysis models are identical. To illustrate an inclusive imputation
strategy, I augmented model-based imputation with four auxiliary variables: anxiety,
stress, perceived control over pain, and pain interference with daily life. I introduced
the auxiliary variables using a sequential specification like the one from Section 5.8, but
I could have simply added the variables as extra covariates in the first stage regression
model (because the initial estimates are not the focus, modifying the meaning of the
imputation model’s slope coefficients is not a concern). Following earlier examples, I
again created M = 100 imputed data sets. The potential scale reduction factors (Gelman
& Rubin, 1992) from a preliminary diagnostic run indicated that MCMC converged in
fewer than 200 iterations, so I created imputations by saving the filled-­in data from the
final iteration of 100 parallel MCMC chains, each with 1,000 iterations. As explained
previously, autocorrelated imputations are not a concern with this approach.
After creating the multiple imputations, I fit the moderated regression model from
Equation 7.26 to each data set and used Rubin’s (1987) rules to pool the parameter esti-
mates and standard errors. Note that the auxiliary variables played no role in the sec-
ondary analysis phase. To facilitate interpretation of the lower-order terms, I centered
depression scores at the pooled estimate of the grand mean prior to computing the prod-
uct term and fitting the model. Table 7.5 gives the pooled estimates and standard errors
from multiple imputation with the Barnard and Rubin (1999) degrees of freedom. Table
3.7 gives the corresponding maximum likelihood estimates, and Table 5.2 shows the
Bayesian analysis results. Figure 3.6 is also an accurate depiction of the multiple impu-
tation simple slopes, which were effectively identical to those of maximum likelihood.
Recall that lower-order terms are conditional effects that depend on scaling; β̂1 = 0.38 (SE
= 0.06) is the effect of depression on psychosocial disability for female participants, and
β̂2 = –0.77 (SE = 0.56) is the gender difference at the depression mean. The interaction
effect captures the slope difference for males. The negative coefficient (β̂3 = –0.23, SE =
0.09) indicates that the male depression slope was approximately 0.23 points lower than
the female slope (i.e., the male slope is β̂1 + β̂3 = 0.38 – 0.23 ≈ 0.16, within rounding). As you might expect
based on work from Collins et al. (2001) and others, the multiple imputation estimates
were numerically equivalent to those of maximum likelihood and Bayesian estimation.

TABLE 7.5. Model-Based Multiple Imputation Estimates from the Moderated Regression

Parameter            Est.    SE      t       df       p       FMI
β0                   21.66   0.37    57.86   217.58   < .001  .16
β1 (DEPRESS)          0.38   0.06     6.11   187.55   < .001  .23
β2 (MALE)            –0.77   0.56    –1.38   231.11   .17     .12
β3 (DEPRESS)(MALE)   –0.23   0.09    –2.55   206.61   .01     .18
β4 (PAIN)             1.81   0.61     2.97   226.02   < .001  .13
σε2                  16.93   —       —       —        —       .15
R2                     .22   —       —       —        —       .18

Conditional effects (simple slopes by gender)
βFemale               0.38   0.06     6.11   231.46   < .001  .23
βMale                 0.16   0.07     2.37   250.90   .02     .15

Note. FMI, fraction of missing information.
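The fit-each-data-set-then-pool workflow just described can be sketched in a few lines. The helper below is illustrative only: each imputed data set is assumed to be a dict of numpy arrays, and the variable names are placeholders rather than a package API.

```python
import numpy as np

def fit_and_pool(imputed_datasets):
    """Fit the moderated regression (Eq. 7.26) to each filled-in data set
    by ordinary least squares, then pool with Rubin's rules."""
    # Center depression at the pooled (grand) mean before forming the product
    grand_mean = np.mean([d["depress"].mean() for d in imputed_datasets])
    ests, sq_ses = [], []
    for d in imputed_datasets:
        dep_c = d["depress"] - grand_mean
        X = np.column_stack([np.ones_like(dep_c), dep_c, d["male"],
                             dep_c * d["male"], d["pain"]])
        y = d["disability"]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
        cov = sigma2 * np.linalg.inv(X.T @ X)            # OLS covariance matrix
        ests.append(beta)
        sq_ses.append(np.diag(cov))
    ests, sq_ses = np.array(ests), np.array(sq_ses)
    M = len(imputed_datasets)
    v_w = sq_ses.mean(axis=0)              # within-imputation variance (Eq. 7.13)
    v_b = ests.var(axis=0, ddof=1)         # between-imputation variance (Eq. 7.14)
    return ests.mean(axis=0), np.sqrt(v_w + v_b + v_b / M)
```

Calling `fit_and_pool` on the M = 100 filled-in data sets would return pooled slopes and standard errors analogous to those in Table 7.5 (without the Barnard–Rubin degrees of freedom, which would be computed as in Section 7.8).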



7.12 MULTIVARIATE SIGNIFICANCE TESTS

The univariate t-statistic in Section 7.8 provides a familiar way to evaluate individual
parameters. Extending ideas from Chapter 2, you can use the Wald test and likelihood
ratio statistics to evaluate multiple parameters or compare two nested models. The Wald
test and likelihood ratio statistic can evaluate the same hypotheses, but they do so in dif-
ferent ways. The Wald test compares the discrepancy between the estimates and hypoth-
esized parameter values (usually zeros) to sampling variation. The simplest incarnation
of the test statistic is just a z-score or chi-­square. In contrast, the likelihood ratio sta-
tistic compares log-­likelihood values from two competing models, the simpler of which
aligns with the null hypothesis. The two tests are equivalent in large samples but can
give markedly different answers in small to moderate samples (Liu & Enders, 2017).
As you will see, the multiple imputation variants of these tests share a common logic,
as both include a component that depends on the pooled estimates and an additional
adjustment term that accounts for variation attributable to missing data.

Wald Test
A quick review of the maximum likelihood Wald test (Buse, 1982; Wald, 1943) sets the
stage for its multiple imputation counterpart. The test statistic is

T_W = \left( \hat{\theta}_Q - \theta_0 \right)' \hat{S}_{\hat{\theta}_Q}^{-1} \left( \hat{\theta}_Q - \theta_0 \right) \qquad (7.29)

where θ̂Q is a vector of Q estimates, θ0 is the corresponding vector of hypothesized
values (typically zeros), and \hat{S}_{\hat{\theta}_Q} is a variance–­covariance matrix that contains the Q rows
and columns from the full parameter covariance matrix (or its robustified counterpart).
The numerical value of TW is the sum of squared standardized differences between the
estimates and their hypothesized values. If the null hypothesis is true, the test statis-
tic follows a central chi-­square distribution with Q degrees of freedom, and statistical
significance implies that one or more of the estimates in θ̂Q are different from their
hypothesized values.
The multiple imputation version of the Wald test similarly compares a vector of
pooled estimates in θ̂Q to a set of null values in θ0. The test statistic standardizes these
differences against a covariance matrix that incorporates additional noise from the
missing data. Returning to the moderated regression example, we can use the Wald test
to evaluate the null hypothesis that R2 = 0 by comparing the pooled regression slopes to
null values of 0. For this application, the parameter vector and null values are θ̂Q = (β̂1,
β̂2, β̂3) and θ0 = (0, 0, 0), respectively, and the variance–­covariance matrix of the esti-
mates is a 3 × 3 matrix with squared standard errors on the diagonal.
Following Rubin’s pooling rule for standard errors, the variance–­covariance matrix
of the estimates contains two sources of variation: the sampling variation and covaria-
tion that would have resulted had the data been complete and additional error due to
missing data. Each of the M analyses yields a covariance matrix that estimates complete-­
data sampling error, and averaging these matrices gives a within-­imputation covariance
matrix that more precisely estimates this source of variation.

S_W = \frac{1}{M} \sum_{m=1}^{M} \hat{S}_{\hat{\theta}_Q, m} \qquad (7.30)

The variance of the estimates around their pooled value captures the influence of the
missing data on precision, and these between-­imputation deviations can also covary.
The Wald test captures this variability with a between-­imputation covariance matrix
computed by applying the familiar formula for the sample covariance matrix to the
estimates.
\Sigma_B = \frac{1}{M-1} \sum_{m=1}^{M} \left( \hat{\theta}_m - \hat{\theta}_Q \right) \left( \hat{\theta}_m - \hat{\theta}_Q \right)^{\prime} \qquad (7.31)
The total covariance matrix combines complete-­data sampling variation (the within-­
imputation covariance matrix) and additional error or noise due to the missing data (the
between-­imputation covariance matrix).
\Sigma_T = \Sigma_W + \Sigma_B + \frac{\Sigma_B}{M} \qquad (7.32)
This expression simplifies to VT when θ̂Q is a single estimate, in which case ΣW and ΣB
also reduce to VW and VB (see Equations 7.13 and 7.14).
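To make the pooling arithmetic concrete, the three covariance formulas can be sketched in a few lines of NumPy. The helper below is hypothetical (it is not code from the book or from any particular software package):

```python
import numpy as np

def pool_covariance(estimates, covariances):
    """Combine M sets of results per Equations 7.30-7.32.

    estimates:   M length-Q parameter vectors, one per imputed data set
    covariances: M Q-by-Q complete-data parameter covariance matrices
    """
    est = np.asarray(estimates, dtype=float)          # M x Q
    M = est.shape[0]
    theta_bar = est.mean(axis=0)                      # pooled point estimates
    # Within-imputation covariance: average the M complete-data matrices (7.30)
    S_W = np.mean(np.asarray(covariances, dtype=float), axis=0)
    # Between-imputation covariance: sample covariance of the estimates (7.31)
    dev = est - theta_bar
    S_B = dev.T @ dev / (M - 1)
    # Total covariance adds both sources plus a finite-M correction (7.32)
    S_T = S_W + S_B + S_B / M
    return theta_bar, S_W, S_B, S_T
```

With Q = 1, the three matrices collapse to the scalar quantities VW, VB, and VT from Rubin's univariate pooling rules.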
Numerous authors express concern about the previous covariance matrix, because
ΣB is very noisy (and potentially rank deficient) when the number of imputations is
small to moderate (Meng & Rubin, 1992; Reiter & Raghunathan, 2007; Schafer, 1997;
van Buuren, 2012). The usual solution is to impose a structure on the covariance matrix
(i.e., reduce the number of free moving parts) by assuming that the elements in ΣB are
proportional to those in ΣW. In practical terms, this simplification implies that missing
values impact the precision of all parameters by the same amount (e.g., missing values
uniformly increase the squared standard errors of the regression slopes by 10%).
Li, Raghunathan, and Rubin (1991) proposed a simplified estimate of ΣT that uses
the average relative increase in variance to adjust the within-­imputation covariance
matrix for missing data uncertainty. Recall from earlier in the chapter that the relative
increase in variance is between-­imputation variation expressed as a proportion of the
within-­imputation variation (e.g., RIV = 0.10 means that additional noise due to missing
data is about 10% as large as complete-­data sampling error). If the within- and between-­
imputation covariance matrices are proportional, then the average relative increase in
variance for the parameters in θ̂Q is
 1
(−1
 1 +  tr S B S W )
ARIVW =  
M
(7.33)
Q
and the variance–­covariance matrix of the estimates becomes

\hat{\Sigma}_T = \left( 1 + \text{ARIV}_W \right) \Sigma_W \qquad (7.34)

This simplified expression replaces a problematic covariance matrix ΣB with a single
proportion (the ARIVW term) that reflects the overall impact of missing data on precision.
Computer simulation studies suggest that this simplification is often innocuous
and improves the test statistic's behavior in small samples (Grund, Lüdtke, & Robitzsch,
2016c; Liu & Enders, 2017).
Finally, the multiple imputation Wald statistic is computed by substituting
imputation-­based components into Equation 7.29 as follows:

T_W = \left( \hat{\theta}_Q - \theta_0 \right)^{\prime} \hat{\Sigma}_T^{-1} \left( \hat{\theta}_Q - \theta_0 \right) = \frac{\left( \hat{\theta}_Q - \theta_0 \right)^{\prime} \Sigma_W^{-1} \left( \hat{\theta}_Q - \theta_0 \right)}{1 + \text{ARIV}_W} \qquad (7.35)

The right side of the equation decomposes the test statistic into a complete-­data compo-
nent based on pooled quantities (the numerator) and a correction term that deflates the
test statistic to compensate for missing data (the denominator). This composition paral-
lels that of the t-statistic in Equation 7.20, and TW simplifies to t2 when testing a single
parameter (i.e., Q = 1).
A probability value for the Wald test is obtained by referencing TW to a chi-­square
distribution with Q degrees of freedom (Asparouhov & Muthén, 2010b) or by referenc-
ing TW ÷ Q to the approximate F distribution with Q numerator degrees of freedom and
df W denominator degrees of freedom (Li, Raghunathan, et al., 1991).
df_W = 4 + \left( Q(M-1) - 4 \right) \left[ 1 + \left( 1 - \frac{2}{Q(M-1)} \right) \frac{1}{\text{ARIV}_W} \right]^2 \qquad (7.36)

The authors give an alternative expression for rare situations where Q × (M – 1) ≤ 4.


Consistent with its univariate counterpart, df W often exceeds the total sample size. This
feature is obviously undesirable, because missing values should decrease, not increase,
the amount of information in the data. Extending ideas from Barnard and Rubin (1999),
Reiter (2007) proposed an alternative expression that is always less than or equal to
the degrees of freedom from a complete-­data analysis. In the interest of space, I refer
interested readers to the original source or to Reiter and Raghunathan (2007) for this
lengthy expression. Computer simulation studies suggest that significance tests based
on Reiter’s derivations are preferable with small samples, but the reference distribution
has very little impact if the sample size exceeds 200 or so (Grund et al., 2016c; Liu &
Enders, 2017; Reiter, 2007).
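The Wald computations are compact enough to sketch directly. The helper below is hypothetical; it takes the pooled estimates and the within- and between-imputation covariance matrices as inputs and applies the classic Li, Raghunathan, and Rubin (1991) degrees of freedom:

```python
import numpy as np

def mi_wald_test(theta_bar, theta0, S_W, S_B, M):
    """Multiple imputation Wald test (Equations 7.33, 7.35, and 7.36)."""
    theta_bar, theta0 = np.asarray(theta_bar), np.asarray(theta0)
    Q = len(theta_bar)
    # Average relative increase in variance under proportionality (Eq. 7.33)
    ariv = (1 + 1 / M) * np.trace(S_B @ np.linalg.inv(S_W)) / Q
    diff = theta_bar - theta0
    # Complete-data component deflated by the missing data correction (Eq. 7.35)
    T_W = diff @ np.linalg.inv(S_W) @ diff / (1 + ariv)
    # Classic denominator degrees of freedom; assumes Q * (M - 1) > 4 (Eq. 7.36)
    t = Q * (M - 1)
    df_W = 4 + (t - 4) * (1 + (1 - 2 / t) / ariv) ** 2
    return T_W, ariv, df_W
```

Referencing T_W ÷ Q to an F distribution with Q numerator and df_W denominator degrees of freedom then yields the probability value.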

Likelihood Ratio Test


The likelihood ratio statistic evaluates the relative fit of two nested models. Nested mod-
els can take a variety of forms, but a common application compares the substantive
analysis to a more restrictive version of the model that fixes a subset of parameters to
0. Returning to the earlier moderated regression analysis, we could use the likelihood
ratio statistic to evaluate the null hypothesis that R2 = 0 by comparing the fit of the focal
model to that of an empty model that constrains the slope coefficients to 0. A slightly
different application of the likelihood ratio test occurs in structural equation modeling
analyses where a researcher compares the fit of a saturated model (i.e., a model that
places no restrictions on the mean vector and covariance matrix) to that of a more parsi-
monious analysis model that imposes a structure on the data (e.g., a confirmatory factor
analysis model). In either scenario, the simpler model with Q fewer parameters aligns
with the null hypothesis, so I denote the restricted model’s parameters as θ0 and the full
model’s parameters as θ.
As a quick review, the likelihood ratio statistic from Chapter 2 is shown below:

T_{LR} = -2 \left( LL\left( \hat{\theta}_0 \mid \text{data} \right) - LL\left( \hat{\theta} \mid \text{data} \right) \right) \qquad (7.37)

The log-­likelihood values for the two models, LL(θ̂0|data) and LL(θ̂|data), are com-
puted by substituting the sample data and the maximum likelihood estimates into
a distribution function such as the normal curve equation. The numerical value of
each log-­likelihood is the sum of N individual fit terms, each of which is scaled as a
negative number, with higher (i.e., less negative) values reflecting better fit. The more
complex model with additional parameters will always achieve better fit and a higher
log-­likelihood, but that improvement should be very small when the null hypothesis
is true.
The rest of this section describes the classic likelihood ratio statistic from Meng
and Rubin (1992). Chung and Cai (2019) and Lee and Cai (2012) propose comparable
test statistics in the structural equation modeling framework. Recall that the Wald test
incorporates a component based on pooled quantities and an adjustment term that cor-
rects the test statistic for missing data. The likelihood ratio statistic shares this struc-
ture, because Meng and Rubin (1992) used the asymptotic equivalence of the two tests
to derive expressions that essentially map the Wald statistic onto the log-­likelihood
metric. I continue using the moderated regression analysis to illustrate a test that evalu-
ates the null hypothesis that R2 = 0.
As a starting point, reconsider the part of the Wald test that depends on pooled
quantities, shown in the numerator of Equation 7.35. Assuming a large sample size, an
analogous quantity is computed by averaging likelihood ratio tests that evaluate the
relative fit of the imputed data sets to the pooled estimates in θ̂0 and θ̂.

T_{\text{Pooled}} = \frac{1}{M} \sum_{m=1}^{M} -2 \left( LL\left( \hat{\theta}_0 \mid \text{data}_m \right) - LL\left( \hat{\theta} \mid \text{data}_m \right) \right) \qquad (7.38)

In this example, θ̂ references the pooled estimates from the moderated regression, and
θ̂0 contains the pooled estimates from the empty regression model comprised of an
intercept (grand mean) and variance. The likelihood ratio statistic for data set m is com-
puted by substituting the pooled estimates and the filled-­in data into complete-­data log
likelihood functions from Chapter 2 (see Equation 2.33).
As you will see, the final test statistic attenuates T Pooled by an amount that depends
on the average relative increase in variance. Computing this correction term requires a
second set of test statistics that evaluate the relative fit of data set m to its own estimates.
This average is
\bar{T}_{LR} = \frac{1}{M} \sum_{m=1}^{M} -2 \left( LL\left( \hat{\theta}_{0m} \mid \text{data}_m \right) - LL\left( \hat{\theta}_m \mid \text{data}_m \right) \right) \qquad (7.39)

where θ̂0m and θ̂m are imputation-­specific estimates rather than pooled values.
Meng and Rubin (1992) show that the likelihood equivalent of the ARIV W is com-
puted as follows:
\text{ARIV}_{LR} = \frac{M+1}{Q(M-1)} \times \left( T_{\text{Pooled}} - \bar{T}_{LR} \right) \qquad (7.40)

In this expression, the difference between TPooled and T̄LR captures between-imputation
variation (the two averages are identical with no missing values, and their difference
increases with the missing data rates). Finally, the likelihood ratio statistic is as follows:
T_{LR} = \frac{T_{\text{Pooled}}}{1 + \text{ARIV}_{LR}} \qquad (7.41)
Like the Wald test, the likelihood ratio statistic consists of a complete-­data component
that relies on pooled quantities (the numerator) and a correction term that deflates the
test statistic to compensate for missing data (the denominator). A probability value for
the likelihood ratio test is obtained by referencing TLR to a chi-­square distribution with
Q degrees of freedom (Asparouhov & Muthén, 2010b) or by referencing TLR ÷ Q to the
approximate F distribution given by Li, Raghunathan, et al. (1991) or Reiter (2007).
These are the same reference distributions as the Wald statistic.
Although theoretically implausible, TLR can be negative (Enders & Mansolf, 2018;
Liu & Enders, 2017; Schafer, 1997). The TPooled term is usually responsible for this odd
result, as the nested model can achieve superior fit if the effect size is close to 0 and
the fractions of missing information are very large. This same issue can also cause the
difference between TPooled and T̄LR (and thus the ARIVLR value) to be negative. This
result is also nonsensical, because it implies that the estimates have negative between-
imputation variation. The likelihood ratio test should not be trusted when ARIVLR is
negative, but it may be reasonable to set the statistic to 0 if ARIVLR is positive and TLR is
slightly negative (Schafer, 1997).
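Given the per-imputation log-likelihood values, the Meng–Rubin computations reduce to a few lines. The sketch below assumes the log-likelihoods have already been evaluated; the function and argument names are hypothetical:

```python
import numpy as np

def mi_lr_test(ll_null_pooled, ll_full_pooled, ll_null_own, ll_full_own, Q):
    """Meng-Rubin likelihood ratio pooling (Equations 7.38-7.41).

    Each argument holds M log-likelihoods: data set m evaluated at the
    pooled estimates (*_pooled) or at its own estimates (*_own).
    """
    ll_null_pooled = np.asarray(ll_null_pooled, dtype=float)
    ll_full_pooled = np.asarray(ll_full_pooled, dtype=float)
    M = len(ll_null_pooled)
    # Average test based on the pooled estimates (Eq. 7.38)
    T_pooled = np.mean(-2 * (ll_null_pooled - ll_full_pooled))
    # Average test based on imputation-specific estimates (Eq. 7.39)
    T_bar = np.mean(-2 * (np.asarray(ll_null_own) - np.asarray(ll_full_own)))
    # Likelihood-metric ARIV (Eq. 7.40); a negative value signals trouble
    ariv_lr = (M + 1) / (Q * (M - 1)) * (T_pooled - T_bar)
    # Final statistic deflates the pooled component (Eq. 7.41)
    return T_pooled / (1 + ariv_lr), ariv_lr
```

A practical implementation would also check for a negative ARIV and warn the user, per the caveats above.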

Pooling Chi‑Square Statistics


Rubin (1987) and Li, Meng, Raghunathan, and Rubin (1991) outlined a third statistic
that pools a set of M chi-­square tests. This so-­called D2 statistic is primarily useful when
statistical software doesn’t offer Wald or likelihood ratio statistics for multiply imputed
data. Its creators characterize D2 and its p-value as a rough approximation, but recent
studies are more encouraging. For example, Grund et al. (2016c) report that D2 generally
works well as a tool for combining ANOVA significance tests, although the Wald and
likelihood ratio tests are preferable when missing data rates are very high (e.g., 50% or
larger). Liu and Sriutaisuk (2019) used D2 to pool model fit statistics from a confirma-
tory factor analysis with ordinal indicators and weighted least squares estimation, and
they, too, found good results, particularly when the imputation procedure can leverage
information from variables with little to no missing data.
The pooled chi-­square statistic is

T_{D2} = \frac{\bar{\chi}^2 - \frac{M+1}{M-1} \, Q \cdot \text{ARIV}_{D2}}{1 + \text{ARIV}_{D2}} \qquad (7.42)

where χ̄2 is the arithmetic average of the M chi-square statistics, and the average relative
increase in variance is

\text{ARIV}_{D2} = \left( 1 + \frac{1}{M} \right) \times \frac{1}{M-1} \sum_{m=1}^{M} \left( \sqrt{\chi^2_m} - \overline{\sqrt{\chi^2}} \right)^2 \qquad (7.43)

A probability value for the pooled chi-­square test is obtained by referencing TD2 to a chi-­
square distribution with Q degrees of freedom or by referencing TD2 ÷ Q to an approxi-
mate F distribution with Q numerator degrees of freedom and dfD2 denominator degrees
of freedom (Li, Meng, et al., 1991).
df_{D2} = Q^{-3/M} (M-1) \left( 1 + \frac{1}{\text{ARIV}_{D2}} \right)^2 \qquad (7.44)
Like other classic degrees of freedom expressions, dfD2 can exceed the sample size, but
this feature probably doesn’t matter if the sample size is reasonably large (e.g., greater
than 200).
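The D2 computations are simple enough to sketch directly; the square-root transformation in the ARIV line follows the Li, Meng, et al. (1991) definition. The helper below is hypothetical:

```python
import numpy as np

def mi_pooled_chisq(chisq, Q):
    """Pool M chi-square statistics with the D2 approach (Eqs. 7.42-7.44)."""
    chisq = np.asarray(chisq, dtype=float)
    M = len(chisq)
    chibar = chisq.mean()
    # ARIV based on the square roots of the M statistics (Eq. 7.43)
    roots = np.sqrt(chisq)
    ariv = (1 + 1 / M) * np.sum((roots - roots.mean()) ** 2) / (M - 1)
    # Pooled statistic (Eq. 7.42) and its denominator df (Eq. 7.44)
    T_D2 = (chibar - (M + 1) / (M - 1) * Q * ariv) / (1 + ariv)
    df_D2 = Q ** (-3 / M) * (M - 1) * (1 + 1 / ariv) ** 2
    return T_D2, ariv, df_D2
```

Because the routine needs only the M chi-square values and Q, it is easy to apply when software reports per-imputation test statistics but offers no pooled Wald or likelihood ratio test.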

Analysis Example
Returning to the moderated regression analysis, I used the three test statistics to evaluate
the null hypothesis that R2 equals 0. The F versions of the test statistics (i.e., T ÷ Q)
were FW(3, 240.79) = 23.74, FLR(3, 15652.44) = 20.62, and FD2(3, 1895.75) = 20.79, and
all probability values were less than .001. The corresponding average relative increase in
variance values were ARIVW = 0.12, ARIVLR = 0.15, and ARIVD2 = 0.26.
A few broad brushstroke observations are evident. First, the three test statistics are
relatively similar to one another, although the Wald statistic is slightly larger. Simulation
studies suggest that the Wald test may be more powerful than the likelihood ratio statistic
in some cases (Grund et al., 2016c; Liu & Enders, 2017), but that probably doesn't
account for the difference here, because the effect size is relatively large. Second, the
degrees of freedom are dramatically different, because I applied Reiter’s (2007) small-­
sample adjustment for the Wald test, and I used expressions from Li, Raghunathan, et al.
(1991) and Li, Meng, et al. (1991) for the likelihood ratio and D2 statistics, respectively.
The large degrees of freedom values for the latter approaches are troubling, because
they greatly exceed the sample size (i.e., missing values appear to increase rather than
decrease the amount of information in the data). That said, the choice of reference dis-
tribution doesn’t appear to matter with a sample size this large (Liu & Enders, 2017).
Third, the often-­derided TD2 statistic is well calibrated to the others, particularly the
likelihood ratio test. This is not entirely unexpected, as results from Grund et al. (2016c)
suggest that the test performs well when the rates of missing information are not too
high and auxiliary information is available (as is the case here). The average relative
increase in variance for TD2 is dramatically higher than the others, but this difference
doesn’t appear to impact the test statistic itself.

7.13 SUMMARY AND RECOMMENDED READINGS

Maximum likelihood and Bayesian estimation are direct estimators that extract the
model parameters of interest directly from the observed data. In contrast, multiple
imputation puts the filled-­in data front and center, and the goal is to create suitable
imputations for later analysis. A typical application comprises three steps: Create several
filled-­in data sets, analyze the completed data, and pool and test the estimates. The first
step co-opts the MCMC algorithms from Chapters 5 and 6, and the analysis and pool-
ing stages use “Rubin’s rules” (Little & Rubin, 2020; Rubin, 1987) to combine estimates
and standard errors into a single package of results. Given the same input variables and
assumptions, multiple imputation estimates are usually indistinguishable from those of
maximum likelihood or Bayesian estimation.
As an organizational tool, I classified multiple imputation procedures into two
buckets according to the degree of similarity between the imputation and analysis mod-
els: An agnostic imputation strategy deploys a model that differs from the substantive
analysis, and a model-based imputation procedure invokes the same focal model as the
secondary analysis (perhaps with additional auxiliary variables). These classifications
emphasize that an analysis model’s composition—­in particular, whether it includes non-
linear effects such as interactions, polynomial terms, or random effects—­determines
the type of imputation strategy that works best. Model-based imputation is usually ideal
for these types of nonlinearities, whereas agnostic imputation is well suited for analyses
that do not include these special features. This distinction continues to be important in
Chapter 8, which covers multilevel missing data. As you will see, random coefficients
are yet another type of nonlinearity that requires a model-based missing data-­handling
strategy. Finally, I recommend the following articles for readers who want additional
details on topics from this chapter:

Bartlett, J. W., Seaman, S. R., White, I. R., & Carpenter, J. R. (2015). Multiple imputation of
covariates by fully conditional specification: Accommodating the substantive model. Statisti-
cal Methods in Medical Research, 24, 462–487.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical
Association, 91, 473–489.

Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8,


3–15.

Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-­data problems:
A data analyst’s perspective. Multivariate Behavioral Research, 33, 545–571.
Scheuren, F. (2005). Multiple imputation: How it began and continues. American Statistician,
59, 315–319.

van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional
specification. Statistical Methods in Medical Research, 16, 219–242.

van Buuren, S., Brand, J. P. L., Groothuis-­Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully
conditional specification in multivariate imputation. Journal of Statistical Computation and
Simulation, 76, 1049–1064.

Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psycho-
logical Methods, 22, 649–666.
8

Multilevel Missing Data

8.1 CHAPTER OVERVIEW

This chapter describes missing data-­handling procedures for hierarchically structured


or multilevel data sets that arise when observations are nested within higher-­level orga-
nizational units. Multilevel data are ubiquitous in a variety of disciplines, and examples
include repeated measurements nested within persons, students grouped in different
classrooms or schools, romantic partners paired within dyads, employees nested within
organizations, survey respondents grouped in different geographical regions, and clients
clustered within therapists, to name a few. The nesting can also extend to three levels
(e.g., repeated measurements nested within students, students nested in schools). A vari-
ety of data analytic tools are available for hierarchically structured data sets (Hamaker
& Muthén, 2020; McNeish & Kelley, 2019; McNeish, Stapleton, & Silverman, 2017),
and I focus on multilevel regression models with random effects, because they are an
exceedingly common data analytic tool in a wide range of disciplines. Several excellent
resources are available to readers who want additional information about these models
(Gelman & Hill, 2007; Hox, Moerbeek, & Van de Schoot, 2017; Raudenbush & Bryk,
2002; Snijders & Bosker, 2012; Verbeke & Molenberghs, 2000).
The emergence of missing data-­handling methods for multilevel models is an impor-
tant recent development (Carpenter et al., 2011; Carpenter & Kenward, 2013; Enders
et al., 2020; Enders, Keller, et al., 2018; Erler et al., 2016, 2019; Goldstein et al., 2009;
Goldstein et al., 2014; Quartagno & Carpenter, 2016; Shin, 2013; Shin & Raudenbush,
2007, 2013; van Buuren, 2011; Yucel, 2008, 2011). I devote much of the chapter to Bayes-
ian estimation and model-based imputation, because they are arguably more adept at
handling multilevel missing data problems than current maximum likelihood estima-
tors. As you will see, missing data estimation for multilevel regression models builds
on previously established ideas; an MCMC algorithm estimates a factored regression
that includes the focal model and one or more supporting regressor models, after which
it uses the resulting estimates to construct distributions of missing values. As always,
imputations are just predicted values plus random noise, albeit from a more complex
multilevel model. The chapter also describes multilevel extensions of joint model impu-
tation, fully conditional specification, and maximum likelihood.

8.2 RANDOM INTERCEPT REGRESSION MODELS

To begin, I use the math problem-­solving data set from the companion website to illus-
trate Bayesian missing data handling for a two-level regression model. The data come
from an educational experiment where J = 29 schools were randomly assigned to an
experimental or comparison condition. There was an average of nj = 33.86 students per
school. Following the parlance of the multilevel modeling literature, students are level-1
units and schools are level-2 units (or clusters). The comparison condition (i.e., control
schools) implemented the district’s standard mathematics curriculum, and the interven-
tion schools implemented a new curriculum designed to enhance math problem-­solving
skills. The dependent variable is an end-of-year math problem-­solving assessment with
item response theory (IRT)-scaled scores ranging between 37 and 65. The data set and
the variable definitions are described in the Appendix.
A key feature of hierarchical data is that variation and covariation can exist at both
levels of the data hierarchy. Applied to the educational data, student-­level variables such
as problem-­solving test scores naturally vary across individuals within a given school,
and schools also differ in their average levels of these variables. The dependent variable’s
intraclass correlation is approximately .26, meaning that school-­level mean differences
in average problem-­solving account for roughly 26% of the total variation; this value is
typical for cluster-­randomized designs (Hedges & Hedberg, 2007; Spybrook et al., 2011).
It is important to keep in mind that level-1 regressors can possess the same sources of
variation. As you will see, Bayesian estimation creates missing values that preserve this
important feature of the data.
As the name implies, a random intercept model is a regression with group-­specific
intercept coefficients. These models are noteworthy, because they are amenable to a
variety of missing data-­handling options, including agnostic imputation schemes and
maximum likelihood estimation. As a starting point, consider a model that features
standardized math scores as a student-­level predictor and teacher experience (in years)
as a school-­level covariate. The within-­cluster regression model describes score varia-
tion among students in the same school. The model and its generic counterpart are

\text{PROBSOLVE}_{ij} = \beta_{0j} + \beta_1 \left( \text{STANMATH}_{ij} \right) + \varepsilon_{ij} \qquad (8.1)

Y_{ij} = \beta_{0j} + \beta_1 X_{1ij} + \varepsilon_{ij}

\varepsilon_{ij} \sim N_1 \left( 0, \sigma^2_{\varepsilon} \right)
where Yij represents the outcome score for student i in school j, X1ij is that student’s
covariate value (e.g., standardized math test score), β0j is the random intercept coef-
ficient for school j, β1 is a common slope coefficient, and εij is a within-­cluster residual
that captures unexplained variation in the dependent variable. Residuals are normally
distributed by assumption with constant variation across all schools. To illustrate the
equation, the dashed lines in Figure 8.1 depict the group-specific regression lines for
the 29 schools, and the solid line is the average trajectory. The association between
standardized math test scores and problem solving is constant across schools, but the
vertical separation of the regression lines reflects between-cluster variation in the aver-
age levels of problem solving (i.e., random intercept variation).
Because multilevel models view level-2 groups (e.g., schools) as a random sample
from a larger population of higher-level clusters, each school-specific mean or inter-
cept β0j functions as a latent variable or random effect with its own distribution. The
between-cluster model expresses these school-level differences (the vertical elevation of
the regression lines) as a function of teacher experience, as follows:

\beta_{0j} = \beta_0 + \beta_2 \left( \text{TEACHEXP}_j \right) + b_{0j} \qquad (8.2)

\beta_{0j} = \beta_0 + \beta_2 X_{2j} + b_{0j}

b_{0j} \sim N_1 \left( 0, \sigma^2_{b0} \right)
where β0 is the mean intercept across all schools, β2 gives the influence of the level-2
covariate (average years of teaching experience for school j) on the average level of the
dependent variable, and b0j is a residual capturing unexplained between-cluster variation.
The level-2 residuals are also normal by assumption, and a between-cluster variance σ²b0
quantifies this variation.

FIGURE 8.1. Within-cluster regressions for 29 schools. The vertical separation of the regression lines reflects random intercept variation, and all schools share a common slope.
Finally, replacing the β0j term from the within-­cluster model with the right side of
its level-2 equation combines the two models into a single equation.

\text{PROBSOLVE}_{ij} = \beta_{0j} + \beta_1 \left( \text{STANMATH}_{ij} \right) + \beta_2 \left( \text{TEACHEXP}_j \right) + \varepsilon_{ij} \qquad (8.3)

Y_{ij} = \left( \beta_0 + b_{0j} \right) + \beta_1 X_{1ij} + \beta_2 X_{2j} + \varepsilon_{ij} = E\left( Y_{ij} \mid X_{1ij}, X_{2j} \right) + \varepsilon_{ij}

Y_{ij} \sim N_1 \left( E\left( Y_{ij} \mid X_{1ij}, X_{2j} \right), \sigma^2_{\varepsilon} \right)

The bottom expression says that dependent variable scores are normally distributed
around predicted values (the E(Yij | X1ij, X2j) term) that encode the school-specific
intercepts. Visually, these predictions correspond to points on the cluster-specific regression
lines in Figure 8.1.
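A small simulation clarifies how the combined model generates scores at both levels. All parameter values below are hypothetical and chosen purely for illustration; they are not estimates from the math problem-solving data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed effects and variance components (illustration only)
beta0, beta1, beta2 = 50.0, 0.15, 0.8   # intercept and slopes
var_b0, var_e = 6.0, 18.0               # intercept and residual variances

J, n_j = 29, 34                          # schools and students per school
teach_exp = rng.uniform(1, 20, size=J)   # level-2 covariate (X2j)
b0 = rng.normal(0, np.sqrt(var_b0), J)   # random intercept residuals (b0j)

schools = []
for j in range(J):
    stan_math = rng.normal(50, 10, n_j)  # level-1 covariate (X1ij)
    # Predicted values encode the school-specific intercepts (Eq. 8.3)
    mu = (beta0 + b0[j]) + beta1 * stan_math + beta2 * teach_exp[j]
    y = rng.normal(mu, np.sqrt(var_e))   # simulated outcome scores
    schools.append(y)
```

The vertical separation among the school-specific regression lines comes entirely from the b0 draws, mirroring the spread of the dashed lines in Figure 8.1.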

Factored Regression Specification


The factored regression specification we’ve been using throughout the book readily
extends to multilevel models. Applying the strategy to the simple two-­predictor model
clarifies the setup for the ensuing analysis example. Returning to established ideas,
a factored regression specification uses the probability chain rule to express a multi-
variate distribution as the product of two or more simpler distributions. The sequential
specification (Erler et al., 2016, 2019) features the product of three univariate distribu-
tions below:

f(PROBSOLVE, STANMATH, TEACHEXP) =
f(PROBSOLVE | STANMATH, TEACHEXP) × (8.4)
f(STANMATH | TEACHEXP) × f(TEACHEXP)

The first term is the normal distribution induced by the focal analysis (see Equation
8.3), the second term is the level-1 predictor’s model, and the third term is the marginal
(overall) distribution of the level-2 predictor. It is important to order the variables so
level-1 predictors condition on level-2 predictors, because higher-­level variables can pre-
dict lower-level variables but not vice versa.
The generic expressions from the previous factorization translate into a random
intercept model for the level-1 predictor and a single-­level regression model for the
level-2 covariate. The level-1 predictor model is

\text{STANMATH}_{ij} = \left( \gamma_{01} + g_{01j} \right) + \gamma_{11} \left( \text{TEACHEXP}_j \right) + r_{1ij} \qquad (8.5)

X_{1ij} = \left( \gamma_{01} + g_{01j} \right) + \gamma_{11} X_{2j} + r_{1ij}

r_{1ij} \sim N_1 \left( 0, \sigma^2_{r1} \right) \qquad g_{01j} \sim N_1 \left( 0, \sigma^2_{g01} \right)
where the γ’s are regression coefficients, g01j is a random intercept residual that captures
unexplained between-school variation in the average math scores (this variable's intraclass
correlation is approximately .38), and r1ij is a within-cluster deviation that reflects
test score variation among students in the same school. The level-2 covariate model is an
empty single-­level regression that features a grand mean and a between-­cluster devia-
tion score.

\text{TEACHEXP}_j = X_{2j} = \gamma_{02} + r_{2j} \qquad (8.6)

r_{2j} \sim N_1 \left( 0, \sigma^2_{r2} \right)
A partially factored model instead assigns a multivariate distribution to the explan-
atory variables (Enders et al., 2020; Goldstein et al., 2014). The generic factorization for
the problem-­solving analysis is as follows:

f ( PROBSOLVE | STANMATH, TEACHEXP ) × f ( STANMATH, TEACHEXP ) (8.7)

The multivariate distribution on the right is a two-part normal distribution with level-1
and level-2 components. This specification decomposes level-1 predictors into a within-­
cluster deviation involving a score and a group mean, and a between-­cluster deviation
between a group mean and the grand mean.

X_{1ij} = \mu_1 + \left( \mu_{1j} - \mu_1 \right) + \left( X_{1ij} - \mu_{1j} \right) \qquad (8.8)

It is important to highlight that μ1j is the latent group mean for cluster j (e.g., a latent
estimate of a school’s average achievement) rather than a deterministic arithmetic aver-
age. Accordingly, the within-­cluster regression model expresses the level-1 regressor
(e.g., standardized math test scores) as deviations around the level-2 latent group means.

X_{1ij} = \mu_{1j} + r_{1ij(W)} \qquad (8.9)

X_{1ij} \sim N_1 \left( \mu_{1j}, \sigma^2_{r1(W)} \right)
The alphanumeric subscript on the residual highlights that r1ij(W) measures within-­
cluster variation.
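The within/between decomposition in Equation 8.8 is easy to verify numerically. The toy scores below are hypothetical, and the arithmetic group means merely stand in for the latent means μ1j that MCMC would estimate:

```python
import numpy as np

# Toy level-1 scores for three clusters (hypothetical values)
clusters = [np.array([48.0, 52.0, 50.0]),
            np.array([58.0, 62.0]),
            np.array([39.0, 41.0])]

grand = np.concatenate(clusters).mean()          # grand mean (mu_1)
for scores in clusters:
    group_mean = scores.mean()                   # stand-in for latent mu_1j
    between = group_mean - grand                 # between-cluster deviation
    within = scores - group_mean                 # within-cluster deviations
    # Each score decomposes exactly as in Equation 8.8
    assert np.allclose(grand + between + within, scores)
```

The between deviations carry the cluster-level information, and the within deviations carry the student-level information; the two parts always sum back to the observed scores.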
The level-1 predictor correlates with the level-2 covariate (e.g., teacher experience)
via its latent group mean in the between-­cluster model, which comprises two empty
regression equations with correlated residuals.

\begin{pmatrix} \mu_{1j} \\ X_{2j} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} r_{1j(B)} \\ r_{2j(B)} \end{pmatrix} \qquad (8.10)

\mathbf{X}_{j(B)} \sim N_2 \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma^2_{r1(B)} & \sigma_{r1r2(B)} \\ \sigma_{r2r1(B)} & \sigma^2_{r2(B)} \end{pmatrix} \right)
The bottom expression says that the latent means and level-2 scores are normally distrib-
uted around their grand means, and the variables covary according to the off-­diagonal
element in the between-cluster covariance matrix, Σ(B). The pair of round-robin linear
regression equations below is an alternative way to parameterize this bivariate normal
distribution (Enders et al., 2020):

\mu_{1j} = \mu_1 + \gamma_{11} \left( X_{2j} - \mu_2 \right) + r_{1j(B)} \qquad (8.11)

X_{2j} = \mu_2 + \gamma_{12} \left( \mu_{1j} - \mu_1 \right) + r_{2j(B)}

As mentioned elsewhere, the distinction between Equations 8.10 and 8.11 is an algo-
rithmic nuance that has no practical impact on analysis results. Finally, notice that the
partially factored specification is ideally suited for models with centered predictors,
because the grand and group means are natural by-­products of estimation. Centering is
substantially more complicated with a sequential specification.

Distribution of Missing Values


Describing imputation for an incomplete outcome variable is a good starting point,
because the focal model alone defines the posterior predictive distribution of missing
values. Returning to the analysis model from Equation 8.3, the bottom row of the expres-
sion says that dependent variable scores are normally distributed around predicted val-
ues that encode school-­specific random intercepts. The model-­implied outcome distri-
bution serves double duty as the posterior predictive distribution of missing values.

Y_{ij(\text{mis})} \sim N_1 \left( E\left( Y_{ij} \mid X_{1ij}, X_{2j} \right), \sigma^2_{\varepsilon} \right) \qquad (8.12)

To illustrate imputation more concretely, Figure 8.2 shows the distribution of missing
outcome scores for three students who belong to different schools. The solid black circles
are predicted values (the expected value in the first argument of the normal distribution
function), and the spread of the normal curves reflects within-­school residual variance.
The candidate imputations fall directly on the vertical lines, but I added horizontal jitter
to emphasize that more scores are located near the center of each distribution. Concep-
tually, MCMC algorithm generates an imputation by randomly selecting a value from the
cluster of candidate scores (technically, the imputations can fall anywhere in the normal
distribution).
Turning to the regressors, the MCMC algorithm draws imputations from the condi-
tional distribution of an incomplete predictor given all other analysis variables. To illus-
trate, consider X1 (e.g., standardized math scores). Applying rules of probability reveals
that this distribution is proportional to the product of two univariate distributions, each
of which aligns with one of the previous regression models.

f ( X1 | Y , X 2 ) ∝ f ( Y | X1 , X 2 ) × f ( X1 | X 2 ) (8.13)

I’ve primarily used the Metropolis–­Hastings algorithm to sample imputations from


complicated composite functions, but it is instructive to look at the analytic distribution
to draw connections to earlier material.

FIGURE 8.2. Distribution of missing outcome scores for three students who belong to dif-
ferent schools. The solid black circles are predicted values, and the spread of the normal curves
reflects within-school residual variance. The candidate imputations fall directly on the vertical
lines, but horizontal jitter is added to emphasize that more scores are located near the center of
each distribution.

Dropping unnecessary scaling terms and substituting the kernels of the distribution functions from Equations 8.3 and 8.9 into the right side of the factorization gives the
following expression:

$$f\big(Y_{ij} \mid X_{1ij}, X_{2j}\big) \times f\big(X_{1ij} \mid X_{2j}\big) \propto \exp\!\left(-\frac{1}{2}\,\frac{\big(Y_{ij} - (\beta_{0j} + \beta_1 X_{1ij} + \beta_2 X_{2j})\big)^2}{\sigma^2_\varepsilon}\right) \times \exp\!\left(-\frac{1}{2}\,\frac{\big(X_{1ij} - \mu_{1j}\big)^2}{\sigma^2_{r_{1(W)}}}\right) \tag{8.14}$$
 
Deriving the conditional distribution of X1 involves multiplying the two normal
curve equations and performing algebra that combines the component functions into a
single distribution for X1. The result is a normal curve with two-part mean and variance
expressions that depend on the focal and regressor model parameters.

$$f\big(X_{1ij(\text{mis})} \mid Y_{ij}, X_{2j}\big) = N_1\big(E(X_{1ij} \mid Y_{ij}, X_{2j}),\, \text{var}(X_{1ij} \mid Y_{ij}, X_{2j})\big) \tag{8.15}$$

$$E\big(X_{1ij} \mid Y_{ij}, X_{2j}\big) = \text{var}\big(X_{1ij} \mid Y_{ij}, X_{2j}\big) \times \left(\frac{\mu_{1j}}{\sigma^2_{r_{1(W)}}} + \frac{\beta_1\big(Y_{ij} - (\beta_{0j} + \beta_2 X_{2j})\big)}{\sigma^2_\varepsilon}\right)$$

$$\text{var}\big(X_{1ij} \mid Y_{ij}, X_{2j}\big) = \left(\frac{1}{\sigma^2_{r_{1(W)}}} + \frac{\beta_1^2}{\sigma^2_\varepsilon}\right)^{-1}$$
In fact, the distribution’s structure is identical to the one for linear regression models
back in Section 5.3 (e.g., see Equation 5.12). The main difference is that the distribution
includes group-­specific latent variables or random effects that capture between-­group
differences (i.e., μ1j and β0j).
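Equation 8.15 can be verified numerically: multiply the two normal kernels from Equation 8.14 on a fine grid and compare the resulting mean and variance with the precision-weighted formulas. The parameter values below are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical parameter values (illustration only)
mu1j = 50.0                       # latent school mean of X1
sig2_r1w = 100.0                  # within-school variance of X1
beta0j, beta1, beta2 = 20.0, 0.25, 0.10
sig2_eps = 18.0
y_ij, x2j = 38.0, 12.0            # observed outcome and level-2 predictor

# Equation 8.15: precision-weighted conditional variance and mean
var_x1 = 1.0 / (1.0 / sig2_r1w + beta1**2 / sig2_eps)
e_x1 = var_x1 * (mu1j / sig2_r1w
                 + beta1 * (y_ij - (beta0j + beta2 * x2j)) / sig2_eps)

# Numerical check: evaluate the product of the two kernels in Equation 8.14
grid = np.linspace(-200.0, 300.0, 200_001)
kernel = (np.exp(-0.5 * (y_ij - (beta0j + beta1 * grid + beta2 * x2j))**2 / sig2_eps)
          * np.exp(-0.5 * (grid - mu1j)**2 / sig2_r1w))
density = kernel / kernel.sum()            # normalize on the (uniform) grid
num_mean = (grid * density).sum()
num_var = ((grid - num_mean)**2 * density).sum()

print(round(e_x1, 3), round(num_mean, 3))   # the two means agree
print(round(var_x1, 3), round(num_var, 3))  # and so do the variances
```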
Next, consider the distribution of the level-2 predictor, X2 (e.g., school-­average
teacher experience). Working from the partially factored specification, the conditional
distribution of missing values is the product of two univariate distributions (the sequen-
tial specification yields a triple product).

$$f(X_2 \mid Y, X_1) \propto f(Y \mid X_1, X_2) \times f(X_2 \mid X_1) \tag{8.16}$$

The first term corresponds to the focal regression, and the second term corresponds to
the between-­cluster regression from Equation 8.11. A level-2 predictor like X2 is com-
mon to all members of a given level-2 cluster (e.g., all students within a given school
share the same teacher experience value). To accommodate this feature, the analysis
model’s contribution to the distribution of missing values repeats nj times, once for each
observation in group j.
$$f\big(X_{2j(\text{mis})} \mid Y_{ij}, X_{1ij}\big) \propto \prod_{i=1}^{n_j} N_1\big(E(Y_{ij} \mid X_{1ij}, X_{2j}),\, \sigma^2_\varepsilon\big) \times N_1\big(E(X_{2j} \mid \mu_{1j}),\, \sigma^2_{r_2}\big)$$

$$\propto \prod_{i=1}^{n_j} \exp\!\left(-\frac{1}{2}\,\frac{\big(Y_{ij} - (\beta_{0j} + \beta_1 X_{1ij} + \beta_2 X_{2j})\big)^2}{\sigma^2_\varepsilon}\right) \times \exp\!\left(-\frac{1}{2}\,\frac{\big(X_{2j} - (\mu_2 + \gamma_{12}(\mu_{1j} - \mu_1))\big)^2}{\sigma^2_{r_{2(B)}}}\right) \tag{8.17}$$
Returning to form, the Metropolis–­Hastings algorithm is a convenient tool for drawing
imputations from complex functions like the previous one, because it works from the
simpler component distributions, in this case a pair of normal curves.
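A random-walk Metropolis–Hastings step for a missing level-2 predictor works directly from the two component kernels in Equation 8.17: sum the focal-model log-kernels over the cluster's members, add the level-2 log-kernel, and accept or reject a proposal on the log scale. The sketch below uses hypothetical values for one school with five students; because every piece is normal here, the sampler can be checked against the exact conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical values for one school with n_j = 5 students (illustration only)
y = np.array([30.0, 28.0, 35.0, 26.0, 33.0])      # outcomes
x1 = np.array([45.0, 50.0, 60.0, 40.0, 55.0])     # level-1 predictor
beta0j, beta1, beta2 = 25.0, 0.20, 0.30           # focal-model parameters
sig2_eps = 18.0
mu2, gamma12, mu1j, mu1 = 12.0, 0.05, 50.0, 48.0  # between-cluster model
sig2_r2b = 9.0

def log_target(x2):
    # Log of Equation 8.17: the focal-model kernel repeated n_j times,
    # times the between-cluster regression kernel
    focal = -0.5 * np.sum((y - (beta0j + beta1 * x1 + beta2 * x2))**2) / sig2_eps
    level2 = -0.5 * (x2 - (mu2 + gamma12 * (mu1j - mu1)))**2 / sig2_r2b
    return focal + level2

# Random-walk Metropolis-Hastings
draws = np.empty(50_000)
x2, cur_lp = 10.0, log_target(10.0)
for t in range(draws.size):
    cand = x2 + rng.normal(0.0, 2.5)
    cand_lp = log_target(cand)
    if np.log(rng.uniform()) < cand_lp - cur_lp:   # accept with prob min(1, ratio)
        x2, cur_lp = cand, cand_lp
    draws[t] = x2

samples = draws[5_000:]                            # discard burn-in
print(round(samples.mean(), 2), round(samples.std(), 2))
```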

MCMC Algorithm
The posterior distribution for a multilevel analysis is a complicated multivariate func-
tion describing the relative probability of different combinations of model parameters,

random effects, latent group means, and missing values given the observed data. The
core logic of Bayesian estimation and MCMC algorithms readily extends to multilevel
data: Estimate one unknown at a time (e.g., parameter, latent variable, missing score),
holding all other quantities at their current values. The generic MCMC recipe for a mul-
tilevel regression model is as follows:

Assign starting values to all parameters, random effects, and missing values.
Do for t = 1 to T iterations.
> Estimate the focal model’s parameters, given everything else.
> Estimate the focal model’s random effects, given everything else.
> Estimate each predictor model’s parameters, given everything else.
> Estimate each predictor model’s random effects, given everything else.
> Impute the dependent variable given the focal model parameters.
> Impute each predictor, given the focal and supporting models.
Repeat.
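The recipe can be made concrete with a toy Gibbs sampler. The sketch below fits a random-intercept-only model with known variance components to simulated data, cycling through the same kinds of steps: update parameters, update random effects, impute missing outcomes. It is deliberately simplified (known variances, flat prior on the grand mean) and is not the full conditional machinery cited below.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate a toy two-level data set: 30 clusters of size 10, known variances
J, n = 30, 10
sig2_b, sig2_eps = 4.0, 9.0
mu_true = 50.0
b_true = rng.normal(0.0, np.sqrt(sig2_b), J)
y = mu_true + np.repeat(b_true, n) + rng.normal(0.0, np.sqrt(sig2_eps), J * n)
cluster = np.repeat(np.arange(J), n)
miss = rng.uniform(size=y.size) < 0.2      # 20% of outcomes missing
y[miss] = np.nan

# Gibbs sampler following the recipe: one unknown at a time, given the rest
mu, b = 0.0, np.zeros(J)
y_imp = np.where(miss, 0.0, y)             # starting values for missing scores
mu_draws = []
for t in range(2_000):
    # 1. Update the grand mean, given random effects and filled-in data
    resid = y_imp - b[cluster]
    mu = rng.normal(resid.mean(), np.sqrt(sig2_eps / y.size))
    # 2. Update each cluster's random effect, given everything else
    var_j = 1.0 / (n / sig2_eps + 1.0 / sig2_b)
    for j in range(J):
        r = y_imp[cluster == j] - mu
        b[j] = rng.normal(var_j * r.sum() / sig2_eps, np.sqrt(var_j))
    # 3. Impute missing outcomes from their posterior predictive distribution
    pred = mu + b[cluster]
    y_imp[miss] = rng.normal(pred[miss], np.sqrt(sig2_eps))
    if t >= 500:                            # retain post-burn-in draws
        mu_draws.append(mu)

print(round(float(np.mean(mu_draws)), 1))   # near the true grand mean of 50
```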

The full conditional distributions for MCMC estimation are widely available in the lit-
erature (Browne, 1998; Browne & Draper, 2000; Enders et al., 2020; Goldstein et al.,
2009; Kasim & Raudenbush, 1998; Lynch, 2007; Schafer & Yucel, 2002; Yucel, 2008). In
the interest of space, I point readers to these sources for additional details.

Analysis Example
Expanding on the math problem-­solving analysis, I used Bayesian estimation to fit a
random intercept regression model with three level-1 predictors (a pretest measure of
math problem solving, standardized math test scores, and a binary indicator of whether
a student is eligible for free or reduced-­price lunch) and a pair of level-2 predictors
(average years of teacher experience, and the treatment assignment indicator).

$$\begin{aligned} \text{PROBSOLVE}_{ij} = \beta_0 + b_{0j} &+ \beta_1\big(\text{PRETEST}_{ij}\big) + \beta_2\big(\text{STANMATH}_{ij}\big) + \beta_3\big(\text{FRLUNCH}_{ij}\big) \\ &+ \beta_4\big(\text{TEACHEXP}_j\big) + \beta_5\big(\text{CONDITION}_j\big) + \varepsilon_{ij} \end{aligned} \tag{8.18}$$
The β5 coefficient is of particular interest, as this parameter represents the mean dif-
ference for intervention schools (0 = comparison school, 1 = intervention school), con-
trolling for student- and school-­level covariates. The pretest and treatment assignment
indicators are complete, but the remaining variables have missing values; the missing
data rates are 20.5% for the dependent variable, 7.3% for the standardized math test,
4.7% for the lunch assistance indicator, and 10.3% for the school-­level teacher experi-
ence variable.
Either the sequential or the partially factored specification is appropriate for the
explanatory variables. The sequential specification invokes the product of univariate
distributions like the following:

$$\begin{aligned} &f(\text{PROBSOLVE} \mid \text{PRETEST}, \text{STANMATH}, \text{FRLUNCH}, \text{TEACHEXP}, \text{CONDITION}) \\ \times\, &f(\text{STANMATH} \mid \text{FRLUNCH}, \text{PRETEST}, \text{TEACHEXP}, \text{CONDITION}) \\ \times\, &f(\text{FRLUNCH}^{*} \mid \text{PRETEST}, \text{TEACHEXP}, \text{CONDITION}) \\ \times\, &f(\text{PRETEST} \mid \text{TEACHEXP}, \text{CONDITION}) \\ \times\, &f(\text{TEACHEXP} \mid \text{CONDITION}) \times f(\text{CONDITION}^{*}) \end{aligned} \tag{8.19}$$
Two points are worth highlighting. First, predictor variables are sequenced in the follow-
ing order: incomplete level-1 predictors, complete level-1 predictors, incomplete level-2
predictors, complete level-2 predictors. This order minimizes the number of models
needed to impute missing predictor scores, and it honors the fact that higher-­level vari-
ables can predict lower-level variables but not vice versa. Second, each binary predictor
appears as a latent response variable in its own regression and as a dummy code in all other
models. The last term can be dropped, because the experimental condition code is com-
plete and does not require a distribution.
A partially factored specification instead assigns a multivariate normal distribution
to continuous predictors and latent response variables.

$$\begin{aligned} &f(\text{PROBSOLVE} \mid \text{PRETEST}, \text{STANMATH}, \text{FRLUNCH}, \text{TEACHEXP}, \text{CONDITION}) \\ \times\, &f(\text{PRETEST}, \text{STANMATH}, \text{FRLUNCH}^{*}, \text{TEACHEXP}, \text{CONDITION}^{*}) \end{aligned} \tag{8.20}$$
I generally focus on this specification, because it readily accommodates group and grand
mean centering operations that are routine with these models. Expanding on earlier
ideas, level-1 predictors decompose into within- and between-­cluster components, fol-
lowing Equation 8.8. The within-­cluster model expresses level-1 scores as correlated
deviations around their latent group means, as follows:
$$X_{ij(W)} = \begin{bmatrix} \text{PRETEST}_{ij} \\ \text{STANMATH}_{ij} \\ \text{FRLUNCH}^{*}_{ij} \end{bmatrix} = \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \end{bmatrix} + \begin{bmatrix} r_{1ij(W)} \\ r_{2ij(W)} \\ r_{3ij(W)} \end{bmatrix} \tag{8.21}$$

$$X_{ij(W)} \sim N_3\big(\boldsymbol{\mu}_j, \mathbf{S}_{(W)}\big)$$
A diagonal element of the within-­cluster covariance matrix is fixed at 1 to establish a
metric for the latent lunch assistance indicator, and the model also incorporates a fixed
threshold parameter for this variable. The model can be simplified by moving the com-
plete pretest scores to the right side of the equation as a predictor of the two incomplete
variables (i.e., treat pretest scores as known constants).
As explained elsewhere, modeling the multivariate normal distribution directly
is not straightforward when a covariance matrix contains a fixed constant. The set of
round-robin linear regression equations below is an alternative way to parameterize the
within-­cluster covariance matrix that avoids estimation difficulties associated with a
matrix that contains a fixed constant:

$$\begin{aligned} \text{PRETEST}_{ij} &= \mu_{1j} + \gamma_{11}\big(\text{STANMATH}_{ij} - \mu_{2j}\big) + \gamma_{21}\big(\text{FRLUNCH}^{*}_{ij} - \mu_{3j}\big) + r_{1ij} \\ \text{STANMATH}_{ij} &= \mu_{2j} + \gamma_{12}\big(\text{PRETEST}_{ij} - \mu_{1j}\big) + \gamma_{22}\big(\text{FRLUNCH}^{*}_{ij} - \mu_{3j}\big) + r_{2ij} \\ \text{FRLUNCH}^{*}_{ij} &= \mu_{3j} + \gamma_{13}\big(\text{PRETEST}_{ij} - \mu_{1j}\big) + \gamma_{23}\big(\text{STANMATH}_{ij} - \mu_{2j}\big) + r_{3ij} \end{aligned} \tag{8.22}$$

The level-1 predictors condition on level-2 covariates via their latent group means
in the between-­cluster model, which now comprises five empty regression equations
with correlated residuals.
$$X_{j(B)} = \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ \text{TEACHEXP}_j \\ \text{CONDITION}^{*}_{j} \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \end{bmatrix} + \begin{bmatrix} r_{1j(B)} \\ r_{2j(B)} \\ r_{3j(B)} \\ r_{4j(B)} \\ r_{5j(B)} \end{bmatrix} \tag{8.23}$$

$$X_{j(B)} \sim N_5\big(\boldsymbol{\mu}, \mathbf{S}_{(B)}\big)$$
A diagonal element of the between-­cluster covariance matrix is also fixed at 1 to estab-
lish a metric for the latent treatment condition indicator, and the model also incorpo-
rates a fixed threshold parameter for this variable. This model can also be parameterized
as a set of round-robin regressions like Equation 8.11.
The potential scale reduction factors (Gelman & Rubin, 1992) from a preliminary
diagnostic run indicated that the MCMC algorithm converged in fewer than 1,000 itera-
tions, so I used 12,000 total iterations with a conservative burn-in period of 2,000 itera-
tions. The same analysis that generates Bayesian summaries of the model parameters
can also generate model-based multiple imputations for a frequentist analysis. To illus-
trate, I created M = 100 imputations by saving the filled-­in data from the final iteration
of 100 parallel MCMC chains, each with 2,000 iterations. As explained previously, auto-
correlated imputations are not a concern with this approach.
After creating the multiple imputations, I used restricted maximum likelihood to fit
the random intercept regression model to each data set and applied Rubin’s (1987) rules
to pool the parameter estimates and standard errors. The Barnard and Rubin (1999)
small-­sample degrees of freedom adjustment for the t-statistic requires the complete-­
data degrees of freedom value as an input (see Equations 7.21 and 7.22). Following the
hierarchical linear modeling (HLM) software package (Raudenbush, Bryk, Cheong, &
Congdon, 2019), I used the number of schools minus the number of predictors minus 1
as the degrees of freedom for all coefficients (i.e., dfcom = 29 – 5 – 1). Analysis scripts are
on the companion website.
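The pooling arithmetic is easy to script. The function below applies Rubin's rules and the Barnard–Rubin degrees of freedom adjustment to one parameter's estimates; the input numbers are made up for illustration and are not the results reported in Table 8.2.

```python
import numpy as np

def pool_mi(estimates, variances, df_com):
    """Rubin's (1987) rules with the Barnard-Rubin (1999) df adjustment."""
    q = np.asarray(estimates, dtype=float)   # M point estimates
    u = np.asarray(variances, dtype=float)   # M squared standard errors
    m = q.size
    qbar = q.mean()                          # pooled point estimate
    w = u.mean()                             # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    t_var = w + (1 + 1 / m) * b              # total sampling variance
    lam = (1 + 1 / m) * b / t_var            # variance attributable to missing data
    df_old = (m - 1) / lam**2                # Rubin's (1987) degrees of freedom
    df_obs = (df_com + 1) / (df_com + 3) * df_com * (1 - lam)
    df_adj = 1 / (1 / df_old + 1 / df_obs)   # never exceeds df_com
    return qbar, np.sqrt(t_var), qbar / np.sqrt(t_var), df_adj

# Hypothetical results for one coefficient from M = 5 imputed data sets
est = [2.10, 2.25, 1.95, 2.30, 2.05]
se2 = [0.64, 0.66, 0.60, 0.70, 0.62]
qbar, se, tstat, df = pool_mi(est, se2, df_com=29 - 5 - 1)
print(round(qbar, 3), round(se, 3), round(tstat, 2), round(df, 1))
```

The returned t-statistic is referenced to a t distribution with the adjusted degrees of freedom.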
Table 8.1 gives the posterior summaries from the Bayesian analysis, and Table 8.2
summarizes the multiple imputation point estimates and standard errors. The primary
focus is the β5 coefficient, which indicates that intervention schools scored 2.15 points
higher than control group schools on average, controlling for student- and school-­level
covariates. It probably comes as no surprise that the Bayesian and frequentist results are
numerically similar, as we’ve seen numerous examples of this throughout the book.

TABLE 8.1. Posterior Summaries from a Random Intercept Model

Parameter Mdn SD LCL UCL
β0 29.10 2.11 25.00 33.27
β1 (PRETEST) 0.27 0.04 0.19 0.35
β2 (STANMATH) 0.19 0.02 0.15 0.23
β3 (FRLUNCH) –0.39 0.44 –1.23 0.49
β4 (TEACHEXP) 0.10 0.10 –0.11 0.30
β5 (CONDITION) 2.15 0.85 0.42 3.83
Intercept var. (σb20) 4.26 1.72 2.17 8.68
Residual var. (σε2) 18.57 0.98 16.79 20.67

Note. LCL, lower credible limit; UCL, upper credible limit.

The one difference was the intercept variance estimate, which was somewhat larger in the
Bayes analysis. This parameter was sensitive to the choice of prior distribution, which isn’t
necessarily surprising given the relatively small number of level-2 units (Gelman, 2006).
To explore the influence of different prior distributions, I implemented three inverse
gamma prior distributions for the random intercept variance: an improper prior that sub-
tracts two degrees of freedom and adds 0 to the sum of squares (Asparouhov & Muthén,
2010a), a more informative prior that adds two degrees of freedom and a value of 1 to the
sum of squares, and the Jeffreys prior described in Section 4.5. A practical way to gauge the
impact of the prior distribution is to express the random intercept variance as a variance-­
explained effect size (Rights & Sterba, 2019). Across the three priors, the intercept vari-
ance captured between 10.5 and 12.6% of the total variation in the outcome. I suspect
that most researchers would not find this variability practically meaningful, so Table 8.1
reports the results with an improper prior, which is the default in some software packages
(the Jeffreys prior brought the posterior median closer to the frequentist point estimate).
Multilevel modeling textbooks recommend inspecting level-2 residuals to identify
possible model misspecifications (Raudenbush & Bryk, 2002, Ch. 9; Snijders & Bosker,
2012, Ch. 10). Bayesian analyses and model-based imputation are ideally suited for this

TABLE 8.2. Model-Based Imputation Estimates from a Random Intercept Analysis

Parameter Est. SE t df p FMI
β0 29.18 2.08 14.05 16.73 < .001 .29
β1 (PRETEST) 0.27 0.04 6.75 15.39 < .001 .35
β2 (STANMATH) 0.19 0.02 9.38 16.70 < .001 .29
β3 (FRLUNCH) –0.39 0.44 –0.89 15.32 .39 .35
β4 (TEACHEXP) 0.09 0.10 0.88 18.43 .39 .21
β5 (CONDITION) 2.16 0.81 2.67 20.23 .02 .13
Intercept var. (σb20) 3.77 — — — — —
Residual var. (σε2) 18.62 — — — — —

Note. FMI, fraction of missing information.



[Figure 8.3: Relative Probability (y-axis) by School-Level Random Intercept Residuals (x-axis).]

FIGURE 8.3. Distribution of random intercept residuals from 100 imputations.

purpose, because the MCMC algorithm estimates the between-­cluster residuals at every
iteration. As such, you can treat the random effects as multiply imputed latent data
and save them for further inspection. To illustrate, Figure 8.3 shows the estimated dis-
tributions of the random intercept residuals (i.e., the b0j terms in Equation 8.18) from
the 100 imputed data sets. The residuals are normal by assumption, and the empirical
distributions are a reasonably good match to that ideal (skewness and excess kurtosis
were –0.28 and –0.25, respectively). Quantile–­quantile (Q-Q) plots are another option
for evaluating normality, and graphing the residuals against other variables can reveal
certain types of model misspecification.
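Computing these summaries requires only standardized moments. The snippet below applies the formulas to simulated normal draws that stand in for the saved b0j values (e.g., 29 schools by 100 imputations); with actual pooled residuals loaded in place of resid, the same two functions produce the kind of summary reported above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated normal draws standing in for pooled random-intercept residuals
resid = rng.normal(0.0, 2.0, 2_900)   # e.g., 29 schools x 100 imputations

def skewness(x):
    z = (x - x.mean()) / x.std()
    return (z**3).mean()              # 0 for a symmetric distribution

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return (z**4).mean() - 3.0        # 0 for a normal distribution

print(round(skewness(resid), 2), round(excess_kurtosis(resid), 2))
```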

8.3 RANDOM COEFFICIENT MODELS

Changing substantive contexts, I use the two-level daily diary data set from the com-
panion website to illustrate Bayesian estimation for a random coefficient regression
model. The data come from a health psychology study in which J = 132 participants with
chronic pain provided up to nj = 21 daily assessments of mood, sleep quality, and pain
severity. The data set also includes several person-­level demographic and background
variables (e.g., educational attainment, gender, number of diagnosed physical ailments,
activity level) and psychological correlates of pain severity (e.g., pain acceptance, cata-
strophizing). The structure of the data set is now quite different, as daily measurements
are level-1 units and persons are level-2 units (or clusters). The data set and the variable
definitions are described in the Appendix.

As explained previously, variation and covariation can exist at both levels of a mul-
tilevel hierarchy. Applied to the diary data, daily assessments (e.g., mood, pain, and sleep
quality) naturally vary within a given person, and individuals also differ in their average
levels of these variables. The dependent variable, positive affect, has an intraclass corre-
lation equal to .63, meaning that person-­level mean differences in average mood scores
account for roughly 63% of the total variation. This value is typical of repeated measures
data and is substantially larger than that of the student-­level problem-­solving scores
from the previous example. The daily predictors also have substantial between-­person
variation. As explained previously, the posterior predictive distributions of the missing
values preserve this important feature of the data.
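For balanced data, the intraclass correlation can be recovered with a simple method-of-moments decomposition. The snippet below simulates diary-style data with a true ICC of .63 (the variance components are hypothetical, not the study's estimates) and recovers it.

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulate balanced diary-style data with a true ICC of .63 (illustration only)
J, n = 132, 21                                     # persons and daily assessments
sig2_b, sig2_w = 0.63, 0.37                        # between- and within-person variance
person_means = rng.normal(4.0, np.sqrt(sig2_b), J)
y = (np.repeat(person_means, n)
     + rng.normal(0.0, np.sqrt(sig2_w), J * n)).reshape(J, n)

# Method-of-moments decomposition for balanced data
within = y.var(axis=1, ddof=1).mean()              # pooled within-person variance
between = y.mean(axis=1).var(ddof=1) - within / n  # subtract sampling noise in the means
icc = between / (between + within)
print(round(icc, 2))                               # close to .63
```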
A random coefficient model (also called a random slope model) is a multilevel
regression in which the influence of one or more level-1 explanatory variables varies
across level-2 units. As a starting point, consider a model that features daily pain ratings
as a level-1 predictor of daily positive affect and individual-­level average pain and pain
acceptance as level-2 predictors of mean positive affect. The within-­cluster regression
model describes daily score variation among affect scores from the same individual. The
model and its generic counterpart are as follows:

$$\begin{aligned} \text{POSAFFECT}_{ij} &= \beta_{0j} + \beta_{1j}\big(\text{PAIN}_{ij} - \mu_{1j}\big) + \varepsilon_{ij} \\ Y_{ij} &= \beta_{0j} + \beta_{1j}\big(X_{1ij} - \mu_{1j}\big) + \varepsilon_{ij} \\ \varepsilon_{ij} &\sim N_1\big(0, \sigma^2_\varepsilon\big) \end{aligned} \tag{8.24}$$
The j subscript on the intercept and slope coefficients signifies that these quantities
vary across persons (level-2 units). To illustrate, the dashed lines in Figure 8.4 depict
the within-­cluster regression lines for 25 individuals, and the solid line is the average
trajectory. Unlike Figure 8.3, there is considerable heterogeneity in both the intercept
and slope coefficients; the slopes are mostly negative, but some persons exhibit stronger
associations than others, and some individuals even have positive slopes.
Centering the daily pain ratings at the cluster (person) means isolates daily fluctua-
tions around a participant’s chronic pain level (i.e., group mean centering or centering
within context; Enders & Tofighi, 2007; Kreft, de Leeuw, & Aiken, 1995), such that
β1j is a “pure” within-­person effect. This centering scheme also defines β0j as an indi-
vidual’s average positive affect. Importantly, μ1j is a normally distributed latent mean
rather than a deterministic arithmetic average. Recent methodological research favors
this approach, as modeling cluster-­level quantities as latent variables can reduce bias in
some situations (Hamaker & Muthén, 2020; Lüdtke et al., 2008).
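With observed (arithmetic) means, group mean centering is a one-line computation; the latent-mean version the text favors replaces the observed means with MCMC-estimated latent means. A minimal observed-mean sketch with made-up pain ratings:

```python
import numpy as np

# Hypothetical daily pain ratings for two persons (three days each)
pain = np.array([3.0, 5.0, 4.0, 6.0, 2.0, 4.0])
person = np.array([0, 0, 0, 1, 1, 1])

# Observed person means; a Bayesian analysis would use latent means instead
means = np.array([pain[person == j].mean() for j in range(2)])

# Within-person deviations: daily fluctuation around each person's average
pain_within = pain - means[person]
print(pain_within)   # each person's deviations sum to zero
```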
The between-­cluster part of the model features the latent group means and pain
acceptance scores as predictors of average affect, as follows:

$$\begin{aligned} \beta_{0j} &= \beta_0 + \beta_2\big(\mu_{1j} - \mu_1\big) + \beta_3\big(\text{PAINACCEPT}_j - \mu_2\big) + b_{0j} \\ \beta_{0j} &= \beta_0 + \beta_2\big(\mu_{1j} - \mu_1\big) + \beta_3\big(X_{2j} - \mu_2\big) + b_{0j} \\ \beta_{1j} &= \beta_1 + b_{1j} \end{aligned} \tag{8.25}$$

Centering the regressors at their grand means defines β0 as the grand mean (the mean
of the individual means), and b0j is a between-person residual that captures unexplained
variation in average positive affect. The random slope equation says that an individual’s
coefficient is a function of the grand mean slope plus a person-specific deviation. The
regressors do not predict individual slopes, because that would change their status to
moderator variables (see Section 8.4). By assumption, b0j and b1j are bivariate normal
with a between-cluster covariance matrix Sb.
Finally, substituting the right sides of the β0j and β1j expressions into the within-
cluster model gives the following reduced-form equation:

$$\begin{aligned} \text{POSAFFECT}_{ij} &= \beta_{0j} + \beta_{1j}\big(\text{PAIN}_{ij} - \mu_{1j}\big) + \beta_2\big(\mu_{1j} - \mu_1\big) + \beta_3\big(\text{PAINACCEPT}_j - \mu_2\big) + \varepsilon_{ij} \\ Y_{ij} &= \beta_{0j} + \beta_{1j}\big(X_{1ij} - \mu_{1j}\big) + \beta_2\big(\mu_{1j} - \mu_1\big) + \beta_3\big(X_{2j} - \mu_2\big) + \varepsilon_{ij} = E\big(Y_{ij} \mid X_{1ij}, X_{2j}\big) + \varepsilon_{ij} \\ Y_{ij} &\sim N_1\big(E(Y_{ij} \mid X_{1ij}, X_{2j}),\, \sigma^2_\varepsilon\big) \end{aligned} \tag{8.26}$$

The expected value is a predicted score from a cluster-specific regression line, and the
bottom expression says that dependent variable scores are normally distributed around
these points. This normal curve is also the posterior predictive distribution of the miss-
ing positive affect scores.
[Figure 8.4: Positive Affect (y-axis) by Daily Pain Rating, Centered (x-axis).]

FIGURE 8.4. The dashed lines depict the within-cluster regression lines for 25 individuals,
and the solid line is the average trajectory.

A good deal of recent research has focused on missing data handling for random
coefficient models (Enders et al., 2020; Enders, Hayes, & Du, 2018; Enders, Keller, et
al., 2018; Erler et al., 2016, 2019; Grund et al., 2016a; Grund, Lüdke, & Robitzsch, 2018;
Kunkle & Kaizer, 2017; Lüdtke, Robitzsch, & Grund, 2017). These models are challeng-
ing, because they feature the product of a level-1 predictor variable and a level-2 latent
variable (e.g., the product of β1j and X1ij). Not all missing data-­handling procedures
appropriately accommodate this nonlinearity when the regressor is incomplete (e.g.,
current maximum likelihood estimators are prone to substantial biases), but pairing
factored regression models with Bayesian estimation or model-based multiple imputa-
tion provides a straightforward and familiar solution.

Factored Regression Specification


As explained previously, a partially factored regression model is ideally suited for analy-
ses with centered variables, because MCMC estimates the grand means and latent group
means at every iteration. The generic factorization for the positive affect analysis is as
follows:

$$f(\text{POSAFFECT} \mid \text{PAIN}, \text{PAINACCEPT}) \times f(\text{PAIN}, \text{PAINACCEPT}) \tag{8.27}$$

The composition of the focal model has no impact on the supporting regressor models,
so the specification for the second term follows the random intercept analysis (see Equa-
tions 8.5 and 8.6).

Distribution of Missing Values


As you might anticipate, the focal model alone defines the posterior predictive dis-
tribution of missing outcome scores. Returning to Equation 8.26, the bottom row of
the expression says that dependent variable scores are normally distributed around
predicted values that incorporate person-­specific random intercepts and slopes. This
normal distribution also generates imputations. To illustrate, Figure 8.5 shows the
distribution of missing outcome scores for three persons with different associations
between daily pain and positive affect. The solid black circles are predicted values, and
the spread of the normal curves reflects within-­person residual variance. The candidate
imputations fall directly on the vertical lines, but I added horizontal jitter to emphasize
that more scores are located near the center of each distribution. Conceptually, the MCMC
algorithm generates an imputation by randomly selecting a value from the cluster of
candidate scores (technically, the imputations can fall anywhere in the normal distri-
bution).
Turning to the regressors, the MCMC algorithm draws imputations from the con-
ditional distribution of an incomplete predictor given all other analysis variables. To
illustrate, consider X1 (e.g., daily pain ratings). Following Equation 8.13, this distribu-
tion is proportional to the product of two univariate distributions, each of which aligns
with one of the previous regression models. It is again instructive to look at the analytic
distribution of the missing values to draw connections to single-­level regression models

[Figure 8.5: Positive Affect (y-axis) by Daily Pain Rating, Centered (x-axis).]

FIGURE 8.5. Distribution of missing outcome scores for three persons with different associations between daily pain and positive affect. The solid black circles are predicted values, and the spread of the normal curves reflects within-person residual variance. The candidate imputations fall directly on the vertical lines, but horizontal jitter is added to emphasize that more scores are located near the center of each distribution.

with interactive effects (as mentioned previously, a random coefficient model features
the product of a level-1 predictor and level-2 latent variable). Dropping unnecessary scal-
ing terms and substituting the appropriate kernels into the right side of the factorization
give the following expression:


$$\begin{aligned} f\big(Y_{ij} \mid X_{1ij}, X_{2j}\big) \times f\big(X_{1ij} \mid X_{2j}\big) \propto{}& \exp\!\left(-\frac{1}{2}\,\frac{\big(Y_{ij} - (\beta_{0j} + \beta_{1j} X_{1ij} + \beta_2 X_{2j})\big)^2}{\sigma^2_\varepsilon}\right) \\ &\times \exp\!\left(-\frac{1}{2}\,\frac{\big(X_{1ij} - \mu_{1j}\big)^2}{\sigma^2_{r_{1(w)}}}\right) \end{aligned} \tag{8.28}$$

Multiplying the two normal curve functions and performing algebra that combines the
component functions into a single distribution for X1 gives a normal distribution with

two-part mean and variance expressions that depend on the focal and regressor model
parameters.

$$f\big(X_{1ij(\text{mis})} \mid Y_{ij}, X_{2j}\big) = N_1\big(E(X_{1ij} \mid Y_{ij}, X_{2j}),\, \text{var}(X_{1ij} \mid Y_{ij}, X_{2j})\big) \tag{8.29}$$

$$E\big(X_{1ij} \mid Y_{ij}, X_{2j}\big) = \text{var}\big(X_{1ij} \mid Y_{ij}, X_{2j}\big) \times \left(\frac{\mu_{1j}}{\sigma^2_{r_{1(w)}}} + \frac{\beta_{1j}\big(Y_{ij} - (\beta_{0j} + \beta_2 X_{2j})\big)}{\sigma^2_\varepsilon}\right)$$

$$\text{var}\big(X_{1ij} \mid Y_{ij}, X_{2j}\big) = \left(\frac{1}{\sigma^2_{r_{1(w)}}} + \frac{\beta_{1j}^2}{\sigma^2_\varepsilon}\right)^{-1}$$

The distribution’s structure is virtually identical to the one for moderated regression
models back in Section 5.4 (e.g., see Equation 5.22), but a random coefficient replaces
a simple slope in the expression. Looking at the variance of the imputations, the ran-
dom slope introduces heteroscedasticity, such that the distribution’s spread depends on
cluster j’s coefficient. This result highlights that incomplete random slope predictors
induce differences in spread that are incompatible with a multivariate normal distribu-
tion (i.e., Equation 8.29 is a mixture of normal distributions that differ with respect to
their spread). Maximum likelihood and multiple imputation approaches that assume
multivariate normality (e.g., fully conditional specification) do a poor job of approxi-
mating this heteroscedasticity and are prone to substantial biases (Enders et al., 2020;
Enders, Hayes, et al., 2018; Enders, Keller, et al., 2018; Grund et al., 2016a).
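The heteroscedasticity is easy to see by evaluating the variance expression in Equation 8.29 for clusters with different slopes. The variance components below are hypothetical.

```python
import numpy as np

# Hypothetical variance components (illustration only)
sig2_r1w = 1.50    # within-person variance of the pain ratings
sig2_eps = 0.36    # focal-model residual variance

def imp_var(beta1j):
    # Equation 8.29: the imputation variance depends on cluster j's slope
    return 1.0 / (1.0 / sig2_r1w + beta1j**2 / sig2_eps)

# Three persons with different pain-affect slopes
slopes = np.array([-0.30, -0.10, 0.05])
print([round(imp_var(b), 3) for b in slopes])
```

Steeper slopes (in absolute value) shrink the imputation variance, because the outcome then carries more information about the missing predictor; a single common variance, as multivariate normal approaches assume, cannot capture this.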

Analysis Example
Expanding on the health psychology analysis, I used Bayesian estimation to fit a ran-
dom coefficient regression model that features daily pain and sleep quality ratings as
within-­person predictors of daily positive affect and individual-­level average pain, pain
acceptance, and gender as predictors of average mood scores.

$$\begin{aligned} \text{POSAFFECT}_{ij} = \beta_{0j} &+ \beta_{1j}\big(\text{PAIN}_{ij} - \mu_{1j}\big) + \beta_2\big(\text{SLEEP}_{ij} - \mu_2\big) + \beta_3\big(\mu_{1j} - \mu_1\big) \\ &+ \beta_4\big(\text{PAINACCEPT}_j - \mu_2\big) + \beta_5\big(\text{FEMALE}_j\big) + \varepsilon_{ij} \end{aligned} \tag{8.30}$$
I used (latent) group mean centering to partition pain ratings into orthogonal within-
and between-­person components, and I centered the level-1 sleep scores at the grand
mean. The factored regression specification follows Equation 8.27 but features two addi-
tional variables in each term.

$$f\big(\text{POSAFFECT} \mid \text{PAIN}, \text{SLEEP}, \text{PAINACCEPT}, \text{FEMALE}\big) \times f\big(\text{PAIN}, \text{SLEEP}, \text{PAINACCEPT}, \text{FEMALE}^{*}\big) \tag{8.31}$$

The composition of the supporting predictor models mimics earlier material.


The potential scale reduction factors (Gelman & Rubin, 1992) from a preliminary
diagnostic run indicated that the MCMC algorithm converged in fewer than 4,000 iterations, so I used 15,000 total iterations with a burn-in period of 5,000 iterations. The
same analysis that generates Bayesian summaries of the model parameters can also gen-
erate model-based multiple imputations for a frequentist analysis. To illustrate, I created
M = 100 imputations by saving the filled-­in data and the latent group means from the
final iteration of 100 parallel MCMC chains, each with 5,000 iterations. As explained
previously, autocorrelated imputations are not a concern with this approach.
After creating the multiple imputations, I centered the variables at the estimated
latent group means and used restricted maximum likelihood to fit the random coef-
ficient model to each data set. The Barnard and Rubin (1999) small-­sample degrees of
freedom adjustment for the t-statistic requires the complete-­data degrees of freedom
value as an input (see Equations 7.21 and 7.22). Following the HLM software package
(Raudenbush et al., 2019), I again used the number of level-2 units minus the number of
predictors minus 1 as the degrees of freedom for all coefficients (i.e., dfcom = 132 – 5 – 1).
Analysis scripts are on the companion website.
Table 8.3 gives the posterior summaries from the Bayesian analysis, and Table 8.4
summarizes the multiple imputation point estimates and standard errors. Not surpris-
ingly, the two sets of results are numerically equivalent, albeit with different philosophi-
cal baggage. Unlike the previous example, specifying different prior distributions for the
between-­cluster covariance matrix had virtually no impact on the Bayesian summaries,
because the number of observations at each level was sufficiently large to nullify the
prior’s influence. Latent group mean centering yields pain coefficients that access different sources of variation in the mood scores; β1 quantifies the “pure” within-­person slope, whereas β3 represents the between-­cluster regression of average positive affect on
average pain. Both coefficients are negative, suggesting that increases in daily or chronic
pain were associated with less positive affect. The difference between the two coef-
ficients is interesting in this context, because it clarifies whether daily or chronic pain
has a greater influence (Longford, 1989; Lüdtke et al., 2008; Raudenbush & Bryk, 2002).

TABLE 8.3. Posterior Summaries from a Random Coefficient Model


Parameter Mdn SD LCL UCL
β0 4.21 0.11 4.00 4.44
β1 (DAILY PAIN) –0.10 0.02 –0.13 –0.06
β2 (SLEEP) 0.09 0.01 0.07 0.10
β3 (MEAN PAIN) –0.09 0.06 –0.22 0.03
β4 (PAINACCEPT) 0.02 0.07 –0.12 0.17
β5 (FEMALE) –0.07 0.14 –0.34 0.22
Intercept var. (σb20) 0.61 0.08 0.48 0.80
Covariance (σb20b1) 0.01 0.02 –0.02 0.04
Slope var. (σb21) 0.02 0.01 0.01 0.03
Residual var. (σε2) 0.36 0.01 0.34 0.38
β3 – β1 (Contextual)   0.002 0.07 –0.13 0.13

Note. LCL, lower credible limit; UCL, upper credible limit.



TABLE 8.4. Model-Based Imputation Estimates from a Random Coefficient Analysis

Parameter Est. SE t df p FMI
β0 4.20 0.10 40.29 123.35 < .001 .02
β1 (DAILY PAIN) –0.10 0.02 –5.00 108.22 < .001 .13
β2 (SLEEP) 0.09 0.01 11.02 86.22 < .001 .28
β3 (MEAN PAIN) –0.09 0.07 –1.43 115.43 .16 .08
β4 (PAINACCEPT) 0.02 0.07 0.26 122.11 .80 .03
β5 (FEMALE) –0.05 0.14 –0.39 123.47 .70 .02
Intercept var. (σb20) 0.59 — — — — —
Covariance (σb20b1) 0.01 — — — — —
Slope var. (σb21) 0.02 — — — — —
Residual var. (σε2) 0.36 — — — — —
β3 – β1 (Contextual)   0.003 0.07   0.002 123.01 .97 .08

Note. FMI, fraction of missing information.

The near-zero coefficient difference suggests that daily fluctuations in pain and chronic
pain exert the same influence on positive affect.

8.4 MULTILEVEL INTERACTION EFFECTS

Changing the substantive scenery once again, I use the employee data on the companion
website to illustrate Bayesian missing data handling for a multilevel regression model
with random coefficients and a cross-level interaction. The data include several work-­
related variables (e.g., work satisfaction, turnover intention, employee–­supervisor rela-
tionship quality) for a sample of N = 630 employees. I’ve used this data set for earlier
examples, but I’ve thus far ignored the fact that nj = 6 employees are nested within J =
105 workgroups or teams. The data’s structure is now like the random intercept analysis,
where persons are level-1 units and organizations (workgroups) are level-2 units. The
dependent variable, employee empowerment, has an intraclass correlation equal to .11,
meaning that team-level mean differences in average empowerment scores account for
roughly 11% of the total variation. The Appendix gives a description of the data set and
the variable definitions.
A cross-level interaction is one where the influence of a level-1 explanatory variable
is moderated by a level-2 regressor. This effect is usually (but not necessarily) accompa-
nied by random coefficients, as the interaction can be viewed as explaining slope het-
erogeneity. The analysis for this example features leader–­member exchange (employee–­
supervisor relationship quality) as a within-­team predictor of employee empowerment,
the effect of which is moderated by team-level leadership climate. The model also
includes a gender dummy code (0 = female, 1 = male) and group-level cohesion ratings
as level-1 and level-2 covariates, respectively.
Multilevel Missing Data 321

EMPOWERij = β0j + β1j(LMXij − μ1j) + β2(MALEij − μ2) + β3(COHESIONj − μ3)
            + β4(CLIMATEj − μ4) + β5(LMXij − μ1j)(CLIMATEj − μ4) + εij          (8.32)

EMPOWERij ~ N1( E(EMPOWERij | LMXij, MALEij, COHESIONj, CLIMATEj), σ²ε )
The j subscript on the intercept and leader–­ member exchange slope conveys that
each team has a unique mean and bivariate association. Note that I centered leader–
member exchange scores at the workgroup means to isolate within-team variation in the
regressor (Enders & Tofighi, 2007; Kreft et al., 1995). As explained previously, these
group-level quantities are normally distributed latent variables rather than determinis-
tic arithmetic averages (Hamaker & Muthén, 2020; Lüdtke et al., 2008). The partially
factored specification is ideally suited for this analysis, because the grand means and
latent group means are iteratively estimated model parameters (Enders & Keller, 2019).
A sequential specification does not easily accommodate centering.
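As a rough illustration of group-mean centering, the sketch below subtracts observed arithmetic workgroup means. Keep in mind that the analysis described above instead centers at latent group means that MCMC estimates iteratively, so this is a simplification of the idea rather than the chapter's actual procedure:

```python
import numpy as np

def group_mean_center(x, groups):
    """Center level-1 scores at their observed group means (a stand-in for
    the latent group means the Bayesian model estimates)."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    centered = np.empty_like(x)
    for g in np.unique(groups):
        mask = groups == g
        centered[mask] = x[mask] - x[mask].mean()
    return centered

lmx = [3.0, 5.0, 4.0, 6.0]   # two hypothetical teams of two employees
team = [1, 1, 2, 2]
print(group_mean_center(lmx, team))  # [-1.  1. -1.  1.]
```

The centered scores carry only within-team variation; the subtracted group means carry the between-team variation into the level-2 model.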

Factored Regression Specification


The partially factored specification expresses the joint distribution of the analysis vari-
ables as the product of a univariate distribution (the normal curve induced by the focal
model; see Equation 8.32) and a multivariate normal distribution for the predictors
(continuous or latent response variables). The factorization for this example is as fol-
lows:

f(EMPOWER, LMX, MALE, COHESION, CLIMATE, LMX × CLIMATE) =
    f(EMPOWER | LMX, MALE, COHESION, CLIMATE, LMX × CLIMATE)          (8.33)
        × f(LMX, MALE*, COHESION, CLIMATE)
Expanding on earlier ideas, level-1 predictors are decomposed into within- and
between-­ cluster components. The following within-­ cluster model expresses level-1
scores as correlated deviations around the latent group means:

Xij(W) = [ LMXij, MALE*ij ]′ = [ μ1j, μ2j ]′ + [ r1ij(W), r2ij(W) ]′          (8.34)

Xij(W) ~ N2( [ μ1j, μ2j ]′, Σ(W) ),   Σ(W) = [ σ²1(W)  σ12(W) ; σ21(W)  1 ]
Note that the male dummy code appears as a latent response variable, and the corre-
sponding diagonal element of the within-­cluster covariance matrix Σ(W) is fixed at 1 to
establish a metric (the model also includes a single, fixed threshold).
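Conditional on the observed dummy code, MCMC draws the latent response score from a normal distribution with unit variance, truncated at the threshold. The sketch below uses simple rejection sampling with made-up mean and threshold values; it illustrates the idea rather than the algorithm's actual implementation:

```python
import random

def draw_latent_response(observed, mean, threshold=0.0):
    """Draw y* ~ N(mean, 1) truncated at the threshold: the draw must fall
    above the threshold when the observed code is 1 and at or below it when
    the code is 0. Rejection sampling works when mean is near the threshold."""
    while True:
        y_star = random.gauss(mean, 1.0)
        if (y_star > threshold) == (observed == 1):
            return y_star

random.seed(1)
draws = [draw_latent_response(1, mean=0.2) for _ in range(1000)]
print(min(draws) > 0.0)  # True: every draw respects the threshold
```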
The level-1 predictors condition on level-2 covariates via their latent group means
in the level-2 model, which now consists of four empty regression equations with cor-
related residuals in the between-­cluster covariance matrix Σ(B).

Xj(B) = [ μ1j, μ2j, COHESIONj, CLIMATEj ]′ = [ μ1, μ2, μ3, μ4 ]′ + [ r1j(B), r2j(B), r3j(B), r4j(B) ]′          (8.35)

Xj(B) ~ N4( μ, Σ(B) )

where Σ(B) is an unrestricted 4 × 4 covariance matrix with between-cluster variances σ²1(B) through σ²4(B) on the diagonal and covariances in the off-diagonal positions.

As explained previously, the within- and between-­cluster covariance matrices can also
be expressed as a set of round-robin regression equations, as this avoids difficulties esti-
mating covariance matrices with fixed elements.

Distribution of Missing Values


The distributions of missing values are effectively the same as those for a random coeffi-
cient analysis. For example, the focal model alone defines the posterior predictive distri-
bution of missing outcome scores, and MCMC draws imputations from the normal dis-
tribution in the bottom row of Equation 8.32. Following Figure 8.5, each imputation
equals a predicted score plus a random noise term, and the predicted values incorporate
cluster-­specific random intercepts and slopes. Turning to the regressors, the MCMC
algorithm draws imputations from a complex distribution that depends on two or more
sets of model parameters. For example, the distribution of missing leader–­member
exchange scores mimics the two-part composition of Equation 8.29, with a couple of
extra variables; the random coefficient and product term again introduce heteroscedas-
ticity, such that the distribution’s spread depends on the cluster-­specific β1j coefficients.
In practice, the Metropolis step can do the heavy lifting of generating imputations from
these complex distributions.

Analysis Example
Continuing with the organizational example, I used Bayesian estimation to fit the
multilevel moderated regression model from Equation 8.32. After inspecting the potential
scale reduction factors (Gelman & Rubin, 1992) from a preliminary diagnostic run, I
specified an MCMC process with 10,000 iterations following the burn-in period. Fol-
lowing earlier examples, I also created M = 100 filled-­in data sets by saving the imputa-
tions and latent group means from the final iteration of parallel MCMC chains. After
creating the multiple imputations, I centered the variables at the imputed latent group
means and used restricted maximum likelihood to fit the random coefficient model to
each data set. Finally, I used Rubin’s (1987) rules to pool the parameter estimates and
standard errors and applied the Barnard and Rubin (1999) degrees of freedom expres-
sion to the significance tests. Analysis scripts are on the companion website.
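The pooling step described above can be sketched as follows. The estimates, standard errors, and complete-data degrees of freedom passed in at the bottom are hypothetical inputs, not values from this analysis:

```python
import numpy as np

def pool_rubin(estimates, std_errors, complete_df):
    """Pool M point estimates with Rubin's (1987) rules and compute the
    Barnard & Rubin (1999) adjusted degrees of freedom."""
    q = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    m = len(q)
    qbar = q.mean()                       # pooled point estimate
    w = np.mean(se ** 2)                  # within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    t = w + (1 + 1 / m) * b               # total sampling variance
    lam = (1 + 1 / m) * b / t             # proportion of variance due to missingness
    df_large = (m - 1) / lam ** 2         # classic large-sample df
    df_obs = ((complete_df + 1) / (complete_df + 3)) * complete_df * (1 - lam)
    df = 1 / (1 / df_large + 1 / df_obs)  # Barnard-Rubin adjusted df
    return qbar, t ** 0.5, df

est, se, df = pool_rubin([0.65, 0.70, 0.66], [0.07, 0.08, 0.07], complete_df=600)
print(round(est, 3), round(se, 3))  # 0.67 0.08
```

Note that the pooled standard error always exceeds the average within-imputation standard error, because the between-imputation variance adds the uncertainty due to missing data.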
Table 8.5 gives the posterior summaries for the Bayesian analysis, and Table 8.6
summarizes the multiple imputation point estimates and standard errors. The two
analyses produced similar coefficients, but the between-­cluster covariance matrix esti-
mates were somewhat different. The Bayesian analysis was sensitive to the choice of
prior distribution for the between-­cluster covariance matrix, so I considered four dif-
ferent options: a Jeffreys prior, an improper Wishart prior that subtracts three degrees
of freedom and adds 0 to the sum of squares and cross-­products matrix (Asparouhov &
Muthén, 2010a), a more informative prior that adds three degrees of freedom and an
identity matrix to the sum of squares, and a separation strategy prior that decomposes
the covariance matrix into a pair of variances and a correlation. Recent literature recom-
mends the latter option when the number of observations per cluster is small like it is
here (Keller & Enders, 2022), so the table reflects this strategy.
Turning to the slope coefficients, lower-order terms are conditional effects that
depend on centering. For example, β1 reflects the pure within-­cluster influence of
leader–­member exchange (the focal predictor) for a workgroup with average leadership
climate (the moderator). The interaction coefficient is of particular interest, because it
captures the influence of the group-level moderator on the level-1 slope. The positive-­
valued coefficient implies the association between employee–­supervisor relationship
quality and employee empowerment gets stronger as workgroup climate improves (i.e.,
a one-unit increase in leadership climate increases β1 by about .04). The 95% credible

TABLE 8.5. Posterior Summaries from a Cross‑Level Interaction Analysis


Parameter Mdn SD LCL UCL
β0 28.65 0.22 28.22 29.08
β1 (LMX) 0.68 0.08 0.53 0.83
β2 (MALE) 0.21 0.06 0.10 0.32
β3 (COHESION) 1.73 0.33 1.09 2.38
β4 (CLIMATE) 0.18 0.17 –0.15 0.51
β5 (LMX × CLIMATE) 0.04 0.02 0.00 0.08
Intercept var. (σ²b0) 0.29 0.27 0.09 1.13
Correlation (rb0b1) 0.01 0.17 –0.36 0.34
Slope var. (σ²b1) 0.14 0.08 0.02 0.32
Residual var. (σ²ε) 12.33 0.92 10.71 14.31

Conditional effects at high and low climate (± 1 SD)


LMX at + SD 0.84 0.11 0.63 1.06
LMX at – SD 0.52 0.11 0.31 0.73

Note. LCL, lower credible limit; UCL, upper credible limit.



TABLE 8.6. Model‑Based Imputation Estimates from a Cross‑Level Interaction Analysis
Parameter Est. SE t df p FMI
β0 28.62 0.19 150.52 67.59 < .001 .30
β1 (LMX) 0.67 0.07 9.07 406.12 < .001 .21
β2 (MALE) 0.21 0.06 3.59 52.99 < .001 .43
β3 (COHESION) 1.75 0.32 5.40 443.76 < .001 .18
β4 (CLIMATE) 0.20 0.17 1.21 58.58 .23 .38
β5 (LMX × CLIMATE) 0.04 0.02 2.13 68.76 .04 .29
Intercept var. (σ²b0) 0.57 — — — — —
Covariance (σb0b1) –0.07 — — — — —
Slope var. (σ²b1) 0.16 — — — — —
Residual var. (σ²ε) 12.09 — — — — —

Conditional effects at high and low climate (± 1 SD)


LMX at + SD 0.83 0.11 7.87 91.01 < .001 —
LMX at – SD 0.51 0.11 4.84 92.18 < .001 —

Note. FMI, fraction of missing information.

interval suggests that 0 is an unlikely value for the parameter, and the frequentist sig-
nificance test in Table 8.6 similarly rejects the null hypothesis.
The bottom two rows of Tables 8.5 and 8.6 give the simple slopes for hypothetical
workgroups at plus and minus one between-­cluster standard deviation from the cli-
mate grand mean (and with b0j and b1j values equal to 0). The Bayesian analysis treats
the conditional effects as auxiliary quantities computed from the focal model param-
eters at each MCMC iteration (Keller & Enders, 2021), whereas the multiple imputation
estimates used standard linear contrasts with delta method standard errors (Grund,
Robitzsch, & Lüdtke, 2021). Figure 8.6 shows these conditional effects as dashed and
dotted lines, respectively, and the solid line is the regression line for a team with aver-
age leadership climate. As you can see in Figure 8.6, the influence of leader–­member
exchange is stronger in workgroups with above-­average leadership climate, and it is
weaker in workgroups with below-­average climate.
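The conditional slopes in the tables follow directly from the coefficients: the slope of leader–member exchange at moderator deviation c is β1 + β5c, and a linear-contrast (delta-method) standard error combines the coefficient sampling variances. The sketch below reuses the Table 8.6 point estimates, but the moderator standard deviation and the coefficient (co)variances are assumed values for illustration:

```python
import math

def conditional_slope(b1, b5, c, var_b1, var_b5, cov_b1b5):
    """Simple slope of the focal predictor at moderator deviation c, with a
    delta-method standard error for the linear combination b1 + b5*c."""
    slope = b1 + b5 * c
    se = math.sqrt(var_b1 + (c ** 2) * var_b5 + 2 * c * cov_b1b5)
    return slope, se

# Table 8.6 coefficients; SD(climate) = 4 and the (co)variances are made up
hi_slope, hi_se = conditional_slope(0.67, 0.04, 4.0, 0.005, 0.0004, 0.0)
lo_slope, lo_se = conditional_slope(0.67, 0.04, -4.0, 0.005, 0.0004, 0.0)
print(round(hi_slope, 2), round(lo_slope, 2))  # 0.83 0.51
```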

8.5 THREE‑LEVEL MODELS

Thus far I’ve considered hierarchical data structures with two levels, but Bayesian esti-
mation and model-based multiple imputation readily extend to three (or even more)
levels. Returning to the cluster-­randomized educational experiment from Section 8.2,
researchers collected seven problem-­solving assessments throughout the school year
at roughly monthly intervals. The earlier random intercept analysis used data from the
first and last occasion, and I now treat repeated measurements as an additional hier-
archy in the design. The three-level data structure features repeated measurements as

[Figure: Employee Empowerment (y-axis, 10–40) plotted against group-mean-centered Leader–Member Exchange (x-axis, –10 to 10), with three regression lines.]

FIGURE 8.6. Conditional slopes at three levels of the moderator. The dashed line is the slope
for a work group with a leadership climate score at one between-­cluster standard deviation above
the grand mean, and the dotted line is the slope for a team at one standard deviation below the
mean. The solid line is the slope for a team with average climate.

level-1 units, students as level-2 units, and schools as level-3 units (i.e., repeated mea-
surements nested in students, students nested in schools). As you might expect, the
missing data rates increased over time (e.g., baseline problem-­solving scores were com-
plete, and nearly 20% of the scores were missing by the final assessment), and compari-
son schools additionally had planned missing data at certain occasions. The data set and
the variable definitions are described in the Appendix.
A longitudinal growth curve model is a type of multilevel regression in which
repeated measurements are a function of a temporal predictor that codes the passage of
time, in this case the monthly testing occasions. To facilitate interpretation, researchers
usually code one of the measurement occasions as 0 and set the others relative to that
fixed point. One common option expresses time relative to the baseline assessment (e.g.,
MONTH = 0, 1, 2, 3, 4, 5, 6), and another reflects these “time scores” relative to the final
measurement (e.g., MONTH = –6, –5, –4, –3, –2, –1, 0). I use the latter definition for the
ensuing example, because this scaling yields an estimate of the intervention group’s
mean difference at the end of the school year. The temporal codes define a level-1 predic-
tor that targets within-­student changes in the dependent variable. However, unlike other
lower-level covariates, the time scores have no between-­person (level-2) or between-­
school (level-3) fluctuation, because they are constant across students (i.e., all students
were assessed in approximately monthly increments). This feature impacts the composi-
tion of the supporting regressor models, as there is no need to estimate this variable’s
latent group means or higher-­level variation.
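The effect of the time coding on the intercept is easy to verify with a toy trajectory. The seven scores below are fabricated and exactly linear, so an ordinary least squares fit recovers the same line under either coding:

```python
import numpy as np

# One hypothetical student's scores at seven monthly assessments
scores = np.array([40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0])

def fit_line(time, y):
    """Least-squares intercept and slope for a single trajectory."""
    X = np.column_stack([np.ones_like(time), time])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs  # [intercept, slope]

baseline_coding = np.arange(0.0, 7.0)     # MONTH = 0, 1, ..., 6
endpoint_coding = np.arange(-6.0, 1.0)    # MONTH = -6, -5, ..., 0
print(np.round(fit_line(baseline_coding, scores), 6))  # [40.  2.] -> baseline score
print(np.round(fit_line(endpoint_coding, scores), 6))  # [52.  2.] -> final score
```

Only the intercept changes; the slope is identical under both codings, which is why the choice of time scores affects interpretation rather than model fit.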
The growth curve model features an average linear trajectory for each experimental
condition, along with individual variation around the mean intercept and slope. Starting
with the repeated measurements, the within-­person linear model for student i in school
j is

PROBSOLVEtij = β0ij + β1ij(MONTHtij) + εtij          (8.36)

εtij ~ N1( 0, σ²ε )
where PROBSOLVEtij is the student’s problem-­solving test score at measurement occasion
t, MONTH is the temporal predictor or “time variable,” β0ij is the participant’s expected
end-of-year problem-­solving score (i.e., the predicted value when MONTH = 0), and β1ij is
his or her latent monthly change rate. Finally, εtij is a time-­specific residual that captures
the distances between the repeated measurements and the individual linear trajectories.
By assumption, these residuals are normally distributed with constant variance σε2.
Building on the earlier analysis example, I use standardized math achievement
scores and free or reduced-­price lunch eligibility as student-­level covariates. The regres-
sors enter the student-­level between-­cluster (level-2) model as predictors of the indi-
vidual random intercepts. The model is as follows:

β0ij = β0j + β2(STANMATHij − μ2) + β3(FRLUNCHij − μ3) + b0ij          (8.37)

β1ij = β1j + b1ij

[ b0ij, b1ij ]′ ~ N2( 0, Σb(L2) )
Centering covariates at their grand means defines β0j as the average intercept for stu-
dents in school j (i.e., a school’s average end-of-year problem-­solving test score), and b0ij
is a student-­level residual that captures unexplained variation in the end-of-year test
scores. The slope equation says that an individual’s monthly change rate is a function of
the school average slope plus a person-­specific deviation. The regressors do not predict
individual growth rates, because that would change their status to moderator variables
that interact with the temporal predictor. By assumption, b0ij and b1ij are bivariate nor-
mal with a between-­cluster covariance matrix Σb(L2).
Following the random intercept analysis, I use teacher experience and the interven-
tion dummy code as school-­level (level-3) regressors. The model is as follows:

β0j = β0 + β4(TEACHEXPj − μ4) + β5(CONDITIONj) + b0j          (8.38)

β1j = β1 + β6(CONDITIONj) + b1j

[ b0j, b1j ]′ ~ N2( 0, Σb(L3) )

Centering average years of teacher experience at the grand mean defines β0 as the aver-
age end-of-year problem-­solving test score for control schools (i.e., CONDITION = 0),
and b0j is a school-­level residual that captures unexplained variation in the intercepts.
In the slope equation, β1 is the average monthly growth rate for control schools, β6 is the
growth rate difference for intervention schools (i.e., the group-by-time interaction), and
b1j is a school-­level random slope residual. As before, the random intercepts and slopes
are bivariate normal by assumption with a variance–­covariance matrix Σb(L3).
Finally, substituting the right sides of the higher-­level equations into the lower-
level equations gives a reduced form expression that features a cross-level interaction
between a level-3 regressor (treatment assignment) and the level-1 temporal predictor.

PROBSOLVEtij = (β0 + b0ij + b0j) + (β1 + b1ij + b1j)(MONTHtij) + β2(STANMATHij − μ2)
               + β3(FRLUNCHij − μ3) + β4(TEACHEXPj − μ4) + β5(CONDITIONj)
               + β6(MONTHtij)(CONDITIONj) + εtij

PROBSOLVEtij ~ N1( E(PROBSOLVEtij | ·), σ²ε )          (8.39)

Consistent with a single-­level moderated regression model, β0 and β1 are conditional


effects that convey the average end-of-year performance and monthly change rate for
control group schools, and β5 and β6 give the mean difference at the final wave and
growth rate difference for intervention schools (i.e., the group-by-time interaction
effect).

Factored Regression Specification


The partially factored specification expresses the joint distribution of the analysis vari-
ables as the product of a univariate distribution (the normal curve induced by the focal
model; see Equation 8.39) and a multivariate normal distribution for the predictors (con-
tinuous or latent response variables). The factorization for this example is as follows:

f(PROBSOLVE | MONTH, STANMATH, FRLUNCH, TEACHEXP, CONDITION)          (8.40)
    × f(MONTH, STANMATH, FRLUNCH*, TEACHEXP, CONDITION*)
Expanding on earlier ideas, this specification decomposes level-1 predictors into a
within-­cluster deviation involving a score and a level-2 latent group mean, a between-­
cluster deviation involving a level-2 group mean and a level-3 latent group mean, and
a between-­cluster deviation between a level-3 latent group mean and the grand mean.

X1tij = μ1 + (μ1j − μ1) + (μ1ij − μ1j) + (X1tij − μ1ij)          (8.41)
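The decomposition in Equation 8.41 is an algebraic identity: the deviations telescope back to the original score. A quick numeric check (the means below are arbitrary stand-ins for the latent quantities the model estimates):

```python
# Made-up values for the grand mean, a school's latent mean, a student's
# latent mean, and one repeated measurement
grand_mean = 50.0    # mu_1
school_mean = 53.0   # mu_1j
student_mean = 55.0  # mu_1ij
score = 57.0         # X_1tij

reconstructed = (grand_mean
                 + (school_mean - grand_mean)     # level-3 deviation
                 + (student_mean - school_mean)   # level-2 deviation
                 + (score - student_mean))        # within-student deviation
print(reconstructed == score)  # True: the components sum back to the score
```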

The temporal predictor is unique, because it contains only within-­student variation (i.e.,
the time scores are constant across students, so there is no fluctuation in the average
time scores). The within-­cluster regression model thus expresses time scores as devia-
tions around the grand mean, as follows:

MONTHtij = μ1 + r1tij(W)          (8.42)

MONTHtij ~ N1( μ1, σ²r1(W) )
In fact, there is no need to estimate this model at all, because the time scores are com-
plete and function as known constants.
The student-­level between-­cluster (level-2) model expresses level-2 scores as cor-
related deviations around their level-3 latent group means, as follows:

Xij(L2) = [ STANMATHij, FRLUNCH*ij ]′ = [ μ2j, μ3j ]′ + [ r2ij(L2), r3ij(L2) ]′          (8.43)

Xij(L2) ~ N2( μj, Σ(L2) )
Notice that the lunch assistance indicator appears as a latent response variable that
represents a continuous proclivity to receive free or reduced-­price lunch. As always,
the corresponding diagonal element of the level-2 covariance matrix is fixed at 1, and
the model requires a fixed threshold parameter. More generally, the factorization would
include the latent group means of any level-1 regressors with higher-­level variation.
The level-2 predictors condition on level-3 covariates via their latent group means
in the school-level between-cluster (level-3) model, which now consists of four empty
regression equations with correlated residuals.

Xj(L3) = [ μ2j, μ3j, TEACHEXPj, CONDITION*j ]′ = [ μ2, μ3, μ4, μ5 ]′ + [ r2j(L3), r3j(L3), r4j(L3), r5j(L3) ]′          (8.44)

Xj(L3) ~ N4( μ, Σ(L3) )
Notice that the treatment assignment indicator is also modeled as a latent response vari-
able, and the corresponding diagonal element of the level-3 covariance matrix is fixed
at 1 to establish a metric. Alternatively, this variable can be treated as a fixed constant,
because it is complete and does not require a distribution.

Analysis Example
Continuing with the education example, I used Bayesian estimation to fit the three-level
growth model from Equation 8.39. After inspecting the potential scale reduction factor
diagnostic (Gelman & Rubin, 1992), I specified an MCMC process with 10,000 burn-in
iterations and 20,000 total iterations. I also created M = 100 filled-­in data sets by saving
the imputations from the final iteration of parallel MCMC chains with 10,000 iterations
each. After creating multiple imputations, I centered the variables and used restricted
maximum likelihood to fit the random coefficient model to each data set. Finally, I used
Rubin’s (1987) rules to pool the parameter estimates and standard errors and applied
the Barnard and Rubin (1999) degrees of freedom expression for the significance tests.
Analysis scripts are on the companion website.
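The potential scale reduction factor compares between-chain and within-chain variability; values near 1.0 suggest the parallel chains have converged to a common distribution. A sketch of the standard computation, applied to simulated draws rather than real MCMC output:

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman & Rubin, 1992) for a single
    parameter; chains is a (num_chains x num_iterations) array of draws."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    b = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * w + b / n        # pooled posterior variance estimate
    return float(np.sqrt(var_plus / w))

rng = np.random.default_rng(0)
well_mixed = rng.normal(0.0, 1.0, size=(3, 2000))  # three chains, same target
print(psrf(well_mixed))  # close to 1.0, consistent with convergence
```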
Table 8.7 gives the posterior summaries for the Bayesian analysis, and Table 8.8
summarizes the multiple imputation point estimates and standard errors. Consistent
with earlier examples, the two sets of results were numerically similar. I examined the
consistency of the Bayesian results across three prior distributions for the between-­
cluster covariance matrices. Not surprisingly, the intercept and slope variances were
sensitive to this specification, but their variance explained effect sizes (Rights & Sterba,
2019) were relatively stable. The table reports results from an improper prior that sub-
tracts degrees of freedom, as this is the default in some popular software packages.
Turning to the slope coefficients, the lower-order terms are conditional effects that
depend on centering. For example, β0 is the end-of-year problem-­solving average for
control schools (marginalizing over the student- and school-­level covariates), and β1
reflects the average monthly change rate for these schools. The positive-­valued β5 coeffi-
cient indicates that intervention schools finished the year 1.75 points higher, on average,
and the group-by-time interaction slope shows that these schools improved by 0.31 more
per month than the comparison group. The 95% credible intervals suggest that 0 is an
unlikely value for the group difference parameters, and the frequentist significance tests
in Table 8.8 similarly reject the null hypothesis. To further illustrate the group-by-time
interaction, Figure 8.7 shows the average linear growth trajectory for control schools as
a dashed line, and the solid line is the average growth curve for intervention schools.
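The plotted trajectories can be reconstructed directly from the reduced-form fixed effects in Equation 8.39, with covariates held at their means and random effects set to 0. The default coefficients below are the pooled estimates from Table 8.8:

```python
def predicted_probsolve(month, condition, b0=52.72, b1=0.45, b5=1.75, b6=0.31):
    """Model-implied average problem-solving score from the reduced-form
    growth model, with covariates at their means and random effects at 0."""
    return b0 + b5 * condition + (b1 + b6 * condition) * month

# End-of-year (MONTH = 0) group difference and the control-group baseline
print(round(predicted_probsolve(0, 1) - predicted_probsolve(0, 0), 2))  # 1.75
print(round(predicted_probsolve(-6, 0), 2))  # 50.02
```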

TABLE 8.7. Posterior Summary from a Three‑Level Regression Analysis
Parameter Mdn SD LCL UCL
β0 52.73 0.72 51.29 54.07
β1 (MONTH) 0.45 0.10 0.25 0.63
β2 (STANMATH) 0.24 0.01 0.22 0.26
β3 (FRLUNCH) 0.06 0.24 –0.40 0.54
β4 (TEACHEXP) –0.04 0.06 –0.15 0.09
β5 (CONDITION) 1.68 0.91 –0.13 3.41
β6 (COND. × MONTH) 0.30 0.13 0.05 0.54
Intercept var. (student) 6.30 0.61 5.19 7.57
Slope var. (student) 0.06 0.03 0.02 0.11
Intercept var. (school) 5.18 1.99 2.78 10.32
Slope var. (school) 0.10 0.04 0.05 0.20
Residual var. (σ²ε) 12.55 0.27 12.03 13.09

Note. LCL, lower credible limit; UCL, upper credible limit.



TABLE 8.8. Model‑Based Imputation Estimates from a Three‑Level Regression Analysis
Parameter Est. SE t df p FMI
β0 52.72 0.60 87.91   22.66 < .001 .10
β1 (MONTH) 0.45 0.09 5.12 5308.27 < .001 .06
β2 (STANMATH) 0.24 0.01 25.26 837.86 < .001 .06
β3 (FRLUNCH) 0.09 0.24 0.36 711.10 .72 .13
β4 (TEACHEXP) –0.03 0.05 –0.58   19.32 .57 .24
β5 (CONDITION) 1.75 0.80 2.18   22.82 .04 .09
β6 (COND. × MONTH) 0.31 0.12 2.64   22.15 .02 .12
Intercept var. (student) 6.17 — — — — —
Slope var. (student) 0.05 — — — — —
Intercept var. (school) 4.08 — — — — —
Slope var. (school) 0.07 — — — — —
Residual var. (σ²ε) 12.57 — — — — —

Note. FMI, fraction of missing information.


[Figure: Math Problem-Solving (y-axis, 40–60) plotted against Months Until Final Measurement Occasion (x-axis, –6 to 0), with two growth trajectories.]

FIGURE 8.7. The dashed line is the average linear growth trajectory for comparison or control
schools, and the solid line is the average growth curve for intervention schools.

8.6 MULTIPLE IMPUTATION

Chapter 7 classified multiple imputation procedures into two buckets according to the
degree of similarity between the imputation and analysis models: Agnostic imputation
procedures deploy a model that differs from the focal analysis, whereas model-based
imputation invokes the same model as the secondary analysis. The previous examples
highlight that model-based multiple imputation goes hand in hand with a Bayesian anal-
ysis that tailors the filled-­in data to a particular multilevel model. Adopting a tailored
approach is vital for models with random coefficients and interaction effects, whereas a
variety of methods are appropriate for random intercept models.
Considering the agnostic imputation procedures from Chapter 7, single-­level joint
modeling and fully conditional specification approaches are known to introduce sub-
stantial biases when applied to multilevel data sets, because they produce filled-­in data
with no between-­cluster variation (Andridge, 2011; Black, Harel, & McCoach, 2011;
Lüdtke et al., 2017; Mistler & Enders, 2017; Reiter, Raghunathan, & Kinney, 2006;
Taljaard, Donner, & Klar, 2008; van Buuren, 2011). However, both procedures readily
extend to multilevel analyses and are widely available in software packages (Asparouhov
& Muthén, 2010c; Carpenter et al., 2011; Carpenter & Kenward, 2013; Enders, Keller, et
al., 2018; Goldstein et al., 2009, 2014; van Buuren, 2011; Yucel, 2008, 2011). Consistent
with their single-­level counterparts, the joint modeling framework uses a multivariate
regression model as an imputation model, and fully conditional specification uses a
sequence of univariate multilevel models.

Fixed Effect Imputation


A method known as fixed effect imputation warrants a brief discussion, because there
are situations where it may be a useful alternative to a multilevel imputation scheme.
Whereas multilevel imputation uses random effects or latent variables to preserve
between-­cluster variation, fixed effect imputation dummy codes the level-2 groups and
includes the code variables in a single-­level imputation scheme. This strategy may be
useful when (1) the number of clusters is very small (a situation that makes estimation
difficult), (2) the level-2 groups cannot be viewed as samples from a larger population, or
(3) between-­cluster differences are viewed as nuisance variation rather than a substan-
tive phenomenon.
To illustrate fixed effect imputation, reconsider the random intercept analysis from
Equation 8.3. The fixed effect imputation model for the dependent variable is
Yij = γ1D1 + γ2D2 + ⋯ + γJDJ + γJ+1(X1ij) + εij = E(Yij | Xij) + εij          (8.45)

Yij ~ N1( E(Yij | Xij), σ²ε )

where Dj is one of J code variables that equals 1 if participant i belongs to group j and
0 otherwise. Note that I use an absolute coding scheme that omits the usual regression
intercept and includes the entire set of J code variables. This specification defines each γj
as a group-­specific mean or intercept. Importantly, I exclude level-2 predictors, because
the code variables explain all between-­cluster variation in the outcome (McNeish &
Kelley, 2019).
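The absolute coding scheme can be sketched as a small design-matrix builder: with J clusters, the matrix has one dummy column per cluster (no intercept column) plus the level-1 covariate. The tiny data set at the bottom is hypothetical:

```python
import numpy as np

def fixed_effect_design(groups, x1):
    """Design matrix for the fixed effect imputation model: one absolute
    dummy code per cluster (so each gamma_j becomes a cluster-specific
    intercept) plus the level-1 covariate in the final column."""
    groups = np.asarray(groups)
    labels = np.unique(groups)
    dummies = (groups[:, None] == labels[None, :]).astype(float)
    return np.column_stack([dummies, np.asarray(x1, dtype=float)])

groups = [1, 1, 2, 2, 3, 3]          # six cases in three clusters
x1 = [0.5, 1.5, 0.5, 1.5, 0.5, 1.5]  # a level-1 covariate
X = fixed_effect_design(groups, x1)
print(X.shape)   # (6, 4): three cluster dummies plus the covariate
print(X[0])      # [1.  0.  0.  0.5]
```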
Fixed effect imputation is computationally simple and capable of producing accu-
rate parameter estimates in certain situations (Lüdtke et al., 2017; Reiter et al., 2006). It
also has noteworthy limitations. Methodologists have pointed out that dummy coding
appears to overcompensate for group mean differences (Graham, 2012, p. 136), and
analytic work confirms the procedure can exaggerate between-­group variation (Lüdtke
et al., 2017). This positive bias gets worse as either the intraclass correlation or within-­
cluster sample size decreases. On the inferential side, other studies have shown that
fixed effect imputation can produce positively biased standard errors and inaccurate
confidence intervals (Andridge, 2011; van Buuren, 2011). Bias issues aside, the proce-
dure is practically limited to random intercept analyses (Enders, Mistler, & Keller, 2016),
as preserving random slope variation requires a large set of product terms between the
dummy codes and a level-1 variable.

8.7 JOINT MODEL IMPUTATION

Joe Schafer extended the popular joint model imputation framework to multilevel data
structures (Schafer, 2001; Schafer & Yucel, 2002), and a number of flexible variations
of his approach have since appeared in the literature (Asparouhov & Muthén, 2010c;
Carpenter et al., 2011; Carpenter & Kenward, 2013; Goldstein et al., 2009; Goldstein
et al., 2014; Yucel, 2008, 2011). I describe a version that uses an empty multivariate
regression model where all variables are outcomes regardless of their role in the analy-
sis (Asparouhov & Muthén, 2010c; Carpenter & Kenward, 2013). The model allows
missing data at either level of the data hierarchy, and it readily accommodates a latent
variable formulation for incomplete categorical variables. Importantly, the joint model is
limited to random intercept analyses and has no capacity for preserving random associa-
tions between pairs of incomplete variables. Later in the section, I describe an extension
that uses cluster-­specific covariance matrices to preserve these relations (Quartagno &
Carpenter, 2016; Yucel, 2011).
To illustrate the joint model imputation scheme more concretely, reconsider the
two-level random intercept analysis model from Equation 8.18. Consistent with its
single-­level counterpart, the multilevel joint model invokes a multivariate normal dis-
tribution for continuous and latent response variables. The normal distribution is now
more complex and involves within- and between-­cluster variation and covariation (e.g.,
variation among employees who belong to the same workgroup, and variation among
the workgroups). Following ideas presented earlier in the chapter, each level-1 variable
decomposes into the sum of a grand mean and within- and between-­cluster residu-
als (see Equation 8.8). The within-­cluster model expresses level-1 scores as correlated
deviations around their latent group means, as follows:

Yij(W) = [ PROBSOLVEij, PRETESTij, STANMATHij, FRLUNCH*ij ]′ = [ Y1ij, Y2ij, Y3ij, Y*4ij ]′
       = [ μ1j, μ2j, μ3j, μ4j ]′ + [ r1ij(W), r2ij(W), r3ij(W), r4ij(W) ]′          (8.46)

Yij(W) ~ N4( μj, Σ(W) )
Following procedures from Chapter 6, the lunch assistance indicator appears as a latent
response variable (e.g., a student’s underlying proclivity for receiving free or reduced-­
price lunch), and the corresponding diagonal element of the within-­cluster covariance
matrix is fixed at 1 to establish a metric. In words, the bottom row of the equation
says that level-1 scores are normally distributed around their latent group means. The
within-­cluster normal distribution is also the posterior predictive distribution of the
level-1 missing values.
Importantly, the within-­cluster model presumes that associations among level-1
variables are the same across all level-2 units (e.g., all schools share the same variance–­
covariance matrix). This effectively limits joint model imputation to random intercept
analyses, because it has no capacity for preserving random associations between pairs
of incomplete variables. Simulation studies show that applying the joint model prior to
estimating a random coefficient analysis produces substantial bias, because the filled-
­in values eradicate cluster-­specific associations from the data (e.g., slope variance esti-
mates are dramatically attenuated; Enders et al., 2016). A variant of the joint model with
random within-­cluster covariance matrices addresses this shortcoming (Yucel, 2011).
Level-1 variables correlate with level-2 variables via their latent group means in the
between-­cluster model, which now consists of six empty regression equations with cor-
related residuals.
Yj(B) = [ μ1j, μ2j, μ3j, μ4j, TEACHEXPj, CONDITION*j ]′ = [ μ1j, μ2j, μ3j, μ4j, Y5j, Y*6j ]′
      = [ μ1, μ2, μ3, μ4, μ5, μ6 ]′ + [ r1j(B), r2j(B), r3j(B), r4j(B), r5j(B), r6j(B) ]′          (8.47)

Yj(B) ~ N6( μ, Σ(B) )
The treatment assignment indicator appears as a latent response variable, and the cor-
responding diagonal element of the between-­cluster covariance matrix is fixed at 1 to
establish a metric. The between-­cluster mean structure also includes fixed threshold
parameters for the two binary variables. The between-­cluster normal distribution serves
double duty as the posterior predictive distribution of the level-2 missing values (includ-
ing the latent group means). In fact, Equations 8.46 and 8.47 are the same models that
I applied to incomplete regressors earlier in Section 8.2, but the equations now include
the dependent variable and its latent group mean. I characterize this imputation model
as agnostic, because it looks nothing like the analysis model in Equation 8.18. This
difference is not a problem, because the multivariate normal data structure does not
conflict with the analytic model.

MCMC Algorithm
The posterior distribution for joint model imputation is a complicated multivariate func-
tion that describes the relative probability of different combinations of model param-
eters, latent group means, and missing values given the observed data. The MCMC algo-
rithm applies a now-­familiar strategy: Estimate one unknown at a time (e.g., param-
eter, latent variable, missing score), holding all other quantities at their current values.
The generic MCMC recipe for parallel imputation chains is shown below, and I refer
interested readers to the literature for the exact form of each distribution (Carpenter &
­Kenward, 2013; Goldstein et al., 2009; Schafer & Yucel, 2002; Yucel, 2008):

Do for m = 1 to M imputations.
Assign starting values to all parameters, random effects, and missing values.
Do for t = 1 to T iterations.
> Estimate the grand means, given everything else.
> Estimate latent group means, given everything else.
> Estimate the between-­cluster covariance matrix, given everything else.
> Estimate the within-­cluster covariance matrix, given everything else.
> Impute missing values, given the model parameters.
Repeat.
Save the filled-­in data for later analysis.
Repeat.

The final imputation step uses the mean vector and covariance matrices to construct a
regression model and distribution of missing values for each unique missing data pat-
tern. I illustrated this process for a single-­level analysis in Section 5.9, and the same idea
applies here.
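The chain-management logic in this recipe (parallel chains, each saving its final filled-in data set) can be sketched in a few lines. The sketch below is a deliberately simplified single-level illustration, not the multilevel sampler itself: the univariate normal model, the function name, and the toy data are placeholders for exposition.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_parallel_imputation(y, n_chains=5, n_iter=200):
    """Toy parallel-chain imputation for a univariate normal model.

    Each chain alternates between (a) drawing the mean and variance given
    the current filled-in data and (b) redrawing the missing values given
    those parameters, then saves its final filled-in data set."""
    miss = np.isnan(y)
    imputed_sets = []
    for m in range(n_chains):                          # one chain per imputation
        data = y.copy()
        data[miss] = rng.choice(y[~miss], miss.sum())  # starting values
        for t in range(n_iter):
            # Parameter step: draw sigma^2 and mu given the filled-in data
            n = len(data)
            sigma2 = np.sum((data - data.mean()) ** 2) / rng.chisquare(n - 1)
            mu = rng.normal(data.mean(), np.sqrt(sigma2 / n))
            # Imputation step: draw the missing values given the parameters
            data[miss] = rng.normal(mu, np.sqrt(sigma2), miss.sum())
        imputed_sets.append(data.copy())               # save the final iteration
    return imputed_sets

y = np.array([1.2, 2.5, np.nan, 3.1, np.nan, 2.2, 1.8, 2.9])
sets = run_parallel_imputation(y)
```

The multilevel algorithm inserts the latent group mean and covariance matrix steps inside the inner loop, but the bookkeeping — assign start values, iterate, save the last data set from each chain — is identical.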

Analysis Example
Revisiting the cluster-­randomized educational intervention, I applied joint model impu-
tation to the random intercept analysis from Equation 8.18. After inspecting the poten-
tial scale reduction factors (Gelman & Rubin, 1992) from a preliminary diagnostic run,
I created M = 100 filled-­in data sets by saving the imputations from the final iteration
of 100 parallel MCMC chains with 2,000 iterations each. I used restricted maximum
TABLE 8.9. Multiple Imputation Estimates from a Random Intercept Analysis


Parameter Est. SE t df p FMI
Joint model
β0 29.31 2.05 14.32 19.36 < .001 .27
β1 (PRETEST) 0.27 0.04 6.69 483.74 < .001 .26
β2 (STANMATH) 0.19 0.02 9.30 533.40 < .001 .23
β3 (FRLUNCH) –0.43 0.44 –0.96 17.31 .35 .35
β4 (TEACHEXP) 0.10 0.10 0.98 20.78 .34 .21
β5 (CONDITION) 2.11 0.79 2.67 23.02 .01 .12
Intercept var. (σ²b0) 3.56 — — — — —
Residual var. (σ²ε) 18.56 — — — — —

Fully conditional specification


β0 28.72 1.90 15.13 22.03 < .001 .15
β1 (PRETEST) 0.28 0.04 7.50 66.59 < .001 .17
β2 (STANMATH) 0.19 0.02 9.16 41.47 < .001 .32
β3 (FRLUNCH) –0.44 0.42 –1.05 17.11 .31 .30
β4 (TEACHEXP) 0.10 0.10 0.99 20.75 .34 .19
β5 (CONDITION) 2.12 0.79 2.67 22.93 .01 .12
Intercept var. (σ²b0) 3.62 — — — — —
Residual var. (σ²ε) 18.35 — — — — —

Note. FMI, fraction of missing information.

likelihood to fit the random intercept regression model to each data set, and I applied
Rubin’s (1987) rules to pool the parameter estimates and standard errors. To refresh,
the Barnard and Rubin (1999) small-­sample degrees of freedom adjustment requires
the complete-­data degrees of freedom as an input (see Equations 7.21 and 7.22). Follow-
ing the HLM software package (Raudenbush et al., 2019), I used the number of schools
minus the number of predictors minus 1 as the degrees of freedom for all coefficients
(i.e., dfcom = 29 – 5 – 1). The top panel of Table 8.9 gives the multiple imputation point
estimates and standard errors. Perhaps not surprisingly, joint model imputation produced results that are effectively equivalent to the Bayesian analysis and model-based multiple imputation results from Section 8.2 (see Tables 8.1 and 8.2). As such, no further
discussion of the results is warranted.
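The pooling steps described here follow standard formulas, and a scalar version is easy to write down. The function below is a sketch of Rubin's rules with the Barnard and Rubin (1999) small-sample degrees of freedom (as in Equations 7.21 and 7.22); the input values in the usage line are made-up numbers for illustration, not estimates from Table 8.9.

```python
import numpy as np

def pool_rubin(estimates, variances, df_com):
    """Pool M estimates and squared standard errors via Rubin's rules,
    returning the pooled estimate, standard error, and Barnard-Rubin df."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    qbar = q.mean()                        # pooled point estimate
    W = u.mean()                           # within-imputation variance
    B = q.var(ddof=1)                      # between-imputation variance
    T = W + (1 + 1 / M) * B                # total sampling variance
    lam = (1 + 1 / M) * B / T              # proportion of variance due to missing data
    nu_large = (M - 1) / lam ** 2          # large-sample degrees of freedom
    nu_obs = ((df_com + 1) / (df_com + 3)) * df_com * (1 - lam)
    nu = 1 / (1 / nu_large + 1 / nu_obs)   # small-sample adjusted df
    return qbar, np.sqrt(T), nu

# Hypothetical estimates from M = 4 imputed data sets, with df_com = 29 - 5 - 1
est, se, df = pool_rubin([2.0, 2.2, 1.9, 2.1],
                         [0.50, 0.55, 0.48, 0.52], df_com=23)
```

Note that the adjusted degrees of freedom can never exceed the complete-data value, which is what makes the correction appropriate for analyses with a small number of clusters.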

Random Within‑Cluster Covariance Matrices


The classic joint imputation model has no capacity for preserving random slope varia-
tion, but Yucel (2011) described an extension of the method that accommodates cluster-­
specific associations among level-1 variables. Although relatively few studies have eval-
uated this approach, it holds promise for random coefficient and meta-analysis models (Enders, Hayes, et al., 2018; Quartagno & Carpenter, 2016). To illustrate Yucel's
approach more concretely, reconsider the daily diary study and the random coefficient
analysis model from Equation 8.30. The within-­cluster model again expresses level-1
scores as correlated deviations around their latent group means, as follows:

$$
\mathbf{Y}_{ij(W)} = \begin{bmatrix} \text{POSAFFECT}_{ij} \\ \text{PAIN}_{ij} \\ \text{SLEEP}_{ij} \end{bmatrix} = \begin{bmatrix} Y_{1ij} \\ Y_{2ij} \\ Y_{3ij} \end{bmatrix} = \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \end{bmatrix} + \begin{bmatrix} r_{1ij(W)} \\ r_{2ij(W)} \\ r_{3ij(W)} \end{bmatrix} \tag{8.48}
$$

$$
\mathbf{Y}_{ij(W)} \sim N_3\left(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_{j(W)}\right)
$$
The distribution is similar in composition to the one in Equation 8.46 but features
cluster-­specific covariance matrices (i.e., the j subscript on Σj(W)). The between-­cluster
part of the model has the same composition as before.
$$
\mathbf{Y}_{j(B)} = \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ \text{PAINACCEPT}_{j} \\ \text{FEMALE}^{*}_{j} \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \end{bmatrix} + \begin{bmatrix} r_{1j(B)} \\ r_{2j(B)} \\ r_{3j(B)} \\ r_{4j(B)} \\ r_{5j(B)} \end{bmatrix} \tag{8.49}
$$

$$
\mathbf{Y}_{j(B)} \sim N_5\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{(B)}\right)
$$
The Bayesian analysis for modeling group-­specific covariance matrices treats each
Σj(W) as an unknown variable that follows a Wishart distribution (a multivariate gener-
alization of a right-­skewed chi-­square distribution). As explained in Section 4.10, the
distribution’s center and spread are determined by a degrees of freedom value and sums
of squares and cross-­products matrix, respectively. The Wishart serves as a metadistri-
bution for the group-­specific covariance matrices, such that each Σj(W) varies around
an average matrix defined by a pooled degrees of freedom value and scale matrix. The
pooled degrees of freedom value is a function of the average within-­cluster sample size
plus the number of imaginary data points assigned to the prior distribution, and the
pooled scale matrix is a function of the average within-­cluster covariance matrix plus a
prior scale matrix.
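Yucel's sampler has its own priors and updating scheme, but the core idea — cluster-specific covariance matrices drawn from a common Wishart-family metadistribution — can be illustrated with numpy alone. Everything numeric below (the dimension, pooled degrees of freedom, and scale matrix) is an arbitrary illustration, and the Bartlett-decomposition sampler is just one standard way to generate the draws.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_wishart(df, scale, rng):
    """Draw one matrix from a Wishart(df, scale) distribution via the
    Bartlett decomposition (numpy only)."""
    p = scale.shape[0]
    L = np.linalg.cholesky(scale)
    A = np.zeros((p, p))
    for i in range(p):
        A[i, i] = np.sqrt(rng.chisquare(df - i))  # chi-square diagonal
        A[i, :i] = rng.normal(size=i)             # standard normals below
    LA = L @ A
    return LA @ LA.T

# Pooled quantities defining the metadistribution (illustrative values, not
# estimates from the diary data). Drawing a Wishart matrix with the inverted
# scale and then inverting the draw yields an inverse-Wishart variate, one
# common way to generate cluster-specific covariance matrices.
nu_pool = 15
S_pool = np.eye(3) * (nu_pool - 3 - 1)   # scale chosen so draws center near I

sigma_j = [np.linalg.inv(draw_wishart(nu_pool, np.linalg.inv(S_pool), rng))
           for _ in range(20)]
```

Each element of `sigma_j` is a positive definite within-cluster covariance matrix that varies randomly around the pooled scale matrix; larger pooled degrees of freedom pull the cluster matrices closer to the average.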

MCMC Algorithm
The posterior distribution for the random covariance matrix model is again a compli-
cated multivariate function describing the relative probability of different combina-
tions of model parameters, latent group means, and missing values given the observed
data. The algorithmic steps resemble those from the classic joint model, but the recipe
includes two new steps that estimate the pooled degrees of freedom and scale matrix
(the two components that define the average level-1 covariance matrix). The estimation
step for the cluster-­specific covariance matrices leverages this pooled information, such
that some variables can be completely missing within a given cluster (Quartagno &
­Carpenter, 2016). The generic MCMC recipe is shown below, and I refer interested read-
ers to the literature for the exact form of each distribution (Carpenter & Kenward, 2013;
Goldstein et al., 2009; Schafer & Yucel, 2002; Yucel, 2008):

Do for m = 1 to M imputations.
Assign starting values to all parameters, random effects, and missing values.
Do for t = 1 to T iterations.
> Estimate the grand means, given everything else.
> Estimate latent group means, given everything else.
> Estimate the between-­cluster covariance matrix, given everything else.
> Estimate the pooled scale matrix, given everything else.
> Estimate the pooled degrees of freedom, given everything else.
> Estimate cluster-­specific covariance matrices, given everything else.
> Impute missing values, given the model parameters.
Repeat.
Save the filled-­in data for later analysis.
Repeat.

Analysis Example
Revisiting the earlier health psychology analysis, I used the joint model with random
within-­cluster covariance matrices to create daily diary imputations for the random
coefficient regression from Equation 8.30. Although the imputation model is not per-
fectly compatible with a random slope analysis model, simulation studies suggest it nev-
ertheless performs well in this context (Enders, Hayes, et al., 2018). I generated M = 100
imputations from a sequential MCMC chain with 3,000 burn-in iterations and 3,000
thinning or between-­imputation iterations (i.e., I saved the first data set after 3,000
iterations and saved the remaining data sets every 3,000 computational cycles thereaf-
ter). After creating the multiple imputations, I centered the predictor variables at their
arithmetic averages (the software for this variant of the joint model does not save latent
group means) and used restricted maximum likelihood to fit the random coefficient
model to each data set. Finally, I used Rubin’s (1987) rules to pool the parameter esti-
mates and standard errors and applied the Barnard and Rubin (1999) degrees of freedom
expression for significance tests. Table 8.10 summarizes the multiple imputation point
estimates and standard errors, which closely matched the model-based imputation results
in Table 8.4. The random covariance model appears to provide a close approximation to
an optimal model-based imputation routine, especially when the proportion of between-­
cluster variation is large, as it is here (Enders, Hayes, et al., 2018). The substantive inter-
pretations match the earlier example, so no further discussion is warranted.
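The saving schedule for a sequential chain like this one is simple arithmetic; the helper below (an illustrative function, not part of any package) lists the iterations at which filled-in data sets are written.

```python
def save_points(burn_in, thin, n_imputations):
    """Iterations at which a sequential MCMC chain saves a filled-in data
    set: the first after the burn-in period, then one every `thin` cycles."""
    return [burn_in + thin * m for m in range(n_imputations)]

pts = save_points(burn_in=3000, thin=3000, n_imputations=100)
# pts[0] is iteration 3,000 and pts[-1] is iteration 300,000
```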
TABLE 8.10. Joint Model with Random Covariance Matrices Multiple Imputation Estimates from a Random Coefficient Analysis
Parameter Est. SE t df p FMI
β0 4.21 0.10 40.45 124.88 < .001 0.03
β1 (DAILY PAIN) –0.08 0.02 –3.61 199.04 < .001 0.63
β2 (SLEEP) 0.09 0.01 7.95 167.08 < .001 0.69
β3 (MEAN PAIN) –0.10 0.06 –1.58 124.53 .12 0.03
β4 (PAINACCEPT) 0.01 0.07 0.21 125.20 .84 0.02
β5 (FEMALE) –0.05 0.14 –0.40 125.58 .69 0.02
Intercept var. (σ²b0) 0.59 — — — — —
Covariance (σb0b1) 0.01 — — — — —
Slope var. (σ²b1) 0.01 — — — — —
Residual var. (σ²ε) 0.35 — — — — —
β3 – β1 (Contextual) –0.01 0.07 0.22 124.87 .83 —

Note. FMI, fraction of missing information.

8.8 FULLY CONDITIONAL SPECIFICATION IMPUTATION

The fully conditional specification routine described in Chapter 7 imputes variables one at a time by stringing together a series of regression models, one per incomplete variable. Van Buuren (2011) extended this popular strategy to multilevel data structures
with incomplete level-1 variables, and subsequent work applied the approach to level-2
and level-3 variables (Enders, Keller, et al., 2018; Grund, Lüdtke, & Robitzsch, 2017).
This section describes the fully conditional specification approach implemented in van
Buuren’s popular MICE package (van Buuren et al., 2021) as well as an alternative version
that uses latent response variables and latent group means (Enders, Keller, et al., 2018;
Keller & Enders, 2021). Although the two strategies often produce the same results, the
latter approach is advantageous, because it is compatible with the joint model and natu-
rally accommodates unequal group sizes. Importantly, applications of fully conditional
specification should be limited to random intercept analyses, as applying reverse regres-
sion schemes to random coefficient models introduces bias (Enders et al., 2020; Enders,
Keller, et al., 2018; Grund et al., 2016a, 2018). I return to this issue later in the section.
To illustrate multilevel fully conditional specification more concretely, reconsider
the cluster-­randomized educational experiment and the random intercept analysis model
from Equation 8.18. Recall that pretest scores and the school-­level treatment assignment
indicator were complete, but the four remaining variables had missing data. Consis-
tent with its single-­level counterpart, fully conditional specification deploys regression
models in a round-robin fashion, one for each incomplete variable. A basic imputation
scheme applies a random intercept imputation model to each level-1 variable. The fol-
lowing linear regressions are the imputation models for the posttest problem-­solving
and standardized achievement test variables:
   
$$
\begin{aligned}
\text{PROBSOLVE}_{ij}^{(t)} = {} & \gamma_{01j} + \gamma_{11}\left(\text{STANMATH}_{ij}^{(t-1)}\right) + \gamma_{21}\left(\text{FRLUNCH}_{ij}^{(t-1)}\right) \\
& + \gamma_{31}\left(\text{TEACHEXP}_{j}^{(t-1)}\right) + \gamma_{41}\left(\text{PRETEST}_{ij}\right) + \gamma_{51}\left(\text{CONDITION}_{j}\right) + r_{1ij} \\
\text{STANMATH}_{ij}^{(t)} = {} & \gamma_{02j} + \gamma_{12}\left(\text{FRLUNCH}_{ij}^{(t-1)}\right) + \gamma_{22}\left(\text{PROBSOLVE}_{ij}^{(t)}\right) \\
& + \gamma_{32}\left(\text{TEACHEXP}_{j}^{(t-1)}\right) + \gamma_{42}\left(\text{PRETEST}_{ij}\right) + \gamma_{52}\left(\text{CONDITION}_{j}\right) + r_{2ij}
\end{aligned} \tag{8.50}
$$
To track changes to the imputed data across models, I attach a t superscript to the
incomplete variables to index iterations (e.g., the imputed standardized math scores
on the right side of the problem-­solving imputation model originate from the previous
iteration).
The binary lunch assistance indicator requires a multilevel logistic regression
model with random intercepts.

$$
\begin{aligned}
\ln\left(\frac{\Pr\left(\text{FRLUNCH}_{ij}^{(t)} = 1\right)}{1 - \Pr\left(\text{FRLUNCH}_{ij}^{(t)} = 1\right)}\right) = {} & \gamma_{03j} + \gamma_{13}\left(\text{PROBSOLVE}_{ij}^{(t)}\right) + \gamma_{23}\left(\text{STANMATH}_{ij}^{(t)}\right) \\
& + \gamma_{33}\left(\text{TEACHEXP}_{j}^{(t-1)}\right) + \gamma_{43}\left(\text{PRETEST}_{ij}\right) + \gamma_{53}\left(\text{CONDITION}_{j}\right)
\end{aligned} \tag{8.51}
$$
The logistic model has no within-­cluster variance, because this parameter is a fixed con-
stant. Following Section 8.2, the random intercepts (i.e., γ01j, γ02j, and γ03j) and within-­
cluster residuals (i.e., r1ij and r2ij) are normally distributed. Following established logic,
MCMC creates continuous imputations by drawing random numbers from a normal
distribution centered at a predicted score that incorporates the random effects (e.g.,
Equation 8.12), and the algorithm samples binary outcome scores from a binomial dis-
tribution.

$$
\text{FRLUNCH}_{ij(\text{mis})}^{(t)} \sim \text{Binomial}\left(1, \pi_{ij}\right)
$$
The 1 in the binomial function’s first argument indicates that everyone has a single
score, and πij is the predicted probability of receiving free or reduced-­price lunch (see
Equation 6.53). Conceptually, drawing binomial random numbers is akin to tossing a
biased coin where the probability of a head equals πij. If the biased coin toss produces a
head, the imputation equals 1 and is 0 otherwise.
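The biased-coin-toss step can be written directly. In the sketch below, `eta` stands for the linear predictor from the logistic imputation model (the right side of Equation 8.51, including the random intercept), and the supplied values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def impute_binary(eta):
    """Draw binary imputations from Binomial(1, pi_ij), where eta is the
    linear predictor from the logistic imputation model in Equation 8.51."""
    pi = 1.0 / (1.0 + np.exp(-eta))   # predicted probability of a 1
    return rng.binomial(1, pi)        # one 'biased coin toss' per case

eta = np.array([-2.0, 0.0, 1.5])      # hypothetical linear predictor values
imp = impute_binary(eta)
```

Cases with large positive predictors are almost always imputed as 1, and vice versa, which is exactly the biased-coin intuition described above.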
The previous imputation models do not allow within- and between-­cluster associa-
tions to differ, because they use a single slope coefficient to preserve each bivariate asso-
ciation (joint model imputation imposes no such restriction, and the level-1 and level-2
correlations can differ). This feature isn’t an issue for the random intercept analysis,
which imposes the same structure on the data (i.e., the model represents each level-1
predictor with a single coefficient). To apply fully conditional specification to analyses
that posit distinct covariance structures at level-1 and level-2 (e.g., contextual effects
models, multilevel factor analysis, or covariance structure models), you simply add the
340 Applied Missing Data Analysis

cluster means as level-2 regressors. This strategy mimics the logic of joint model imputa-
tion, and the two approaches often produce equivalent results (Carpenter & Kenward,
2013; Enders et al., 2016; Grund et al., 2017; Mistler & Enders, 2017). I illustrate a model
that uses latent group means in the next section.
Level-1 variables relate to level-2 variables via their group-level averages. As such,
the between-­cluster imputation models are single-­level regressions with cluster means
and level-2 variables as predictors. The imputation model for the teacher experience
variable is as follows:

$$
\begin{aligned}
\text{TEACHEXP}_{j}^{(t)} = {} & \gamma_{04} + \gamma_{14}\left(\overline{\text{PROBSOLVE}}_{j}^{(t)}\right) + \gamma_{24}\left(\overline{\text{STANMATH}}_{j}^{(t)}\right) \\
& + \gamma_{34}\left(\overline{\text{FRLUNCH}}_{j}^{(t)}\right) + \gamma_{44}\left(\overline{\text{PRETEST}}_{j}\right) + \gamma_{54}\left(\text{CONDITION}_{j}\right) + r_{04j}
\end{aligned} \tag{8.52}
$$

The bars over the level-1 regressors convey that the group means are arithmetic aver-
ages of the imputed level-1 scores from iteration t, and the between-­cluster residual is
normally distributed, as before. Importantly, this specification assumes equal cluster
sizes and is incompatible with a joint model when cluster sizes are unbalanced (Car-
penter & ­Kenward, 2013; Enders et al., 2016; Grund et al., 2017; Mistler & Enders,
2017). However, empirical research suggests that biases resulting from unequal group
sizes tend to be relatively small and are most evident when the intraclass correlation or
within-­cluster sample size is very small (Grund et al., 2017). The next section remedies
this shortcoming by marrying fully conditional specification with latent group means.
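For the arithmetic-average specification, the level-2 regressors are just group means of the (possibly imputed) level-1 scores at iteration t. A minimal helper, with toy data:

```python
import numpy as np

def cluster_means(x, cluster):
    """Arithmetic group averages of a (possibly imputed) level-1 variable,
    one per cluster, usable as level-2 regressors like those in Equation 8.52."""
    x, cluster = np.asarray(x, dtype=float), np.asarray(cluster)
    return {int(j): float(x[cluster == j].mean()) for j in np.unique(cluster)}

means = cluster_means([1.0, 2.0, 3.0, 4.0, 6.0], [1, 1, 1, 2, 2])
# means == {1: 2.0, 2: 5.0}
```

Note that each cluster's average is computed from however many observations it happens to contain, which is the source of the balance assumption discussed above: the arithmetic mean ignores how reliably a small cluster estimates its own average.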

Fully Conditional Specification with Latent Variables


This section describes a modification to fully conditional specification that uses latent
response variables in lieu of categorical variables and latent group means instead of
arithmetic averages (Enders, Keller, et al., 2018; Keller & Enders, 2021). This
formulation is potentially advantageous, because it is equivalent to the joint model and
naturally accommodates unequal group sizes.
Fully conditional specification with latent variables can be understood as a repa-
rameterization of the within- and between-­cluster joint models in Equations 8.46 and
8.47, respectively. The equations below leverage the decomposition in Equation 8.8 and
the well-known property that a multivariate normal distribution’s parameters can be
expressed as an equivalent set of linear regression models (Arnold et al., 2001; Liu et al.,
2014). Using generic notation, the within-­cluster regressions are as follows:
$$
\begin{aligned}
Y_{1ij}^{(t)} &= \mu_{1j} + \gamma_{11(W)}\left(Y_{2ij}^{(t-1)} - \mu_{2j}\right) + \gamma_{21(W)}\left(Y_{3ij}^{(t-1)} - \mu_{3j}\right) + \gamma_{31(W)}\left(Y_{4ij}^{*(t-1)} - \mu_{4j}\right) + r_{1ij(W)} \\
Y_{2ij}^{(t)} &= \mu_{2j} + \gamma_{12(W)}\left(Y_{3ij}^{(t-1)} - \mu_{3j}\right) + \gamma_{22(W)}\left(Y_{4ij}^{*(t-1)} - \mu_{4j}\right) + \gamma_{32(W)}\left(Y_{1ij}^{(t)} - \mu_{1j}\right) + r_{2ij(W)} \\
Y_{3ij}^{(t)} &= \mu_{3j} + \gamma_{13(W)}\left(Y_{4ij}^{*(t-1)} - \mu_{4j}\right) + \gamma_{23(W)}\left(Y_{1ij}^{(t)} - \mu_{1j}\right) + \gamma_{33(W)}\left(Y_{2ij}^{(t)} - \mu_{2j}\right) + r_{3ij(W)} \\
Y_{4ij}^{*(t)} &= \mu_{4j} + \gamma_{14(W)}\left(Y_{1ij}^{(t)} - \mu_{1j}\right) + \gamma_{24(W)}\left(Y_{2ij}^{(t)} - \mu_{2j}\right) + \gamma_{34(W)}\left(Y_{3ij}^{(t)} - \mu_{3j}\right) + r_{4ij(W)}
\end{aligned} \tag{8.53}
$$
Centering the regressors in each equation at their latent group means removes all
between-­cluster variation from the level-1 variables (i.e., the γ’s reflect pure within-­
cluster associations) and defines the intercept as the target variable’s latent group mean.
Although it isn’t obvious, level-1 variables correlate with level-2 variables (e.g., teacher
experience and the latent treatment assignment indicator) via these random intercepts.
The bottom equation is a probit model for the latent response variable, which now
replaces the binary variable as a regressor on the right side of the other equations. As
always, setting the variance of the r4ij(W) residuals to 1 establishes a metric.
Next, consider the between-­cluster joint model in Equation 8.47. The six-­dimensional
multivariate normal distribution similarly spawns an equal number of between-­cluster
regressions.

$$
\begin{aligned}
\mu_{1j}^{(t)} = {} & \mu_1 + \gamma_{11(B)}\left(\mu_{2j}^{(t-1)} - \mu_2\right) + \gamma_{21(B)}\left(\mu_{3j}^{(t-1)} - \mu_3\right) + \gamma_{31(B)}\left(\mu_{4j}^{(t-1)} - \mu_4\right) \\
& + \gamma_{41(B)}\left(Y_{5j}^{(t-1)} - \mu_5\right) + \gamma_{51(B)}\left(Y_{6j}^{*(t-1)} - \mu_6\right) + r_{1j(B)} \\
& \vdots \\
Y_{5j}^{(t)} = {} & \mu_5 + \gamma_{15(B)}\left(Y_{6j}^{*(t-1)} - \mu_6\right) + \gamma_{25(B)}\left(\mu_{1j}^{(t)} - \mu_1\right) + \gamma_{35(B)}\left(\mu_{2j}^{(t)} - \mu_2\right) \\
& + \gamma_{45(B)}\left(\mu_{3j}^{(t)} - \mu_3\right) + \gamma_{55(B)}\left(\mu_{4j}^{(t)} - \mu_4\right) + r_{5j(B)} \\
Y_{6j}^{*(t)} = {} & \mu_6 + \gamma_{16(B)}\left(\mu_{1j}^{(t)} - \mu_1\right) + \gamma_{26(B)}\left(\mu_{2j}^{(t)} - \mu_2\right) + \gamma_{36(B)}\left(\mu_{3j}^{(t)} - \mu_3\right) \\
& + \gamma_{46(B)}\left(\mu_{4j}^{(t)} - \mu_4\right) + \gamma_{56(B)}\left(Y_{5j}^{(t)} - \mu_5\right) + r_{6j(B)}
\end{aligned} \tag{8.54}
$$
Consistent with the lunch assistance variable, the treatment assignment indicator
appears as a latent response variable, and the variance of r6j(B) is fixed at 1.
The latent group means are effectively missing data, and each MCMC iteration esti-
mates these quantities by drawing their values from a normal distribution. However,
the conditional distribution of the latent means given all other quantities is complex,
because the group means appear in two models (e.g., μ1j functions as a random intercept
in Y1’s within-­cluster regression in Equation 8.53, and it appears as the outcome in a
between-­cluster model). Because a given latent mean is common to all members of the
level-2 cluster (e.g., all students within a given school share the same group mean), the
within-­cluster model’s contribution to the conditional distribution repeats nj times, once
for each observation in group j. The distribution that generates the latent group means
is the product of the two normal curves below:
$$
\begin{aligned}
p\left(\mu_{1j} \mid Y_{2ij}, Y_{3ij}, Y_{4ij}^{*}, Y_{5j}, Y_{6j}^{*}\right) \propto {} & \prod_{i=1}^{n_j} N_1\left( E\left(Y_{1ij} \mid Y_{2ij}, Y_{3ij}, Y_{4ij}^{*}\right), \sigma^2_{r1(W)} \right) \\
& \times N_1\left( E\left(\mu_{1j} \mid \mu_{2j}, \mu_{3j}, \mu_{4j}, Y_{5j}, Y_{6j}^{*}\right), \sigma^2_{r1(B)} \right)
\end{aligned} \tag{8.55}
$$
In fact, the distribution’s two-part composition is identical to that of an incomplete
level-2 regressor in a Bayesian analysis. The product over the level-1 scores highlights
that latent group averages accommodate unequal group sizes by explicitly conditioning
on the number of within-­cluster observations (e.g., the reliability of each group’s latent
variable increases as group size increases and vice versa).
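Because both curves in Equation 8.55 are normal, their product is also normal, and the usual precision-weighting rules give its parameters. The sketch below illustrates this with a simplification: `within_means` holds each member's within-cluster prediction solved for the group mean (i.e., the score minus the regression terms), so the function only shows how the n_j likelihood contributions and the single between-cluster term combine.

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_latent_mean(within_means, sigma2_w, prior_mean, sigma2_b):
    """Sample a latent group mean from the product of n_j within-cluster
    normal terms and one between-cluster normal term (precision weighting)."""
    n_j = len(within_means)
    prec = n_j / sigma2_w + 1.0 / sigma2_b                 # combined precision
    mean = (np.sum(within_means) / sigma2_w
            + prior_mean / sigma2_b) / prec                # precision-weighted mean
    return rng.normal(mean, np.sqrt(1.0 / prec))
```

Note how n_j enters the precision: larger clusters pull the draw toward their own data and yield less variable latent means, which is precisely how the latent group means accommodate unequal group sizes.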
Analysis Example
Revisiting the cluster-­randomized education study, I applied fully conditional specifica-
tion with latent variables to the random intercept analysis from Equation 8.18. Follow-
ing earlier examples, I created M = 100 filled-­in data sets by saving imputations from
parallel MCMC chains, and I used Rubin’s (1987) rules to pool the restricted maximum
likelihood estimates and standard errors. The bottom panel of Table 8.9 summarizes the
analysis results. Perhaps not surprisingly, fully conditional specification produced results that are effectively equivalent to joint model imputation, as well as to the Bayesian analysis and model-based multiple imputation results from Section 8.2 (see Tables 8.1 and 8.2). As
such, no further discussion of the results is warranted.

Reverse Random Coefficient Imputation


Applications of fully conditional specification should be limited to random intercept
analyses, because applying reverse regression schemes to random coefficient models
introduces bias (Enders et al., 2020; Enders, Keller, et al., 2018; Grund et al., 2016a,
2018). To illustrate why this is the case, consider a simple random slope model with
an incomplete level-1 predictor (e.g., daily pain ratings predicting daily positive affect
scores).

$$
Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + \varepsilon_{ij} = E\left(Y_{ij} \mid X_{ij}\right) + \varepsilon_{ij} \tag{8.56}
$$

$$
Y_{ij} \sim N_1\left( E\left(Y_{ij} \mid X_{ij}\right), \sigma^2_{\varepsilon} \right)
$$

A seemingly reasonable way to impute the explanatory variable is to employ a reverse random coefficient model that allows the influence of Y on X to vary across level-2 units (Grund et al., 2016a).

$$
X_{ij} = \gamma_{0j} + \gamma_{1j} Y_{ij} + r_{ij} = E\left(X_{ij} \mid Y_{ij}\right) + r_{ij} \tag{8.57}
$$

$$
X_{ij} \sim N_1\left( E\left(X_{ij} \mid Y_{ij}\right), \sigma^2_{r} \right)
$$
Although the two equations share the same structure and appear to target the same
group-­specific association, they are logically inconsistent. Revisiting a concept from
Chapter 7, the regression models are incompatible, because the two univariate normal
distributions cannot originate from the same multivariate distribution (unless the ran-
dom slope variance equals zero). In practical terms, incompatibility means that the nor-
mal distribution of X in Equation 8.57 is mathematically impossible given the composi-
tion of the analysis and the model-­implied distribution of Y in Equation 8.56 (and vice
versa).
Revisiting the Bayesian specification for a random coefficient model provides addi-
tional insight into why reverse regression gives flawed imputations. Returning to the
distribution of missing values in Equation 8.29, the random coefficients induce het-
eroscedasticity, such that the variation of the imputations in that equation depends on
the magnitude of a group's random slope (or more accurately, its square, β1j²). In contrast, the normal distribution in Equation 8.57 is misspecified, because it says that all imputations
have the same variance, regardless of group membership. The practical consequence of
this misspecification is that slope variance estimates are too small, and other parameters
may also exhibit bias (Enders et al., 2020; Enders, Hayes, et al., 2018; Enders, Keller,
et al., 2018; Erler et al., 2016; Grund et al., 2016a, 2018). As such, you should avoid
the reverse random coefficient specification and use Bayesian estimation or model-
based multiple imputation. You may recall that we previously encountered the same
misspecification problem when applying reverse regression to moderated regression
analyses (i.e., just-­another-­variable imputation). In fact, random slope models are just a
special type of moderated regression that features the product of a level-1 regressor and
level-2 latent variable.
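The heteroscedasticity argument has a concrete algebraic signature. Suppose, purely for illustration, that X is normally distributed within cluster j with constant variance σ²x around its latent group mean; the standard conditional-variance result for a bivariate normal distribution then gives the variance of X given Y in that cluster:

$$
\mathrm{Var}\left(X_{ij} \mid Y_{ij}, j\right) = \sigma_x^2 - \frac{\left(\beta_{1j}\,\sigma_x^2\right)^2}{\beta_{1j}^2\,\sigma_x^2 + \sigma_\varepsilon^2} = \frac{\sigma_x^2\,\sigma_\varepsilon^2}{\beta_{1j}^2\,\sigma_x^2 + \sigma_\varepsilon^2}
$$

The correct imputation variance shrinks as the squared random slope grows, whereas the reverse model in Equation 8.57 assigns every cluster the same residual variance σ²r.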

8.9 MAXIMUM LIKELIHOOD ESTIMATION

At least for now, maximum likelihood estimation is arguably less capable than Bayesian
estimation and multiple imputation, because it handles a more limited set of multilevel
missing data problems. Virtually any software package that estimates mixed models can
accommodate incomplete outcomes. Consistent with classic regression models, analyz-
ing the observed data yields accurate estimates when missing values are restricted to
the dependent variable and missingness is due to explanatory variables (Little, 1992;
von Hippel, 2007). This scenario could arise, for example, in a longitudinal study where
baseline covariates are complete but repeated measurements are incomplete due to inter-
mittent missingness or attrition. In fact, no imputation is needed, because this missing
data pattern is simply a complete-­data estimation problem with unbalanced group sizes
(i.e., each level-2 individual could have a different number of level-1 repeated measure-
ments).
The situation becomes more complicated when explanatory variables have missing
values. Currently, dedicated multilevel modeling software packages have limited capac-
ity for handling incomplete predictors, and most programs simply delete observations
with missing covariates. Not only does deletion assume a stringent MCAR process (i.e.,
missingness is haphazard and unrelated to the data), but it can also dramatically reduce
the sample size, particularly when an entire cluster is removed, because its level-2
scores are incomplete. The notable exception is the HLM program (Raudenbush et al.,
2019), which addresses incomplete predictors using an approach developed by Shin and
Raudenbush (2007, 2013). This estimator assumes that incomplete variables are multi-
variate normal, and it currently has no capacity for handling categorical predictors or
random slopes between incomplete level-1 variables (these limitations do not apply to
complete covariates). The procedure is conceptually like joint model imputation and
leverages comparable normal distributions.
To describe the HLM approach, reconsider the random intercept analysis model
from Equation 8.18. Multilevel modeling software packages typically assume explana-
tory variables are fixed by design, meaning that no distributional assumptions are
applied to these variables. As you know, this specification is antithetical to any type
of missing data handling. Shin and Raudenbush (2007, 2013) address this problem
by deriving transformations that reparameterize the analysis model into within- and
between-­cluster normal distributions like those in Equations 8.46 and 8.47. After repa-
rameterizing the model, Shin and Raudenbush use an EM algorithm to estimate level-­
specific mean vectors and covariance matrices. The expectation and maximization steps
at each level are fundamentally similar to a conventional EM algorithm (Dempster et al.,
1977; Little & Rubin, 2002), because each covariance matrix reflects a single source of
variation. Finally, a reverse transformation converts maximum likelihood estimates of
the mean vector and covariance matrices to the desired regression model parameters.
Multilevel structural equation modeling is a second option for implementing maxi-
mum likelihood missing data handling. Like the HLM program, most multilevel struc-
tural equation modeling estimators currently assume that incomplete predictors are
multivariate normal. Although some software packages do allow for incomplete random
slope predictors, simulation studies suggest that maximum likelihood is prone to sub-
stantial biases (Enders et al., 2020; Enders, Hayes, et al., 2018), presumably, because the
incomplete predictors condition on the outcome in a way that doesn’t account for the
heteroscedasticity of their missing values (see Equation 8.29). For this reason, I restrict
the focus to regression models with random intercepts. A number of accessible descrip-
tions of multilevel structural models are available in the literature (Mehta & Neale,
2005; Rabe-Hesketh, Skrondal, & Zheng, 2012; Stapleton, 2013), as are technically ori-
ented papers that provide a deeper dive into the mechanics of estimation (Asparouhov
& Muthén, 2007; Bentler & Liang, 2011; Liang & Bentler, 2004; Muthén & Asparouhov,
2008; Rabe-­Hesketh et al., 2004, 2012).
Returning to ideas from Chapter 3, a structural equation model views an individu-
al’s responses as a multivariate normal data vector (see Equation 3.26). The framework
is a flexible tool for implementing maximum likelihood, because it allows researchers
to structure the normal distribution’s parameters as regression models, potentially with
latent variables. Importantly, individuals need not contribute the same amount of infor-
mation to estimation, and each person’s likelihood expression can have a different pat-
tern or number of observed responses in Yi. When confronted with missing values, the
estimator uses analytic expressions such as expectations to replace the missing parts of
the data instead of imputing the missing values themselves.
A multilevel structural equation model leverages the same multivariate normal
distribution function, but Yi functions as a correlated vector of exchangeable level-1
observations nested within a level-2 unit i (i.e., clusters become the unit of analysis). To
illustrate, consider an empty random intercept model for the dependent variable, math
problem solving. The multilevel structural equation model defines each level-1 observa-
tion (e.g., student test score) as an indicator of a cluster-­level latent factor. The following
equation shows the factor model for a particular school i with 20 students:

$$
\mathbf{Y}_i = \begin{bmatrix} \text{PROBSOLVE}_{1i} \\ \text{PROBSOLVE}_{2i} \\ \vdots \\ \text{PROBSOLVE}_{20i} \end{bmatrix} = \boldsymbol{\upsilon} + \boldsymbol{\Lambda}\boldsymbol{\beta}_i + \boldsymbol{\varepsilon}_i = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \left(\beta_{0i}\right) + \begin{bmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \vdots \\ \varepsilon_{20i} \end{bmatrix} \tag{8.58}
$$
Fixing the measurement intercepts in υ to zero transmits the mean structure information to the factor, such that β0i represents a level-2 latent group mean. Setting the
factor loading matrix equal to a vector of 1’s treats student scores as exchangeable,
equivalent indicators of the latent factor (by extension, all residuals share the same
variance).
The structural equation model parameters combine to produce predictions about
the population means and covariance matrix. Following earlier notation from Chap-
ter 3, the model-­predicted or model-­implied moments for the empty random intercept
model are μ(θ) and Σ(θ).

$$
\boldsymbol{\mu}(\boldsymbol{\theta}) = \boldsymbol{\Lambda} \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_0 \\ \vdots \\ \beta_0 \end{bmatrix}
\qquad
\boldsymbol{\Sigma}(\boldsymbol{\theta}) = \boldsymbol{\Lambda} \boldsymbol{\Psi} \boldsymbol{\Lambda}' + \boldsymbol{\Theta} =
\begin{bmatrix}
\sigma^2_{b_0} + \sigma^2_{\varepsilon} & \sigma^2_{b_0} & \cdots & \sigma^2_{b_0} \\
\sigma^2_{b_0} & \sigma^2_{b_0} + \sigma^2_{\varepsilon} & \cdots & \sigma^2_{b_0} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma^2_{b_0} & \sigma^2_{b_0} & \cdots & \sigma^2_{b_0} + \sigma^2_{\varepsilon}
\end{bmatrix}
\tag{8.59}
$$

A conventional multilevel model induces the same model-­implied mean and covariance
structure. Mehta and Neale (2005) provide an accessible tutorial on multilevel struc-
tural models that highlight linkages with the conventional mixed modeling framework.
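To make the mapping from the SEM matrices to the model-implied moments in Equation 8.59 concrete, the short sketch below builds μ(θ) and Σ(θ) directly from Λ, Ψ, and Θ. The function name is invented for this example, and the parameter values are the Table 8.11 point estimates, used purely for illustration:

```python
# Reproduce the model-implied moments from Equation 8.59:
# mu(theta) = Lambda * beta and Sigma(theta) = Lambda * Psi * Lambda' + Theta.

def implied_moments(n, beta0, var_b0, var_e):
    """Model-implied mean vector and covariance matrix for a cluster of size n."""
    lam = [1.0] * n                        # loadings fixed to 1 (exchangeable scores)
    mu = [l * beta0 for l in lam]          # intercepts in upsilon fixed to 0
    # Psi = [var_b0] and Theta = var_e * I yield a compound-symmetric matrix
    sigma = [[var_b0 + (var_e if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return mu, sigma

mu, sigma = implied_moments(n=3, beta0=29.29, var_b0=3.26, var_e=24.43)
print(mu)          # every element equals beta0
print(sigma[0])    # diagonal: var_b0 + var_e; off-diagonal: var_b0
```

The compound-symmetric structure that a conventional random intercept model induces falls out automatically: the common off-diagonal element is the between-cluster variance, and the diagonal adds the within-cluster residual variance.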
Maximum likelihood estimation for multilevel structural equation models bor-
rows heavily from concepts in Chapter 3. For example, the observed-­data log-­likelihood
replaces the population mean vector and covariance matrix with their model-­implied
counterparts (see Equation 3.26), and the goal of estimation is to find the random inter-
cept model parameters that minimize the discrepancies between the observed data and
the model-­implied moments in μ(θ) and Σ(θ). Methodologists have developed EM algo-
rithms for this purpose that are similar to those in Chapter 3 (Bentler & Liang, 2011;
Liang & Bentler, 2004; Poon & Lee, 1998; Raudenbush, 1995).
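The cluster-by-cluster evaluation of the observed-data log-likelihood can be sketched in a few lines. Because Σ(θ) for the empty random intercept model is compound symmetric, its determinant and inverse have closed forms, and clusters with different numbers of observed scores simply contribute data vectors of different lengths. The helper name and data values below are hypothetical, not any package's API:

```python
import math

def cluster_loglik(y, beta0, var_b0, var_e):
    """Observed-data log-likelihood contribution for one cluster.

    y contains only that cluster's observed level-1 scores, so clusters
    can contribute different numbers of responses. Compound symmetry gives
    closed-form determinant and inverse:
        det = var_e**(n - 1) * (var_e + n * var_b0)
        inv = (1 / var_e) * (I - var_b0 / (var_e + n * var_b0) * J)
    where J is an n-by-n matrix of 1's.
    """
    n = len(y)
    dev = [yi - beta0 for yi in y]                 # y - mu(theta)
    total = sum(dev)
    ss = sum(d * d for d in dev)
    logdet = (n - 1) * math.log(var_e) + math.log(var_e + n * var_b0)
    quad = (ss - var_b0 * total * total / (var_e + n * var_b0)) / var_e
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + quad)

# Clusters with different numbers of observed responses both contribute:
print(cluster_loglik([31.0, 28.5, 30.2], beta0=29.29, var_b0=3.26, var_e=24.43))
print(cluster_loglik([27.0], beta0=29.29, var_b0=3.26, var_e=24.43))
```

Summing these contributions over clusters gives the quantity that an EM or quasi-Newton routine would maximize with respect to the model parameters.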

Analysis Example
Revisiting the earlier math problem-­solving analysis, I used a multilevel structural equa-
tion modeling framework to estimate the random intercept analysis from Equation 8.18.
The model incorrectly treats the incomplete binary lunch assistance indicator as a nor-
mally distributed variable, but limited computer simulation evidence suggests that this
specification may be fine for binary variables (Muthén et al., 2016). Analysis scripts are
available on the companion website.
Table 8.11 gives the focal model parameters and their standard errors. The table
omits predictor model parameters, because these are not the substantive focus. The
maximum likelihood estimates were mostly indistinguishable from those of multiple
imputation and Bayesian estimation; as described previously, the Bayesian analysis had
a somewhat different between-­cluster variance, but other parameters were similar. This
analysis, like many others in the book, underscores the point that different analytic
methods that apply similar assumptions generally give the same answer. It is important
to reiterate that maximum likelihood missing data handling is best suited for random
intercept analyses, as the few estimators that currently allow for incomplete random
slope predictors are prone to bias.

TABLE 8.11. Maximum Likelihood Estimates from a Random Intercept Analysis

Parameter                  Est.     SE      z        p
β0                         29.29    1.99    14.70    < .001
β1 (PRETEST)                0.27    0.03     8.34    < .001
β2 (STANMATH)               0.19    0.02    12.48    < .001
β3 (FRLUNCH)               –0.50    0.39    –1.28      .20
β4 (TEACHEXP)               0.10    0.06     1.57      .12
β5 (CONDITION)              2.13    0.75     2.86      .004
Intercept var. (σ²b0)       3.26    0.95     3.41    < .001
Residual var. (σ²ε)        24.43    1.24    19.70    < .001

8.10 SUMMARY AND RECOMMENDED READINGS

This chapter has described missing data handling for multilevel data structures. Ideas
established in earlier chapters readily extend to multilevel regression. For example,
imputations still equal predicted scores plus noise, but cluster-­specific regressions gen-
erate the predictions, and within-­cluster variation defines the spread of the random
residuals. Similarly, missing dependent variable scores depend only on the focal analysis
model’s parameters and random effects, whereas incomplete predictors require one or
more additional supporting models.
Chapter 7 classified multiple imputation procedures into two buckets according to
the degree of similarity between the imputation and analysis models: An agnostic impu-
tation strategy deploys a model that differs from the substantive analysis, and a model-
based imputation procedure invokes the same focal model as the secondary analysis
(perhaps with additional auxiliary variables). These classifications emphasize that an
analysis model’s composition—­in particular, whether it includes nonlinear effects—­
determines the type of imputation strategy that works best. This distinction also applies
to multilevel data sets, where model-based missing data-handling strategies are demonstrably superior for analyses that feature random coefficient, interaction, or curvilinear effects. In contrast, multilevel extensions of the joint model and fully conditional specification (agnostic imputation procedures) are best suited for random intercept models. I
recommend the following articles for readers who want additional details on topics from
this chapter:

Enders, C. K., Du, H., & Keller, B. T. (2020). A model-based imputation procedure for multilevel
regression models with random coefficients, interaction effects, and other nonlinear terms.
Psychological Methods, 25, 88–112.

Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and
evaluation of joint modeling and chained equations imputation. Psychological Methods,
21, 222–240.

Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of missing covariate values in
multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48,
640–649.

Quartagno, M., & Carpenter, J. R. (2016). Multiple imputation for IPD meta-­analysis: Allowing for
heterogeneity and studies with missing covariates. Statistics in Medicine, 35, 2938–2954.

Schafer, J. L., & Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-­
effects models with missing values. Journal of Computational and Graphical Statistics, 11,
437–457.

van Buuren, S. (2011). Multiple imputation of multilevel data. In J. J. Hox & J. K. Roberts (Eds.),
Handbook of advanced multilevel analysis (pp. 173–196). New York: Routledge.

Yucel, R. M. (2011). Random-covariances and mixed-effects models for imputing multivariate multilevel continuous data. Statistical Modelling, 11, 351–370.
9

Missing Not at Random Processes

9.1 CHAPTER OVERVIEW

Analyses that assume a conditionally MAR process have been our go-tos throughout the
book. This mechanism stipulates that unseen score values carry no unique information
about missingness beyond that contained in the observed data. This assumption is con-
venient, because there is no need to specify and estimate a model for the missing data
process. Although the MAR assumption is quite reasonable for a broad range of practi-
cal applications, in some cases it may be plausible that the unseen score values do carry
unique information about missingness, in which case the process is called missing not at
random (MNAR). The two major modeling frameworks for MNAR processes—­selection
models and pattern mixture models—­mitigate nonresponse bias by introducing a model
that describes the occurrence of missing data, but they do that in very different ways: A
typical selection model features a regression equation with a missing data indicator as a
dependent variable, whereas a pattern mixture model uses the indicator as a predictor.
As you will see, models for MNAR processes require strict and unverifiable assump-
tions; selection models leverage untestable normal distributional assumptions, whereas
pattern mixture models generally require you to specify values for one or more inesti-
mable population parameters. Ultimately, there is no way to confirm these requirements
are satisfied, and model misspecifications could produce estimates that contain more
bias than those from a MAR analysis. A common view is that MNAR models are best
suited for sensitivity analyses that examine the stability of one’s substantive conclu-
sions across different assumptions. Beunckens, Molenberghs, Thijs, and Verbeke (2007,
p. 477) define a sensitivity analysis as “one in which several statistical models are con-
sidered simultaneously and/or where a statistical model is further scrutinized using
specialized tools (such as diagnostic measures).” I adopt this definition throughout the
chapter, and the data analysis examples illustrate how to implement this strategy and
interpret potentially conflicting results.


9.2 MISSING NOT AT RANDOM PROCESSES REVISITED

To refresh ideas from Section 1.3, the MNAR mechanism states that the probability of
missingness is related to the unseen score values. Gomer and Yuan (2021) further sepa-
rate this mechanism into focused and diffuse subtypes that describe different roles for
the observed data. I adopt this distinction throughout the chapter, because it provides a
useful framework for structuring sensitivity analyses that consider competing explana-
tions about the missing data process.
Gomer and Yuan (2021) define a focused MNAR process as one where missingness
depends only on the unseen score values in Y(mis). The conditional distribution of the
missing data indicators is

$$\Pr\left( M_i = 1 \mid \mathbf{Y}_i^{(mis)}, \boldsymbol{\phi} \right) \tag{9.1}$$

where Mi is a vector of binary missingness codes for individual i (each indicator equals
1 if missing and 0 otherwise), Yi(mis) represents the person’s unseen score values, and φ
contains model parameters that describe the occurrence of missing data. To illustrate,
consider the depression scores from the chronic pain data set on the companion website.
A focused process would occur if the unseen depression scores were the sole determi-
nant of missingness (e.g., participants with the worst symptoms leave the study to seek
treatment elsewhere).
A diffuse MNAR mechanism is one where nonresponse depends on both the unseen
score values in Y(mis) and the observed data in Y(obs). The conditional distribution of the
missing data indicators for this process is as follows:

$$\Pr\left( M_i = 1 \mid \mathbf{Y}_i^{(obs)}, \mathbf{Y}_i^{(mis)}, \boldsymbol{\phi} \right) \tag{9.2}$$

Applied to the chronic pain study, a diffuse mechanism would occur if participants
with the worst symptoms (i.e., high values of Yi(mis)) leave the study to seek treatment
elsewhere and participants with low perceived control over their pain (i.e., low values of
Yi(obs)) miss assessments that coincide with acute disruptions in their day-to-day activi-
ties. The literature suggests that diffuse processes are somewhat harder to model (e.g.,
require much larger sample sizes) and are capable of inducing greater nonresponse bias
than focused ones (Du, Enders, Keller, Bradbury, & Karney, 2021; Gomer & Yuan, 2021;
Zhang & Wang, 2012).
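A small simulation makes the distinction concrete. The sketch below deletes outcome scores through a probit-style selection rule: when the coefficient on the observed predictor is zero, only the unseen score drives missingness (focused); when it is nonzero, the observed data contribute as well (diffuse). The variable names and parameter values are hypothetical and only loosely mimic the chronic pain example:

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate(n, gamma_y, gamma_x, seed=1):
    """Generate (x, y) pairs, then delete y through a probit selection rule.

    gamma_x = 0 mimics a focused process (missingness depends only on the
    unseen y); gamma_x != 0 mimics a diffuse process that also draws on the
    observed x. All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(20.0, 6.0)                    # observed predictor
        y = 30.0 - 0.5 * x + rng.gauss(0.0, 4.0)    # outcome, may be deleted
        z_y = (y - 20.0) / 5.0                      # standardized propensity inputs
        z_x = (x - 20.0) / 6.0
        p_miss = norm_cdf(-0.7 + gamma_y * z_y + gamma_x * z_x)
        data.append((x, None if rng.random() < p_miss else y))
    return data

focused = simulate(5000, gamma_y=1.0, gamma_x=0.0)
diffuse = simulate(5000, gamma_y=0.7, gamma_x=-0.7)
print(sum(y is None for _, y in focused) / len(focused))  # proportion of deleted y's
```

Deleting scores this way, then fitting a model that assumes a conditionally MAR process, reproduces the kind of nonresponse bias the chapter's artificial data examples illustrate.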

9.3 MAJOR MODELING FRAMEWORKS

The previous definitions imply that nonresponse is entangled with the outcome vari-
able in a way that cannot be ignored when analyzing the data. The two major model-
ing frameworks for MNAR processes—­selection models and pattern mixture models—­
mitigate nonresponse bias by introducing a model that describes the occurrence of
missing data, but they do that in different ways: A selection model features a regression

equation with a missing data indicator as a dependent variable, whereas a pattern mix-
ture model uses the missing data indicator as a predictor. Both approaches start with a
multivariate distribution for the data and the missingness indicators, and factorize this
function into the product of two or more separate distributions. The basic idea mimics
factored regression models from previous chapters.
Using generic notation, the selection model factorizes the joint distribution of the
analysis variables and the missing data indicators (Y and M, respectively) into the fol-
lowing sequence of univariate functions:

$$f\left( \mathbf{Y}_i, \mathbf{M}_i \right) = f\left( \mathbf{M}_i \mid \mathbf{Y}_i \right) \times f\left( \mathbf{Y}_i \right) \tag{9.3}$$

Each “f of something” represents a probability distribution induced by a statistical model. The f(Mi|Yi) term corresponds to a probit or logistic regression with the analysis variables predicting the missing data indicators, and f(Yi) aligns with the analysis
model (to keep the notation simple, I do not differentiate Y’s and X’s). Figure 9.1 shows
path diagrams for focused and diffuse selection models. Readers familiar with statistical
mediation (MacKinnon, 2008) will recognize the diagrams as a single mediator model
where quantitative differences on the analysis variables predict missingness via direct
and indirect effects.

FIGURE 9.1. Panel (a) shows a path diagram of a focused selection model where only the incomplete dependent variable predicts missingness, and panel (b) depicts a diffuse selection model where both analysis variables predict missingness.

In contrast, pattern mixture models correspond to a moderated process where participants with incomplete and complete data define qualitatively different subgroups
with distinct parameter values. This framework factorizes the multivariate distribution
of the data and the missing data indicators into a sequence of distribution functions
with the missing data indicators on the right side of the vertical pipe.

$$f\left( \mathbf{Y}_i, \mathbf{M}_i \right) = f\left( \mathbf{Y}_i \mid \mathbf{M}_i \right) \times f\left( \mathbf{M}_i \right) \tag{9.4}$$

The f(Yi|Mi) term conveys that model parameters differ by missing data pattern, and f(Mi)
is a model that describes the pattern proportions. These proportions serve as weights for
combining pattern-­specific estimates into population-­level quantities that average over
the distribution of missing data. Figure 9.2 shows path diagrams for focused and diffuse
pattern mixture models. The missing data indicator in the top panel creates mean dif-
ferences on the dependent variable, and the dashed line in the bottom figure represents
a moderation effect (i.e., product term) where the influence of X on Y differs for people
with missing data.
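The weighting idea behind the f(Mi) term is simple arithmetic. In this toy sketch, the pattern proportions and pattern-specific means are hypothetical, and the mean for the missing-data pattern is exactly the kind of quantity that rests on untestable outside assumptions:

```python
# Pattern mixture averaging: pattern-specific estimates combine into a
# population-level quantity using the pattern proportions as weights.
# All numbers are hypothetical.

patterns = {
    "complete": {"proportion": 0.75, "mean": 18.2},
    "missing":  {"proportion": 0.25, "mean": 23.0},  # inestimable without assumptions
}

marginal_mean = sum(p["proportion"] * p["mean"] for p in patterns.values())
print(marginal_mean)  # 0.75 * 18.2 + 0.25 * 23.0 = 19.4
```

Varying the assumed mean for the missing-data pattern and recomputing the weighted average is the essence of a pattern mixture sensitivity analysis.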
Selection and pattern mixture models are equivalent in the sense that they attempt
to describe the same multivariate distribution, but the similarities stop there. For one,
translating the generic factorizations from Equations 9.3 and 9.4 into formal statisti-

FIGURE 9.2. Panel (a) shows a path diagram of a focused pattern mixture model where the missing data indicator predicts the dependent variable, and panel (b) depicts a diffuse pattern mixture model where the indicator moderates the influence of X on Y.

cal models requires different inputs and assumptions, and applying the two modeling
frameworks to the same data can yield very different estimates. From a practical per-
spective, the two modeling frameworks tell a different story about the missing data;
selection models treat missingness as an outcome that depends on the analysis vari-
ables, whereas pattern mixture models treat missingness as a qualitative dimension that
moderates the focal model parameters. Either or both descriptions could be reasonable
for a given application.

9.4 SELECTION MODELS FOR MULTIPLE REGRESSION

Selection models for missing data trace to Heckman’s (1976, 1979) seminal work on sam-
ple selection, truncation, and limited dependent variables. Heckman’s work spawned a
great deal of interest in the econometrics literature, and there is now a considerable body
of methodological research devoted to his approach, as well as countless applications. A
selection model for missing data pairs the focal regression with an additional probit or
logistic model for the binary missingness indicator (see Figure 9.1). Heckman’s papers
described a two-step limited information method that first estimates the missingness
model, after which it uses a function of the resulting parameter estimates to formulate
a corrective variable that appears in the focal model. Winship and Mare (1992) and
Puhani (2000) provide good summaries of this estimator and the early literature. Cut
to today, and we can use factored or sequentially specified regressions to estimate these
models in either the likelihood or Bayesian frameworks (Du et al., 2021; Ibrahim et al.,
2002; Ibrahim et al., 2005; Ibrahim, Lipsitz, & Chen, 1999; Lipsitz & Ibrahim, 1996;
Lüdtke et al., 2020a, 2020b).
Continuing with the bivariate example involving depression and perceived control
over pain, the selection model pairs the focal analysis with an additional logistic or probit
regression for the binary missing data indicator. I focus on the latter for consistency with
earlier material. As a quick recap, probit regression envisions binary scores originat-
ing from a latent response variable that now represents a normally distributed propen-
sity for missing data. The model also includes a threshold parameter τ that divides the
latent response distribution into two segments, such that participants with missing and
observed data have latent scores above and below the threshold, respectively. The link
between the latent response variable and its manifest missing data indicator is as follows:
$$M_i = \begin{cases} 0 & \text{if } M_i^* \le \tau \\ 1 & \text{if } M_i^* > \tau \end{cases} \tag{9.5}$$
Following the specification from Section 6.2, the threshold parameter is fixed at 0 to
identify the latent response variable’s mean structure.
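In code, the link in Equation 9.5 is just a threshold comparison on the latent propensity (a toy sketch; τ is fixed at 0 per the identification constraint):

```python
def missing_indicator(m_star, tau=0.0):
    """Map the latent missingness propensity M* onto the binary indicator M."""
    return 1 if m_star > tau else 0

# Latent scores above the threshold flag missing values:
print([missing_indicator(m) for m in (-1.3, -0.2, 0.4, 2.1)])  # [0, 0, 1, 1]
```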
The simplest selection model is one for a focused MNAR process where only the
dependent variable appears in the missingness model.

$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i = E\left( Y_i \mid X_i \right) + \varepsilon_i \\
Y_i &\sim N_1\!\left( E\left( Y_i \mid X_i \right), \sigma^2_{\varepsilon} \right) \\
M_i^* &= \gamma_0 + \gamma_1 Y_i + r_i = E\left( M_i^* \mid Y_i \right) + r_i \\
M_i^* &\sim N_1\!\left( E\left( M_i^* \mid Y_i \right), 1 \right)
\end{aligned}
\tag{9.6}
$$

The second model is a probit regression with the latent response variable as the out-
come. As always, fixing the model’s residual variance to 1 establishes a metric. Fig-
ure 9.1a shows a path diagram of the two regressions that effectively comprise a single
mediator model in which the influence of the explanatory variable on the missing data
indicator is transmitted indirectly via the dependent variable.
To make the previous model more concrete, I used computer simulation to create an
artificial data set based on the perceived control over pain and depression variables from
the chronic pain data, and I deleted 25% of the artificial depression scores to mimic a
strong selection process where participants with high depression scores are more likely
to have missing data (e.g., those with the worst symptoms leave the study to seek treat-
ment elsewhere). Figure 9.3 shows the resulting scatterplot of the data, with gray circles
representing complete cases and black crosshairs denoting partial data records with
missing depression scores. The contour rings convey the perspective of a drone hovering over the peak of the bivariate normal population distribution, with smaller contours denoting higher elevation (and vice versa). The graph depicts a systematic process where the missing values are primarily located above the regression line in the upper left quadrant of the plot.

FIGURE 9.3. Scatterplot of a focused MNAR process where only the dependent variable determines missingness. Gray circles represent complete cases, and black crosshairs denote partial data records with missing depression scores.
Section 1.8 presented computer simulation results illustrating the biases that result
from applying a model for conditionally MAR data to an MNAR process like the one in
Figure 9.3. To illustrate the problem, Figure 9.4 shows a single data set from a model-
based imputation routine that treats the missing values as a function of the predic-
tor variable. The MCMC algorithm incorrectly intuits that the unobserved depression
scores should be evenly dispersed around a regression line, as you would expect from
a conditionally MAR process. To accommodate that expectation, the estimator favors
a biased regression line (the dashed line) that fills in the wrong part of the population
distribution.
FIGURE 9.4. A single data set from a model-based imputation routine that treats the missing values as a function of the explanatory variable. The MCMC algorithm evenly disperses imputes (the black crosshairs) around a biased regression line (the dashed line) that fills in the wrong part of the population distribution.

A more complex selection model for a diffuse MNAR process includes one or more additional predictor variables in the missingness equation. Using generic notation, a model that includes both depression and perceived control over pain in the selection equation is as follows:

$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i = E\left( Y_i \mid X_i \right) + \varepsilon_i \\
Y_i &\sim N_1\!\left( E\left( Y_i \mid X_i \right), \sigma^2_{\varepsilon} \right) \\
M_i^* &= \gamma_0 + \gamma_1 Y_i + \gamma_2 X_i + r_i = E\left( M_i^* \mid Y_i, X_i \right) + r_i \\
M_i^* &\sim N_1\!\left( E\left( M_i^* \mid Y_i, X_i \right), 1 \right)
\end{aligned}
\tag{9.7}
$$

Figure 9.1b shows a path diagram of the model, which corresponds to a partially medi-
ated process in which the explanatory variable uniquely predicts missingness after con-
trolling for the unseen values of the outcome variable.
I again used computer simulation to create an artificial data set with missing data
patterns based on the previous model, and I deleted 25% of the artificial depression
scores to mimic a strong selection process where participants with high levels of depres-
sion or lower levels of perceived control over pain are more likely to have missing val-
ues (e.g., those with the worst symptoms leave the study to seek other treatment, and
participants with low control over their pain miss assessments that coincide with acute
disruptions in their day-to-day activities). I kept the overall strength of the selection
process the same as it was before, but both variables now contribute equally to missing-
ness. Figure 9.5 shows the resulting scatterplot of the data, with gray circles again repre-
senting complete cases and black crosshairs denoting partial data records with missing
depression scores. The missing depression scores are still primarily located above the
population regression line, but a portion of the black crosshairs have relocated to the
lower left quadrant of the plot.

Important Assumptions
Estimating the selection equation is a formidable task, because the latent response
variable is completely missing and the outcome is partially missing from an unknown
region of the distribution. This seemingly impossible charge requires a strict bivariate
normality assumption for the model’s two residuals. In fact, the normality assumption
is the glue that holds the two models together and makes estimation possible when the
regressions share the same variables (Little & Rubin, 1987, p. 230; Puhani, 2000; Sar-
tori, 2003), as is the case in Equation 9.7. As a practical matter, estimation works best
when the selection model includes variables that do not appear in the analysis model (or
vice versa). Auxiliary variables that correlate with the missing data indicator but not the
analysis variables are good candidates for achieving this separation.
Eliminating nonresponse bias also requires that the missingness model is approxi-
mately correct. Generally, omitting an important determinant of missingness from the
selection equation can introduce substantial bias, while overfitting the model with one
or two unnecessary predictors can cause extraordinarily noisy estimates with large sampling variation. While it is most important to specify the right set of variables, specifying the wrong functional forms can also introduce bias.

FIGURE 9.5. Scatterplot of a diffuse MNAR process where the dependent variable and predictor explain missingness. Gray circles represent complete cases, and black crosshairs denote partial data records with missing depression scores.

For example, if you thought that participants with high or low depression scores are more likely to have missing data (e.g., those with mild symptoms leave the study, because they no longer feel treatment is necessary, whereas those with acute symptoms leave the study to seek treatment elsewhere), then the selection equation should include a curvilinear effect, as shown below:

$$M_i^* = \gamma_0 + \gamma_1 Y_i + \gamma_2 Y_i^2 + r_i \tag{9.8}$$

Here, again, the model is not robust to misspecification, and including the wrong con-
figuration of effects can impact parameter estimates.

How Does the Selection Model Mitigate Bias?


A correctly specified selection model can mitigate or eliminate nonresponse bias, but
the previous regression equations don’t offer much insight into how it achieves this feat.
Heckman’s (1976, 1979) original two-step estimator was somewhat more transparent,
as it used functions of the missingness model parameters to create a corrective variable
that appeared as an additional regressor in the focal analysis. The factored regression
specification implements an analogous correction that appears in the distribution of
missing values.

Applying the generic factorization from Equation 9.3 to the diffuse selection model
from Equation 9.7 gives the factored regression (sequential) specification below:

$$f\left( Y, M, X \right) = f\left( M \mid Y, X \right) \times f\left( Y \mid X \right) \times f\left( X \right) \tag{9.9}$$

The first term to the right of the equals sign corresponds to the missingness model, the
second term is the focal model (e.g., the regression of depression on perceived control
over pain), and the rightmost term is the marginal (overall) distribution of the predictor.
By now this should look familiar, because we’ve applied this specification throughout
the book.
Because the dependent variable acts as a predictor in the missingness model, the
distribution of its missing values mirrors the two-part function for an incomplete regres-
sor (see Section 5.3). It is instructive to look at the analytic distribution of the missing
values to draw connections to earlier material. Dropping unnecessary scaling terms and
substituting the kernels of the distribution functions from Equation 9.7 into the right
side of the factorization gives the following expression:
$$
f\left( M_i^* \mid X_i, Y_i \right) \times f\left( Y_i \mid X_i \right) \propto
\exp\!\left( -\frac{1}{2} \frac{\left( M_i^* - \left( \gamma_0 + \gamma_1 Y_i + \gamma_2 X_i \right) \right)^2}{\sigma^2_r} \right)
\times
\exp\!\left( -\frac{1}{2} \frac{\left( Y_i - \left( \beta_0 + \beta_1 X_i \right) \right)^2}{\sigma^2_{\varepsilon}} \right)
\tag{9.10}
$$
The selection model’s residual variance is fixed to 1, but I include σr2 in the equation to
maintain comparability to analogous expressions from Chapter 5.
Multiplying the two normal curve functions and performing algebra that combines
the component functions into a single distribution for Y gives a normal distribution with
two-part mean and variance expressions that depend on the focal and selection model
parameters. This should be a familiar refrain, as we’ve seen this composition several
times (e.g., Equation 5.12).
$$
\begin{aligned}
f\left( Y_i^{(mis)} \mid M_i^*, X_i \right) &= N_1\!\left( E\left( Y_i \mid M_i^*, X_i \right), \operatorname{var}\left( Y_i \mid M_i^*, X_i \right) \right) \\
E\left( Y_i \mid M_i^*, X_i \right) &= \operatorname{var}\left( Y_i \mid M_i^*, X_i \right) \times \left( \frac{\gamma_1 \left( M_i^* - \gamma_0 - \gamma_2 X_i \right)}{\sigma^2_r} + \frac{\beta_0 + \beta_1 X_i}{\sigma^2_{\varepsilon}} \right) \\
\operatorname{var}\left( Y_i \mid M_i^*, X_i \right) &= \left( \frac{\gamma_1^2}{\sigma^2_r} + \frac{1}{\sigma^2_{\varepsilon}} \right)^{-1}
\end{aligned}
\tag{9.11}
$$
Although the equation is not intuitive, you can see that both the mean and variance
contain a term that depends on the strength of the MNAR process, as encoded by γ1.
These terms vanish when γ1 = 0 (i.e., the mechanism is MCAR or MAR), in which case
the distribution of missing values simplifies to that of a conditionally MAR analysis.

Conversely, nonzero values of this coefficient induce a correction to both the center and
spread of the distribution.
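The correction is easy to verify numerically. This sketch implements the two-part mean and variance from Equation 9.11 with hypothetical parameter values and confirms that setting γ1 = 0 recovers the usual MAR conditional distribution:

```python
def missing_y_distribution(m_star, x, beta0, beta1, var_e,
                           gamma0, gamma1, gamma2, var_r=1.0):
    """Mean and variance of a missing Y under the diffuse selection model."""
    var = 1.0 / (gamma1 ** 2 / var_r + 1.0 / var_e)
    mean = var * (gamma1 * (m_star - gamma0 - gamma2 * x) / var_r
                  + (beta0 + beta1 * x) / var_e)
    return mean, var

# gamma1 = 0 (MCAR/MAR): the correction vanishes, leaving the usual
# conditional distribution with mean beta0 + beta1*x and variance var_e.
mean0, var0 = missing_y_distribution(m_star=0.8, x=15.0, beta0=30.0, beta1=-0.5,
                                     var_e=16.0, gamma0=-0.7, gamma1=0.0, gamma2=0.3)
print(mean0, var0)  # 22.5 16.0

# gamma1 != 0: the latent propensity shifts the center and shrinks the spread.
mean1, var1 = missing_y_distribution(m_star=0.8, x=15.0, beta0=30.0, beta1=-0.5,
                                     var_e=16.0, gamma0=-0.7, gamma1=0.2, gamma2=0.3)
print(mean1, var1)
```

Nothing about this sketch is estimable from data alone; it simply shows how the fitted selection parameters propagate into the distribution that generates imputations.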

Practical Recommendations
The literature suggests diffuse processes are somewhat harder to model (e.g., require
larger samples) and can induce greater nonresponse bias than focused ones (Du et al.,
2021; Gomer & Yuan, 2021; Zhang & Wang, 2012). As a general observation, including
too many predictors in the missingness model is generally less detrimental than ignor-
ing an important determinant of missing data (Du et al., 2021; Gomer & Yuan, 2021),
but overfitting can produce very noisy estimates and a nontrivial reduction in precision
and power. Unless you have a very large data set, adopting a judicious rather than inclu-
sive approach to selecting predictors for the missingness model is probably a good idea.
It is also important to reiterate that the model works best when the selection equation
includes variables that do not appear in the focal analysis (or vice versa).
Modeling a focused process is a good starting point for building a diffuse missing-
ness model that includes carefully chosen predictors. Ibrahim et al. (2005) suggested
adding regressors in a stepwise fashion, perhaps in conjunction with model selection
indices and individual influence diagnostics. I have found this approach to be very use-
ful, and the next section provides additional details on model comparisons. Carefully
inspecting parameter estimates and their standard errors (or posterior standard devia-
tions) is also important, because selection models often leave breadcrumbs that signal
a misspecification. For example, misspecified models are often accompanied by large
increases in some standard errors (e.g., values that are double or triple the size of those
from an MAR analysis) or an implausibly large R2 statistic from the missingness model.
The subsequent analysis examples illustrate a model-­building procedure that uses a
variety of inputs and criteria to identify a plausible selection model.

9.5 MODEL COMPARISONS AND INDIVIDUAL INFLUENCE DIAGNOSTICS

As mentioned previously, the γ1 coefficient from Equations 9.6 and 9.7 encodes the
strength of the MNAR process. This feature appears to offer a way to test the missing
data mechanism, as this slope should be significantly different from zero if an MNAR
process caused the missing data patterns. Unfortunately, significance tests of the probit
model parameter estimates are not trustworthy, nor are likelihood ratio tests comparing
nested models with and without the dependent variable in the selection equation (­Jansen
et al., 2006; Molenberghs & Kenward, 2007, Section 19.6; Verbeke & ­Molenberghs,
2000). Molenberghs, Beunckens, Sotto, and Kenward (2008) further cast shade on such
comparisons, showing that any MNAR model has a MAR counterpart that fits the data
equally well. Ibrahim et al. (2005) and others instead suggest the Akaike information cri-
terion (AIC; Akaike, 1974) and Bayesian information criterion (BIC; Schwarz, 1978) for
relative fit assessments, and published applications of such comparisons are common in

the literature (e.g., Gottfredson, Bauer, Baldwin, & Okiishi, 2014; Muthén, A ­ sparouhov,
Hunter, & Leuchter, 2011). These indices can sometimes distinguish among competing
models if you are willing to accept the validity of the selection model and its predictions about
the missing values with no input from the observed data.
I briefly describe the AIC and BIC, and point interested readers to the literature for
additional details (Dziak, Coffman, Lanza, Li, & Jermiin, 2020; Vrieze, 2012). The AIC
and BIC are

$$
\begin{aligned}
\text{AIC} &= -2LL + 2P \\
\text{BIC} &= -2LL + \ln\left( N \right) P
\end{aligned}
\tag{9.12}
$$

where LL is the log-­likelihood value of the fitted model and P is the number of esti-
mated parameters. The idea is to downgrade models that achieve good fit (i.e., attain a
lower –2LL value) by including too many parameters, and the rightmost term in each
expression is a penalty factor that inflates the deviance value to compensate for model
complexity. As such, lower values of both indices are better when considering compet-
ing models. The AIC and BIC need not agree, because they make different assumptions
and select on different features. Roughly speaking, the BIC attempts to identify the true
data-­generating model from a set of candidate models, whereas the AIC tries to select
a candidate model that isn’t necessarily correct but adequately describes the unknown
data-­generating function.
The difference between AIC and BIC values provides information about the relative
fit of two models (the models need not be nested). For example, using the indices to
compare models that assume a conditionally MAR versus MNAR process gives the fol-
lowing ΔAIC and ΔBIC values:

ΔAIC = AIC(MAR) − AIC(MNAR) = −2(LL(MAR) − LL(MNAR)) + 2(P(MAR) − P(MNAR))       (9.13)
ΔBIC = BIC(MAR) − BIC(MNAR) = −2(LL(MAR) − LL(MNAR)) + ln(N)(P(MAR) − P(MNAR))

Because lower values are better, positive ΔAIC or ΔBIC values favor the MNAR analysis,
and negative values support the conditionally MAR model. Using Bayes factors (akin
to the ratio of two likelihood values) as a guide, Raftery (1995) developed the following
effect-­size-like rules of thumb for the BIC difference: |ΔBIC| values from 0 to 2 constitute
“weak” evidence, differences between 2 and 6 reflect “positive” evidence, values from 6
to 10 are “strong,” and a difference greater than 10 is “very strong.” Anderson and Burn-
ham (2004) propose analogous rules for the AIC.
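Equation 9.13 and Raftery's thresholds can be scripted in a few lines. This is an illustrative Python sketch (not from the companion website), and the log-likelihoods and parameter counts are hypothetical:

```python
import math

def delta_ic(ll_mar, ll_mnar, p_mar, p_mnar, n):
    """Equation 9.13: positive differences favor the MNAR model."""
    d_aic = -2.0 * (ll_mar - ll_mnar) + 2.0 * (p_mar - p_mnar)
    d_bic = -2.0 * (ll_mar - ll_mnar) + math.log(n) * (p_mar - p_mnar)
    return d_aic, d_bic

def raftery_label(delta_bic):
    """Raftery's (1995) rules of thumb for the absolute BIC difference."""
    a = abs(delta_bic)
    if a <= 2:
        return "weak"
    if a <= 6:
        return "positive"
    if a <= 10:
        return "strong"
    return "very strong"

# Hypothetical fits: the MNAR model spends two extra parameters
d_aic, d_bic = delta_ic(-2474.0, -2466.5, 23, 25, 437)
print(round(d_aic, 2), round(d_bic, 2), raftery_label(d_bic))  # 11.0 2.84 positive
```

Note how the two indices can disagree in strength: the same improvement in fit that looks decisive on the ΔAIC metric barely clears the "positive" threshold on the ΔBIC metric because of the larger ln(N) penalty.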
It is important to emphasize that model comparisons with ΔAIC or ΔBIC require
strong and untestable assumptions about the missing values, and you must be will-
ing to accept the MNAR model’s propositions with blind faith. Moreover, because the
observed data may contain relatively little information about a given model compari-
son, ΔAIC and ΔBIC are especially susceptible to extreme or unusual data patterns.
This can produce a problematic situation where the presence or absence of just one
observation can cause the indices to favor one model over another. Interestingly, it
isn’t necessarily participants with missing data that exert the greatest leverage on a
model comparison, as an unusual pattern of observed data can do the same. For exam-
ple, Kenward (1998) reported an example from the biostatistics literature involving a
longitudinal study of dairy cow milk yields. He found that including two sick cows
with complete data and qualitatively different trajectories produced a comparison that
favored a selection model, whereas excluding these two individuals supported a condi-
tionally MAR analysis.
A number of biostatistics papers describe sensitivity procedures designed to iden-
tify data records that unduly influence the results of a model comparison (­Beunckens
et al., 2007; Kenward, 1998; Molenberghs & Verbeke, 2001; Molenberghs, Verbeke,
Thijs, Lesaffre, & Kenward, 2001; Thijs, Molenberghs, & Verbeke, 2000; Verbeke,
­Molenberghs, Thijs, Lesaffre, & Kenward, 2001). One such approach is to iteratively
fit two models of interest after removing one person at a time from the data. This jack-
knife strategy produces individual influence diagnostics that measure the change in
the ΔAIC or ΔBIC that results from excluding a participant from the analysis (Sterba &
­Gottfredson, 2014). These individual-­level difference values are

ΔAICi = ΔAIC − ΔAIC(−i)       (9.14)
ΔBICi = ΔBIC − ΔBIC(−i)

where the (–i) subscript indicates that individual i is excluded from the ΔAIC or ΔBIC
computation. Selection models are computationally intensive, so refitting multiple mod-
els N times is a substantial barrier to implementing this approach in practice.
Sterba and Gottfredson (2014) propose noniterative approximations to ΔAICi and
ΔBICi that are simple by-­products of maximum likelihood estimation. These approxi-
mate influence diagnostics are

Δ̂AICi = −2(LLi(MAR) − LLi(MNAR))       (9.15)
Δ̂BICi = −2(LLi(MAR) − LLi(MNAR)) + ln(N / (N − 1))(P(MAR) − P(MNAR))
where LLi(MAR) and LLi(MNAR) are individual i’s contributions to the sample log-­likelihood
values (computed by substituting a participant’s data and the maximum likelihood esti-
mates into the observed-­data log-­likelihood expression from Equation 3.5). Software
packages that implement maximum likelihood estimation routinely save the necessary
quantities upon request, making these diagnostics simple to compute. If the ΔAIC (or
ΔBIC) is positive (i.e., the observed data favor the MNAR model), a Δ̂AICi (or Δ̂BICi) value
that exceeds ΔAIC (or ΔBIC) is influential in the sense that excluding that participant’s
data could switch the sign of the model comparison. Conversely, if the ΔAIC (or ΔBIC) is
negative (i.e., the observed data favor the conditionally MAR analysis), a Δ̂AICi (or Δ̂BICi)
value more negative than ΔAIC (or ΔBIC) is similarly influential. The analysis examples
in the next section illustrate these diagnostics.
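Given the two vectors of casewise log-likelihood contributions, Equation 9.15 reduces to elementwise arithmetic. The following is an illustrative Python sketch (not from the companion website); the per-person log-likelihood contributions and parameter counts are hypothetical:

```python
import math

def influence_diagnostics(lli_mar, lli_mnar, p_mar, p_mnar):
    """Equation 9.15: approximate per-person changes in the dAIC and dBIC.

    lli_mar and lli_mnar hold each person's contribution to the sample
    log-likelihood under the two fitted models; N is the list length.
    """
    n = len(lli_mar)
    d_aic_i = [-2.0 * (a - b) for a, b in zip(lli_mar, lli_mnar)]
    shrink = math.log(n / (n - 1)) * (p_mar - p_mnar)
    d_bic_i = [d + shrink for d in d_aic_i]
    return d_aic_i, d_bic_i

def influential(values, delta):
    """Flag cases whose exclusion could flip the sign of the comparison."""
    if delta > 0:
        return [i for i, v in enumerate(values) if v > delta]
    return [i for i, v in enumerate(values) if v < delta]

# Tiny hypothetical example: person 2 fits the MNAR model far better
lli_mar = [-5.1, -4.8, -9.6, -5.0]
lli_mnar = [-5.0, -4.9, -6.2, -5.1]
d_aic_i, d_bic_i = influence_diagnostics(lli_mar, lli_mnar, 23, 25)
print(influential(d_bic_i, 4.49))  # [2]: removing person 2 could flip the comparison
```

In practice the casewise log-likelihood vectors would be exported from the estimation software rather than typed in by hand.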

9.6 SELECTION MODEL ANALYSIS EXAMPLES

I use the psychiatric trial data on the companion website to illustrate a sensitivity analy-
sis for a multiple regression model with an outcome that could be MNAR. The data,
which were collected as part of the National Institute of Mental Health Schizophre-
nia Collaborative Study, comprise repeated measurements of illness severity ratings
from 437 individuals. In the original study, participants were assigned to one of four
experimental conditions (a placebo condition and three drug regimens), but the data
collapse these categories into a dichotomous treatment indicator (DRUG = 0 for the
placebo group, and DRUG = 1 for the combined medication group). The outcome was
measured in half-point increments ranging from 1 (normal, not at all ill) to 7 (among the
most extremely ill). The researchers collected a baseline measure of illness severity prior
to randomizing participants to conditions, and they obtained follow-­up measurements
1 week, 3 weeks, and 6 weeks later. The overall missing data rates for the repeated mea-
surements were 1, 3, 14, and 23%, and these percentages differ by treatment condition;
19 and 35% of the placebo group scores were missing at the 3-week and 6-week assess-
ments, respectively, versus 13 and 19% for the medication group. The data set and the
variable definitions are given in the Appendix, and other illustrations with these data
are found in the literature (Demirtas & Schafer, 2003; Hedeker & Gibbons, 1997, 2006).
The focal analysis is a linear regression model with baseline severity ratings, a gen-
der dummy code (0 = female, 1 = male), and the treatment assignment indicator predict-
ing 6-week follow-­up scores.

SEVERITY6i = β0 + β1(DRUGi) + β2(SEVERITY0i − μ2) + β3(MALEi − μ3) + εi       (9.16)

Centering baseline scores and the male dummy code at their grand means facilitates
interpretation by defining β0 and β1 as the placebo group average and group mean differ-
ence, respectively (marginalizing over the covariates). This model is an ideal candidate
for a sensitivity analysis, because focused and diffuse MNAR mechanisms are plausible
explanations for dropout. To this end, I fit a series of selection models that invoked
different assumptions about the missingness process, and I used simple model checks
to assess the validity of each analysis. These analyses rely heavily on the normal distri-
bution assumption, so it is worth noting that the observed data distribution is slightly
non-­normal, with skewness and excess kurtosis equal to 0.21 and –0.94, respectively.
This small data set offers limited choices for auxiliary variables, but the illness
severity ratings at the 1-week and 3-week follow-­up assessments are excellent candi-
dates, because both have salient semipartial correlations with the dependent variable (r
= .40 and .61, respectively). Following the procedure from Section 5.8, I used a pair of
linear regression models to link the auxiliary variables to the focal variables:

SEVERITY1i = γ01 + γ11(SEVERITY6i) + γ21(DRUGi) + γ31(SEVERITY0i) + γ41(MALEi) + r1i       (9.17)
SEVERITY3i = γ02 + γ12(SEVERITY1i) + γ22(SEVERITY6i) + γ32(DRUGi) + γ42(SEVERITY0i) + γ52(MALEi) + r2i

I omitted the auxiliary variables from the missingness model to reduce problematic
dependencies across regression equations (as noted previously, conditioning on differ-
ent variables in the focal and selection models helps estimation). In some cases, it may
be necessary to include one or more auxiliary variables as regressors in the selection
equation. One way to mitigate collinearity problems is to use principal components
analysis to reduce a large superset of auxiliary variables into one or two composites that
link to the focal model variables (Howard et al., 2015), as such a composite would likely
provide unique information that enhances estimability even if one or two of its constitu-
ents appear in the missingness model.
I used maximum likelihood estimation for the examples, and the companion web-
site also provides scripts for Bayesian estimation and model-based multiple imputation.
Selection models can be difficult to estimate, because the observed data may contain
very little information about the missingness model. Monitoring convergence is espe-
cially important in this context, as invalid solutions are common. For the maximum
likelihood analyses, I refit the model with several sets of random starting values and
examined the final log-­likelihood values to verify that different runs produced the same
solution. For a Bayesian analysis, specifying multiple MCMC chains with different start-
ing values provides analogous information, along with convergence diagnostics such as
trace plots and the potential scale reduction factors (Gelman & Rubin, 1992). For both
modeling frameworks, slow convergence is often a signal that the model is too complex
for the data. Finally, it is also important to check whether the MNAR results are unduly
influenced by unusual data records, and I use individual influence diagnostics for this
purpose (Sterba & Gottfredson, 2014). Analysis scripts are available on the companion
website.

Analysis 1: Conditionally MAR Process


The conditionally MAR mechanism is a natural starting point for a sensitivity analysis.
The appeal of this assumption is especially evident now, as there is no need to formu-
late or model an unknown and potentially complex missingness process. To begin, I
estimated the analysis model with and without the auxiliary variables, as this, too, com-
pares alternative assumptions about missingness. To achieve comparable log-­likelihood
values, I included the auxiliary variables in both models but fixed their regression slopes
to zero in one of the analyses. Both analyses also featured an empty probit regression to
ensure comparability with the ensuing selection models.
Table 9.1 shows the parameter estimates and standard errors. Conditioning on the
auxiliary variables had a substantial impact on key parameter estimates; the intercept
coefficients (placebo group averages) differed by nearly three-­fourths of a standard error
unit, and the slope coefficients (medication group mean differences) differed by more
than one standard error. The auxiliary variable model also produced dramatically lower
(better) AIC and BIC values. Although the natural inclination is to favor the analysis
with auxiliary variables, there is no way to know for sure which is more correct, as con-
ditioning on the wrong set of variables can exacerbate nonresponse bias, at least hypo-
thetically (Thoemmes & Rose, 2014). Nevertheless, the differences are consistent with
the shift from an MNAR-by-­omission mechanism to a more MAR-like process. Given

TABLE 9.1. Parameter Estimates Assuming MAR Processes


MAR MAR + AV
Effect Est. SE Est. SE
Focal analysis model
Intercept (β0) 4.29 0.17 4.41 0.16
DRUG (β1) –1.24 0.19 –1.46 0.18
SEVERITY0 (β2) 0.31 0.09 0.28 0.09
MALE (β3) 0.21 0.15 0.23 0.15
Residual Var. (σε2) 1.87 0.15 2.00 0.16
R2   .16 .04   .19 .04

Missingness model
Intercept (γ0) –0.73 0.07 –0.73 0.07
R2 0 0 0 0

Model fit
AIC (P) 5484.44 (14) 4994.00 (23)
BIC (P) 5541.56 (14) 5087.85 (23)

their apparent importance, I include the auxiliary variable regressions in the subsequent
selection models.

Analysis 2: Focused MNAR Process


Next, I fit a selection model for a focused MNAR process where the dependent variable
predicted missingness.

Mi* = γ0 + γ1(SEVERITY6i) + ri       (9.18)

As always, the probit model’s threshold and residual variance are fixed at 0 and 1, respec-
tively. The leftmost columns in Table 9.2 show the estimates and standard errors, which
are effectively equivalent to the conditionally MAR analysis with auxiliary variables.
Model comparisons using ΔAIC or ΔBIC are not very useful here given that the esti-
mates are effectively identical to those in the right panel of Table 9.1. Returning to the
path diagram in Figure 9.1a, the explanatory variable influences missingness indirectly
via the dependent variable, which essentially acts as a mediator. If a focused missingness
model is reasonable, the indirect pathway alone should reproduce the bivariate associa-
tion between each predictor and the missing data indicator, whereas inaccurate predic-
tions about this relation signal the need for a diffuse process with a direct pathway (e.g.,
the model in Figure 9.1b). For this analysis, the cell proportions in a 2 × 2 contingency
table encode the bivariate association between the treatment assignment (or gender
dummy code) and the missing data indicator.

TABLE 9.2. Selection Model Parameter Estimates Assuming MNAR Processes


Focused Diffuse + DRUG Diffuse + MALE
Effect Est. SE Est. SE Est. SE
Focal analysis model
Intercept (β0) 4.40 0.17 4.26 0.17 4.27 0.17
DRUG (β1) –1.45 0.18 –1.40 0.18 –1.40 0.18
SEVERITY0 (β2) 0.28 0.09 0.27 0.09 0.27 0.09
MALE (β3) 0.23 0.15 0.26 0.15 0.25 0.15
Residual Var. (σε2) 2.00 0.16 2.06 0.17 2.06 0.17
R2   .18 .04   .17 .04   .17 .04

Missingness model
Intercept (γ0) –0.75 0.21 0.44 0.35 0.37 0.32
SEVERITY6 (γ1) 0.01 0.06 –0.19 0.08 –0.18 0.08
DRUG (γ2) — — –0.78 0.20 –0.76 0.20
MALE (γ3) — — — — –0.13 0.14
R2 < .01   .002   .11 .06   .11 .06

Model fit
AIC (P) 4995.64 (24) 4981.00 (25) 4982.16 (26)
BIC (P) 5093.56 (24) 5083.00 (25) 5088.24 (26)

To illustrate, consider the treatment assignment variable, which had 35 and 19%
missing data rates for the placebo and medication conditions, respectively. To check
the selection model’s predictions about these proportions, I applied the procedure for
computing the indirect effect from a mediation model with a binary outcome (Muthén
et al., 2016, p. 310). Using generic notation, the expected value and variance of the latent
response variable at a particular value of the explanatory variable (e.g., X = 0 or 1) is as
follows:

E(Mi* | Xi) = γ0 + γ1(β0 + β1Xi) = γ0 + γ1E(Yi | Xi)       (9.19)
var(Mi* | Xi) = γ1²σε² + 1

The predicted probability of missing data for a particular value of the explanatory vari-
able via the indirect pathway through the dependent variable is an area under the following
normal curve (Muthén et al., 2016, Eq. 8.6):
 *
( 
 τ − E Mi | X i 
Pr ( Mi = 1| X i ) = 1 − Φ 

)*
 E Mi | X i 

( )
 = Φ  (9.20)

 (
 var Mi* | X i 
 )
 var Mi* | X i 
  ( )
where Φ(·) is the cumulative distribution function of the standard normal curve (i.e., a
function that gives the area below the z-score inside the parentheses). Substituting the
parameter estimates and dummy codes into these expressions gives model-­predicted
missingness rates of 24 and 23% for the placebo and medication conditions, respectively.
These rather dramatic mispredictions signal a misspecification that could be remedied
by modeling a diffuse process with a direct effect for the treatment indicator. The miss-
ing data rates for males and females were not nearly as different, and the model’s predic-
tions about these proportions were far more accurate.
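The two expressions are straightforward to evaluate by hand. The sketch below (an illustrative Python fragment, not from the companion website) computes Φ from the error function and plugs in the rounded estimates from the focused model in Table 9.2, so it only approximately reproduces the 24% and 23% predictions discussed above:

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predicted_missing_rate(g0, g1, b0, b1, resid_var, x):
    """Equations 9.19-9.20: Pr(M = 1 | X) through the indirect pathway.

    E(M* | X) = g0 + g1 * (b0 + b1 * x); var(M* | X) = g1**2 * resid_var + 1;
    the probit threshold is fixed at 0.
    """
    mean = g0 + g1 * (b0 + b1 * x)
    sd = math.sqrt(g1 ** 2 * resid_var + 1.0)
    return norm_cdf(mean / sd)

# Rounded estimates from the focused selection model (Table 9.2)
g0, g1 = -0.75, 0.01                    # missingness model
b0, b1, resid_var = 4.40, -1.45, 2.00   # focal model for SEVERITY6
for drug in (0, 1):
    print(drug, round(predicted_missing_rate(g0, g1, b0, b1, resid_var, drug), 3))
```

Comparing these predicted rates against the observed 35% and 19% rates makes the misfit of the focused model immediately visible.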

Analysis 3: Diffuse MNAR Processes


The final set of analyses examined a small collection of diffuse mechanisms. Specifying
a selection model for a diffuse MNAR process is a daunting task, because myriad con-
figurations of covariates could appear in the missingness model. While it is generally
less harmful to include unnecessary predictors than it is to ignore important determi-
nants of missingness, overfitting can introduce substantial noise and reduce precision
and power. Moreover, a simpler missingness model that includes a subset of the analysis
variables facilitates estimation, as you want to avoid situations where the focal regres-
sion and selection equation condition on too many of the same variables. Judiciously
building the missingness model by adding regressors in a stepwise fashion is a good
strategy, starting with direct effects for explanatory variables.
Adding the treatment assignment indicator to the missingness model is an obvious
starting point, because its indirect effect alone did a poor job of predicting the observed
missing data rates. Accordingly, I started by modeling a diffuse process with the depen-
dent variable and treatment assignment indicator as determinants of missingness as
follows:

Mi* = γ0 + γ1(SEVERITY6i) + γ2(DRUGi) + ri       (9.21)

The middle panel of Table 9.2 shows the parameter estimates and standard errors from
the analysis. The diffuse model produced some important and noticeable differences.
Relative to the conditionally MAR analysis, the intercept (placebo group average) was
lower by nearly nine-­tenths of a standard error unit, and the treatment group difference
was smaller (less negative) by an amount equal to one-third of a standard error. There
are no obvious indications that the model is unreasonable, and the observed and pre-
dicted missing data rates for the treatment conditions were a close match.
Looking at the information criteria at the bottom of the table, both the AIC and BIC
favored the selection model over an analysis that assumes MAR data; the differences
between pairs of information criteria values are ΔAIC = AICMAR – AICMNAR = 12.65 and
ΔBIC = BICMAR – BICMNAR = 4.49 (see Equation 9.13). The ΔBIC represents “positive” evi-
dence favoring the selection model according to Raftery’s (1995) effect-­size-like rules of
thumb. It is important to reiterate that model comparisons with ΔAIC or ΔBIC require
untestable assumptions about the missing values, and you must be willing to accept the
MNAR model’s propositions a priori. Following verbiage from Verbeke, Lesaffre, and
Spiessens (2001, p. 426), we could say there is positive evidence for nonrandom dropout,
conditional on the validity of the selection model.

A concern with using information criteria for model comparisons is that a small
number of outliers (or perhaps even a single individual) may unduly influence the con-
clusions. To explore this possibility, I used the expressions from Equation 9.15 to com-
pute individual influence diagnostics. As a reminder, these quantities approximate the
change in the model comparison (ΔAIC or ΔBIC) that would result from excluding a
single participant’s data from both analyses (Sterba & Gottfredson, 2014). In this case,
a positive value of Δ̂AICi (or Δ̂BICi) that exceeds ΔAIC (or ΔBIC) indicates that exclud-
ing a participant’s data could switch the sign of the model comparison to favor the MAR
analysis. Figure 9.6 is an index plot that graphs the Δ̂BICi values for each participant.
That graph reveals no outliers, thus lending credence to the conclusion that the diffuse
selection model is plausible for these data.
Next, I added the gender dummy code to the missingness model as follows:

Mi* = γ0 + γ1(SEVERITY6i) + γ2(DRUGi) + γ3(MALEi) + ri       (9.22)

The rightmost set of columns in Table 9.2 show the parameter estimates and standard
errors, which were effectively equivalent to those of the previous analysis. The AIC and
BIC were both slightly higher, indicating that the additional complexity is not war-
ranted. Finally, I did not consider baseline severity scores as a predictor of missingness,
because this variable’s bivariate association with the missing data indicator was nearly
zero.

FIGURE 9.6. Index plot displaying the influence diagnostics for each participant. The graph
reveals no outliers that exceed ΔBIC.

When specifying a selection model, you not only need to include the right set of
regressors in the missingness model, but you also must get their functional forms cor-
rect. Following Ibrahim et al. (2005), I investigated a final missingness model with
higher-­order interaction term between the treatment assignment indicator and depen-
dent variable. Such a process could occur, for example, if placebo group participants
with the most acute symptoms quit the study to seek treatment elsewhere, whereas treat-
ment group participants with the least acute symptoms quit, because they’ve achieved
adequate relief.

Mi* = γ0 + γ1(SEVERITY6i) + γ2(DRUGi) + γ3(SEVERITY6i)(DRUGi) + ri       (9.23)

The model exhibited two telltale symptoms of misspecification: very large increases
in the standard errors (e.g., the treatment slope standard error increased by 40%) and
a missingness model with an implausibly large R2 statistic near .70. Additionally, the
estimators struggled to achieve convergence and required long iterative sequences (e.g.,
MCMC required more than 100,000 burn-in iterations). All evidence suggests that the
interactive model is either not plausible or the sample size isn’t large enough to support
estimation. In either case, the results are not credible and should not be interpreted.

Summary
The sensitivity analysis identified a defensible selection model that included the depen-
dent variable and treatment assignment indicator as predictors of missingness. Con-
ducting simple model checks and looking for signs of misspecification were instrumen-
tal in ruling out candidate missingness models with dubious support from the data.
The selection model produced noticeable differences in some key parameters; relative
to the MAR analysis, the intercept (placebo group average) was lower by nearly nine-­
tenths of a standard error unit, and the treatment group difference was smaller by about
one-third of a standard error. Nontrivial differences like this might seem troubling, but
the results simply reflect two different, plausible assumptions about the missing data.
To emphasize, there is no way of knowing whether the MNAR analysis is better than a
simpler analysis that assumes a conditionally MAR mechanism. Both sets of results are
defensible and could (and should) be presented in a research report.

9.7 PATTERN MIXTURE MODELS FOR MULTIPLE REGRESSION

Whereas selection models are consistent with a mediated mechanism in which quanti-
tative differences on the analysis variables predict missingness via direct and indirect
effects, pattern mixture models are aligned with moderated processes in which missing
data patterns form qualitatively different subgroups with distinct parameter values (see
the factorization in Equation 9.4 and the path diagrams in Figure 9.2). I return to the
bivariate example with perceived control over pain predicting depression as a substan-
tive backdrop for describing the model (I generically refer to these variables as X and Y,
respectively).

Examining a pattern mixture model for bivariate normal data sets up its application
to multiple regression. The data feature two missing data patterns: One group has com-
plete data and indicator scores of M = 0, and the second group has missing depression
scores and M = 1. The model specifies a unique mean vector and variance–­covariance
matrix for each pattern, as follows:

[Xi(0); Yi(0)] ~ N2([μX(0); μY(0)], [σX²(0), σXY(0); σYX(0), σY²(0)])       (9.24)
[Xi(1); Yi(1)] ~ N2([μX(1); μY(1)], [σX²(1), σXY(1); σYX(1), σY²(1)])

The ultimate quantities of interest—parameters of the marginal distribution that average over the missing data patterns—are weighted combinations of the group-specific
parameters. For example, the dependent variable mean is

μY = π(0)μY(0) + π(1)μY(1)       (9.25)

where π(0) and π(1) are weights equal to the proportion of cases in each pattern.
The idea behind a pattern mixture model is relatively straightforward: Estimate the
parameters of interest within each missing data pattern, then average over the distribu-
tion of missingness by computing a weighted composite of the group-­specific estimates.
However, the example highlights that people with missing depression scores have no
data with which to estimate the mean and variance of Y and its covariance with X.
The regression analysis similarly features three inestimable parameters. To use the pat-
tern mixture model, you need to either provide values for the inestimable quantities or
impose so-­called identifying restrictions that borrow information by setting the ines-
timable parameters equal to functions of the estimable ones (Kenward & Molenberghs,
2014; Little, 1993, 1994). I describe the former strategy here and illustrate identifying
restrictions later in Section 9.13.
A straightforward way to specify a pattern mixture model for multiple regression is
to cast missing data indicators as dummy codes that moderate the influence of one or
more explanatory variables on the outcome (Hedeker & Gibbons, 1997; Hogan & Laird,
1997a, 1997b). Applied to the bivariate data, this specification corresponds to a familiar
moderated regression model with a focal predictor X (e.g., perceived control over pain),
a binary missing data indicator M, and the product of the two (see Figure 9.2b).

Yi = β0(0) + β0(diff)Mi + β1(0)Xi + β1(diff)XiMi + εi       (9.26)

I use superscripts to emphasize that these are not the coefficients of substantive inter-
est. The lower-order terms are conditional effects that depend on scaling: β0(0) is the
predicted outcome score for a participant with complete data and X = 0, β1(0) is a simple
slope that reflects the influence of X in the M = 0 pattern, and β0(diff) is the pattern mean
difference at X = 0. Finally, the β1(diff) coefficient is the slope difference for the group with
missing data. Importantly, β0(diff) and β1(diff) are inestimable, because the dependent vari-
able is missing for the M = 1 group. The next section describes a simple effect-­size-based
strategy for specifying these coefficients.

To make the previous model more concrete, I used computer simulation to create
an artificial data set based on the perceived control over pain and depression variables
from the chronic pain data. I deleted 25% of the artificial depression scores to mimic a
process where participants with missing data form a subpopulation with higher depres-
sion scores at the center of the perceived control distribution and a stronger bivariate
association between the two variables. To maintain comparability with earlier figures,
I used the same population parameters that created the bivariate scatterplots for the
selection model examples, and I chose pattern-specific coefficients that maintained the
same overall strength of the association between the analysis variables and missingness.
Figure 9.7 shows the resulting scatterplot of the data, with gray circles representing
complete cases and black crosshairs denoting partial data records with missing depres-
sion scores. The contour rings convey the perspective of a drone hovering over the peak
of the bivariate normal population distribution, with smaller contours denoting higher
elevation (and vice versa). The dashed and dotted lines are the true pattern-specific

FIGURE 9.7. Scatterplot of a diffuse MNAR process from a pattern mixture model where par-
ticipants with missing data form a subpopulation with higher depression scores at the center of
the perceived control distribution and a stronger bivariate association between the two variables.
Gray circles represent complete cases, and black crosshairs denote partial data records with
missing depression scores. The dashed and dotted lines are the true pattern-specific regressions,
and the solid line is the overall (marginal) population function that averages over missing data
patterns.

regressions, and the solid line is the overall (marginal) population function that aver-
ages over missing data patterns, as follows:

β0 = π(0)β0(0) + π(1)(β0(0) + β0(diff)) = π(0)β0(0) + π(1)β0(1)       (9.27)
β1 = π(0)β1(0) + π(1)(β1(0) + β1(diff)) = π(0)β1(0) + π(1)β1(1)

Figure 9.7 depicts a systematic process where missing values are primarily located in
the upper left and lower right quadrants of the contour plot. An analysis based on the
conditionally MAR assumption is incapable of targeting these regions, as its imputations
would disperse around a single, biased regression line.
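The pooling in Equation 9.27 amounts to a probability-weighted average of the pattern-specific coefficients. The sketch below is an illustrative Python fragment (not from the companion website), and the pattern proportion, coefficients, and difference terms are hypothetical:

```python
def marginal_coefs(prop_missing, b0_complete, b0_diff, b1_complete, b1_diff):
    """Equation 9.27: probability-weighted marginal intercept and slope.

    The M = 1 pattern's coefficients equal the complete-data values plus
    the (inestimable, user-specified) difference terms.
    """
    p0, p1 = 1.0 - prop_missing, prop_missing
    b0 = p0 * b0_complete + p1 * (b0_complete + b0_diff)
    b1 = p0 * b1_complete + p1 * (b1_complete + b1_diff)
    return b0, b1

# Hypothetical pattern-specific values with 25% missing data
b0, b1 = marginal_coefs(0.25, 16.0, 2.0, -0.40, -0.10)
print(round(b0, 3), round(b1, 3))  # 16.5 -0.425
```

Setting both difference terms to zero recovers the complete-data coefficients, which is the pattern mixture analogue of assuming a conditionally MAR process.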
The moderated regression model in Equation 9.26 is consistent with a diffuse
MNAR process where both analysis variables uniquely correlate with missingness (see
Figure 9.2b). In contrast, the model for a focused process (shown in Figure 9.2a) features
an inestimable pattern mean difference and a common regression slope.

Yi = β0(0) + β0(diff)Mi + β1Xi + εi       (9.28)

FIGURE 9.8. Scatterplot of a focused MNAR process from a pattern mixture model where
participants with missing data form a subpopulation with higher depression scores. Gray circles
represent complete cases, and black crosshairs denote partial data records with missing depres-
sion scores. The dashed and dotted lines are the true pattern-specific regressions, and the solid
line is the overall (marginal) population function that averages over missing data patterns.

To illustrate, Figure 9.8 shows the scatterplot of an artificial data set where the 25% of
participants with missing data form a subpopulation with a much higher depression
average. As before, gray circles represent the complete cases, black crosshairs denote the
partial data records with missing depression scores, the dashed and dotted lines are the
true group-­specific regressions, and the solid line is the overall (marginal) population
function. This model is equivalent in spirit to the old idea of filling in the data under an
MAR assumption and adding a constant to each impute (Rubin, 1987, p. 22; van Buuren,
2012, p. 92).
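That old idea is easy to sketch. The snippet below is an illustrative Python fragment (not the book's procedure): it draws stochastic regression imputations from a hypothetical complete-case model for the missing depression scores and then shifts each impute by a constant equal to the assumed pattern mean difference:

```python
import random

random.seed(1)  # reproducible draws

def impute_with_shift(x_mis, b0, b1, resid_sd, shift):
    """MAR-style stochastic regression imputations plus a constant shift
    (the assumed pattern mean difference, beta0_diff)."""
    return [b0 + shift + b1 * x + random.gauss(0.0, resid_sd)
            for x in x_mis]

# Hypothetical complete-case regression of depression on perceived control,
# applied to three cases with missing depression scores
imputes = impute_with_shift([12.0, 20.0, 31.0], b0=26.0, b1=-0.40,
                            resid_sd=1.4, shift=2.0)
print([round(v, 1) for v in imputes])
```

With shift = 0 the routine collapses to ordinary stochastic regression imputation under the conditionally MAR assumption; a positive shift relocates every impute upward by the same amount, mirroring the focused process in Figure 9.8.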

Specifying Values for Inestimable Parameters


The previous pattern mixtures are fairly pedestrian regression models, but they are
uniquely challenging to implement, because you need to provide values for inesti-
mable quantities (or impose identifying restrictions that set these parameters equal to
functions of the estimable ones). Specifying reasonable values is vital, because these
nonidentified parameters determine the strength of the MNAR process. This section
describes a simple approach that uses off-the-shelf effect size benchmarks as a heuristic
for specifying these quantities. Related strategies are described in the literature (e.g.,
Little, 2009, p. 428; Rubin, 1987, p. 22; van Buuren, 2012, p. 92).
To begin, consider the inestimable pattern mean difference (the β0(diff) coefficient in
Equations 9.26 and 9.28). Centering the predictor variable facilitates interpretation by
defining this coefficient as the pattern mean difference at the regressor’s mean. Dividing
β0(diff) by the dependent variable’s standard deviation (or residual standard deviation)
gives a standardized mean difference similar to Cohen’s (1988) d effect size, the bench-
mark values for which are |d| > 0.20 (small), 0.50 (medium), and 0.80 (large). Reversing
that flow, you can specify β0(diff) on a standardized metric and solve for the raw score
mean difference as follows:

$\beta_0^{(\text{diff})} = d \times \sigma_Y \quad \text{or} \quad d \times \sqrt{\sigma_\varepsilon^2}$  (9.29)

Continuing with the chronic pain example, if you believed that the missing depression
scores have a higher mean than the observed data, you could set the standardized dif-
ference to a value like +0.20 or +0.50 and solve for β0(diff). A preliminary analysis based
on the MAR assumption can provide an estimate of the standard deviation, or you could
use the residual standard deviation to constrain β0(diff) during estimation.
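As a concrete illustration of Equation 9.29, the following Python sketch converts a standardized difference into a raw-score mean difference; the standard deviation value is hypothetical, standing in for an MAR-based estimate:

```python
def beta0_diff(d, sigma):
    # Equation 9.29: raw-score pattern mean difference implied by a
    # standardized difference d; sigma is the outcome's standard
    # deviation (or the residual standard deviation).
    return d * sigma

sigma_y = 1.50  # hypothetical MAR-based estimate for illustration
print(round(beta0_diff(0.20, sigma_y), 2))  # small standardized difference
print(round(beta0_diff(0.50, sigma_y), 2))  # medium standardized difference
```

The same function applies when you prefer the residual metric; simply pass the square root of the residual variance as sigma.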
A similar strategy can provide a value for the inestimable slope difference in Equa-
tion 9.26. To begin, consider the situation where X is binary (e.g., 0 = placebo, 1 = medi-
cation), in which case, β1(0) is the group mean difference for participants with complete
data. The inestimable β1(diff) term is the additional mean difference for the M = 1 pattern.
Standardizing the net mean difference by dividing by the standard deviation or residual
standard deviation is equivalent to subtracting pattern-­specific Cohen’s d values:

$d_\Delta = d^{(1)} - d^{(0)} = \frac{\beta_1^{(0)} + \beta_1^{(\text{diff})}}{\sigma_Y} - \frac{\beta_1^{(0)}}{\sigma_Y} = \frac{\beta_1^{(\text{diff})}}{\sigma_Y}$  (9.30)

which leads to the following expression for β1(diff):

$\beta_1^{(\text{diff})} = d_\Delta \times \sigma_Y \quad \text{or} \quad d_\Delta \times \sqrt{\sigma_\varepsilon^2}$  (9.31)

For example, setting dΔ equal to –0.10 implies that the group mean difference for the
incomplete cases is smaller (more negative) by an amount equal to one-tenth of a stan-
dard deviation unit.
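The two formulas can be checked against one another in a few lines of Python; all numeric values below are hypothetical and chosen only to mirror the binary-predictor scenario:

```python
sigma_y = 1.50          # hypothetical MAR-based standard deviation
beta1_complete = -0.60  # hypothetical group mean difference for completers

def beta1_dropout(d_delta):
    # Equation 9.31: the dropout pattern's group mean difference equals
    # the completers' difference plus d_delta standard deviation units.
    return beta1_complete + d_delta * sigma_y

b1_drop = beta1_dropout(-0.10)
# Equation 9.30 in reverse: standardizing the two pattern-specific
# differences and subtracting recovers d_delta.
d_delta_check = b1_drop / sigma_y - beta1_complete / sigma_y
```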
When X is a quantitative variable, β1(0) is the focal regression slope for participants
with complete data, and β1(diff) is the slope difference for the M = 1 pattern. Standardizing
the slope difference by dividing by the standard deviation or residual standard deviation
is equivalent to subtracting pattern-­specific standardized coefficients or beta weights:
$d_\Delta = \beta_z^{(1)} - \beta_z^{(0)} = \left(\beta_1^{(0)} + \beta_1^{(\text{diff})}\right)\frac{\sigma_X}{\sigma_Y} - \beta_1^{(0)}\frac{\sigma_X}{\sigma_Y} = \frac{\beta_1^{(\text{diff})}\sigma_X}{\sigma_Y}$  (9.32)
and solving for β1(diff) gives the following expression for the slope difference:
$\beta_1^{(\text{diff})} = d_\Delta \frac{\sigma_Y}{\sigma_X} \quad \text{or} \quad d_\Delta \frac{\sqrt{\sigma_\varepsilon^2}}{\sigma_X}$  (9.33)
Applied to the chronic pain example, setting dΔ equal to +0.25 means that, among par-
ticipants with missing data, a one standard deviation increase in perceived control over
pain increases the predicted depression score by one-­fourth of a standard deviation
more than that of the complete data.
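For the quantitative-predictor case, a parallel sketch applies Equation 9.33 and verifies it against Equation 9.32; the two standard deviations are hypothetical placeholders:

```python
def slope_diff(d_delta, sigma_y, sigma_x):
    # Equation 9.33: raw slope difference implied by a standardized
    # (beta-weight) difference between the two patterns.
    return d_delta * sigma_y / sigma_x

b1_diff = slope_diff(0.25, sigma_y=1.50, sigma_x=2.00)
# Equation 9.32 in reverse: re-standardizing recovers d_delta.
d_delta_check = b1_diff * 2.00 / 1.50
```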
Linking inestimable parameters to the standardized mean difference provides a
practical heuristic for specifying coefficients, but it is still incumbent on the researcher
to choose values that are reasonable for a given application. Moreover, it is incorrect
to view “small” values of d and dΔ as unimportant, as standardized differences of this
magnitude could be quite salient in many situations. For example, consider a random-
ized intervention where the true effect size is d = 0.25. Setting dΔ to 0.50 in Equation
9.31 (Cohen’s medium effect size benchmark) is equivalent to saying that the moderat-
ing impact of missing data is twice as large as the intervention effect itself. The medium
effect size threshold is probably an upper bound for many practical applications, and
much smaller values of dΔ could be realistic.

Important Assumptions
In contrast to the selection model, which requires strict and untestable distributional
assumptions, correctly specifying the missingness process is the primary barrier to get-
ting good results from a pattern mixture model. In practice, this means selecting the
correct set of interaction terms (the missing data indicator could moderate the influence
of any regressor) and providing accurate values for all inestimable parameters. Even
with simple heuristics for deriving these quantities, there is no way to verify whether
our choices increase or decrease nonresponse bias. While this may seem like a serious
disadvantage, methodologists have argued that the transparency of the model’s assump-
tions is actually its strength (Little, 2009). Rather than having to iteratively fit models
and scour computer output for subtle clues that may signal a misspecification or identi-
fication problem, you simply lay your statistical cards on the table and estimate a model
that reflects the presumed missingness process.

Averaging over the Distribution of the Missingness Indicator


Returning to the factorization from Equation 9.4, the f(Mi) term is a model that describes
the missing data pattern proportions. For example, with two missing data patterns, this
function corresponds to an empty probit (or logistic) regression model like the one below:

$M_i^* = \gamma_0 + r_i$  (9.34)
$r_i \sim N_1(0, 1)$

and the predicted probability of missing data is the area above the threshold in a stan-
dard normal distribution.

$\Pr(M_i = 1) = 1 - \Phi(-\gamma_0) = \Phi(\gamma_0)$  (9.35)

This proportion and its complement determine the weights for computing marginal esti-
mates that average over the missing data patterns (see Equations 9.25 and 9.27).
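A minimal sketch of this weighting scheme, using only the standard library; the probit intercept below is a hypothetical value chosen to imply roughly a 23% missing data rate:

```python
from math import erf, sqrt

def probit_prob(gamma0):
    # Equation 9.35: Pr(M = 1) = Phi(gamma0), the standard normal CDF.
    return 0.5 * (1.0 + erf(gamma0 / sqrt(2.0)))

def marginal(b_complete, b_dropout, pi):
    # Marginal coefficient as a weighted mean over the two patterns.
    return (1.0 - pi) * b_complete + pi * b_dropout

pi = probit_prob(-0.727)          # hypothetical probit intercept
beta0 = marginal(4.49, 5.27, pi)  # pattern-specific intercepts
```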
Fitting a separate model to estimate the observed missing data rate might seem like
overkill, but doing so facilitates standard error computations. Returning to Equation
9.27, the coefficients of interest are functions of the pattern proportions and group-­
specific intercepts and slopes. Similarly, the standard errors (or posterior standard devi-
ations) of β̂0 and β̂1 are composite functions that depend on the sampling variances
(or posterior variances) of the γ̂’s and the π̂’s. Some software packages that implement
maximum likelihood estimation offer facilities for computing auxiliary parameters that
are functions of the estimated model's parameters, and these programs use the multivariate
delta method (Raykov & Marcoulides, 2004) to combine the squared standard errors of
the proportions and pattern-­specific coefficients into measures of uncertainty for β̂0 and
β̂1. Bayesian estimation software packages often offer similar functionality for defining
auxiliary parameters, the posterior distributions of which naturally reflect uncertainty
in their constituent parts.
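The delta method computation itself is straightforward. The sketch below makes the simplifying assumption that the estimates are uncorrelated (the full multivariate version adds covariance terms), and all numeric values are hypothetical:

```python
from math import sqrt

def delta_method_se(grad, variances):
    # First-order delta method for f(theta), assuming uncorrelated
    # estimates: Var(f) ~= sum_j (df/dtheta_j)^2 * Var(theta_j).
    return sqrt(sum(g * g * v for g, v in zip(grad, variances)))

# For beta0 = (1 - pi) * b0 + pi * b1, the partial derivatives are
# 1 - pi (w.r.t. b0), pi (w.r.t. b1), and b1 - b0 (w.r.t. pi).
b0, b1, pi = 4.49, 5.27, 0.23
grad = [1 - pi, pi, b1 - b0]
se_beta0 = delta_method_se(grad, [0.17**2, 0.30**2, 0.02**2])
```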
Absent software to do the heavy lifting, you would need to compute standard errors
by hand. Enders (2010, pp. 309–312) and Hedeker and Gibbons (1997, p. 74) show
worked examples of this process. Demirtas and Schafer (2003) suggest multiple imputa-
tion as an alternative to explicitly pooling over the missing data patterns. To implement
this procedure, you would use the pattern mixture model from Equation 9.26 to impute
the missing values (i.e., model-based multiple imputation), after which you would fit
the focal regression analysis to the filled-­in data. The second-­stage analysis does not
require the missing data indicators, because the MNAR process is already embedded in
the imputations, and applying Rubin’s (1987) rules to the imputation-­specific estimates
of β0 and β1 gives pooled values that average over the missingness process.
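Rubin's pooling rules are easy to state in code. This sketch uses hypothetical imputation-specific estimates; the point estimate is the average across imputations, and the total variance adds the between-imputation spread to the average squared standard error:

```python
from statistics import mean, variance

def rubins_rules(estimates, ses):
    # Rubin's (1987) rules: pooled estimate is the mean across the m
    # imputations; total variance = within + (1 + 1/m) * between.
    m = len(estimates)
    within = mean(s ** 2 for s in ses)
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return mean(estimates), total ** 0.5

# Hypothetical slope estimates from m = 5 imputed data sets
est = [-1.52, -1.47, -1.55, -1.49, -1.50]
ses = [0.18, 0.17, 0.18, 0.19, 0.18]
pooled, pooled_se = rubins_rules(est, ses)
```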

Sensitivity Analyses
Fitting selection models requires a researcher to proactively search for a model that has sup-
port from the data, while looking for subtle clues that signal a misspecification or identifica-
tion problem. Pattern mixture models are very different, because they are easy to estimate
once you supply values for the inestimable parameters (or impose comparable constraints).
The main challenge is supplying accurate information about the unknown coefficients, as
the validity of the results hinges on a correct (or approximately so) specification. Although
it might be possible to formulate directional hypotheses about group mean differences in
some situations (e.g., participants with missing data have a higher depression average),
selecting specific values for the inestimable parameters is very challenging.
Rather than trying to select the optimal quantities for the inestimable parameters,
you can instead conduct a sensitivity analysis that considers a range of plausible coef-
ficient values. Little (2009, p. 428) recommends inducing distributional differences of
± 0.20 or 0.50 residual standard deviation units, and earlier formulas facilitate this strat-
egy. You could also vary the strength of the MNAR process across an entire response
surface by refitting a model with d and dΔ values between –0.50 and +0.50. Tracking
changes to the focal model parameters can answer two important practical questions:
“How much does the missing data distribution need to differ to meaningfully change
MAR-based estimates?” and “How big a difference is needed to change significance test
results?” The analysis examples in the next section illustrate such a procedure.
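Generating the candidate values for such a sensitivity analysis is a one-liner; the sketch below pairs each standardized effect size with the raw mean difference it implies under Equation 9.29 (sigma_y = 1.57 is the MAR-based estimate reported in the next section's example):

```python
# Effect sizes from -0.50 to +0.50 in increments of 0.10
sigma_y = 1.57
d_values = [round(-0.5 + 0.1 * k, 1) for k in range(11)]
grid = {d: d * sigma_y for d in d_values}  # d -> implied beta0(diff)
```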

9.8 PATTERN MIXTURE MODEL ANALYSIS EXAMPLES

I use the psychiatric trial data on the companion website to illustrate pattern mixture
models for multiple regression. Equation 9.16 shows the focal analysis model. Reducing
a potentially large number of missing data patterns into a manageable number of groups
is often the starting point for these models, as specifying many pattern-­specific differ-
ences is both unwieldy and unnecessary. Table 9.3 shows the nine missing data pat-
terns for the repeated outcome variable, with O and M indicating observed and missing
values, respectively. Following Hedeker and Gibbons (1997), I classify participants as
“completers” or “dropouts” based on the presence or absence of an illness severity rating
at the 6-week follow-­up. The completer pattern combines a large group of participants
with fully observed data (Pattern 1) and several smaller groups with intermittent miss-
ing values (Patterns 5, 6, 7, and 9), and the dropout pattern primarily reflects monotone
missingness where participants permanently leave the study (Patterns 2, 3, 4, and 8
in bold typeface). Pattern reduction decisions are important and potentially impact-
ful, because they define the qualitatively different subpopulations that presumably have
unique parameter values. The earlier analysis examples showed that conditioning on the
1-week and 3-week follow-­up ratings was important, so I again leveraged these interme-
diate assessments as auxiliary variables. Following Equation 9.17, I used a pair of linear
regression models to link the auxiliary variables to the focal variables, and these equa-
tions also included the dropout indicator.

TABLE 9.3. Missing Data Patterns from the Schizophrenia Trial Data
Pattern % sample Baseline Week 1 Week 3 Week 6
1 71.40 O O O O
2 0.69 O M M M
3 10.30 O O M M
4 12.13 O O O M
5 0.69 M O O O
6 1.14 O M O O
7 2.97 O O M O
8 0.23 O M O M
9 0.46 O M M O

Note. Rows in bold typeface are “dropouts”; all other patterns are “completers.”

Analysis 1: Focused MNAR Process


The pattern mixture model for a focused process features a group mean difference but
common regression slopes, as follows:

$\text{SEVERITY6}_i = \beta_0^{(0)} + \beta_0^{(\text{diff})}(M_i) + \beta_1(\text{DRUG}_i) + \beta_2(\text{SEVERITY0}_i - \mu_2) + \beta_3(\text{MALE}_i - \mu_3) + \varepsilon_i$  (9.36)
Centering baseline severity scores and the gender dummy code defines β0(0) as the pla-
cebo group mean for the completers (DRUG = 0 and M = 0) and β0(diff) as the inestimable
mean difference for the dropout pattern. I used Equation 9.29 to obtain β0(diff) coefficients
for standardized effect size values between d = –0.50 and +0.50 in increments of 0.10.
This process created 11 coefficients that varied the strength and direction of the MNAR
process, such that the dropout group’s illness severity distribution differed by up to half
a standard deviation in either direction. I then fit a pattern mixture model for each effect
size and computed the placebo group average as a weighted mean of the pattern-­specific
intercepts (see Equation 9.27). The empty probit model from Equation 9.34 and the cor-
responding probability function in Equation 9.35 provided the pattern proportions and
standard errors for pooling. Analysis scripts are available on the companion website.
Table 9.4 shows pattern-­specific and marginal parameter estimates for each effect
size. The middle row with d = 0 corresponds to a conditionally MAR process where
dropouts and completers share the same model parameters, the top row of the table
depicts a MNAR process where the mean of the missing scores is lower by roughly half
a standard deviation unit (σ̂Y = 1.57), and the bottom row shows a mechanism where
the mean is higher by a corresponding amount. Unlike the selection model, which relies
on opaque assumptions that operate behind the scenes, the pattern mixture model is
fully transparent about its propositions; each row of the table represents a hypothetical
scenario that varies the strength of the focused MNAR process.

TABLE 9.4. Pattern Mixture Model Estimates Assuming a Focused MNAR Process
Effect size (d) β0(0) β0(1) β0 SE β1 SE
–0.5 4.49 3.71 4.31 0.17 –1.50 0.18
–0.4 4.49 3.87 4.35 0.17 –1.50 0.18
–0.3 4.49 4.02 4.38 0.17 –1.50 0.18
–0.2 4.49 4.18 4.42 0.17 –1.50 0.18
–0.1 4.49 4.33 4.45 0.17 –1.50 0.18
0 4.49 4.49 4.49 0.17 –1.50 0.18
0.1 4.49 4.65 4.53 0.17 –1.50 0.18
0.2 4.49 4.80 4.56 0.17 –1.50 0.18
0.3 4.49 4.96 4.60 0.17 –1.50 0.18
0.4 4.49 5.12 4.64 0.17 –1.50 0.18
0.5 4.49 5.27 4.67 0.17 –1.50 0.18

Considering a range of plausible effects allows you to determine how large the
pattern-­specific differences need to be to alter significance tests or meaningfully change
estimates from a conditionally MAR analysis. Table 9.4 shows that the dropout group’s
distribution would need to differ by at least ±0.30 standard deviation units to alter the
MAR-based estimate of β0 by half a standard error unit, and a difference of at least ±0.50
standard deviations is necessary to change this parameter by an entire standard error.
A change of one standard error is quite large, because it implies that the missing data
process affects estimates by an amount equal to the expected sampling error. You may
recall that the diffuse selection model for these data induced similarly large changes to
some parameters.
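The weighted-mean logic behind Table 9.4 can be reproduced by hand. The sketch below uses the dropout proportion implied by Table 9.3 (patterns 2, 3, 4, and 8 sum to 23.35%) and the reported standard deviation:

```python
beta0_complete = 4.49  # completer-pattern intercept from Table 9.4
sigma_y = 1.57         # MAR-based standard deviation estimate
pi = 0.2335            # dropout proportion from Table 9.3

def marginal_beta0(d):
    # Weighted mean of the pattern intercepts; the dropout pattern's
    # intercept is beta0_complete + d * sigma_y (Equation 9.29).
    return beta0_complete + pi * d * sigma_y

print(round(marginal_beta0(0.0), 2))   # 4.49 (conditionally MAR row)
print(round(marginal_beta0(0.5), 2))   # 4.67
print(round(marginal_beta0(-0.5), 2))  # 4.31
```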

Analysis 2: Diffuse MNAR Process


A pattern mixture model for a diffuse MNAR process features a pattern mean difference
and one or more interaction terms between the missing data indicator and predictor
variables. The regression model for this example specifies pattern-­specific treatment
effects via the product of the missingness indicator and treatment assignment dummy
code as follows:

$\text{SEVERITY6}_i = \beta_0^{(0)} + \beta_0^{(\text{diff})}(M_i) + \beta_1^{(0)}(\text{DRUG}_i) + \beta_1^{(\text{diff})}(M_i)(\text{DRUG}_i) + \beta_2(\text{SEVERITY0}_i - \mu_2) + \beta_3(\text{MALE}_i - \mu_3) + \varepsilon_i$  (9.37)

Consistent with the previous analysis, centering baseline severity scores and the gender
dummy code defines β0(0) as the placebo group mean for the completers (DRUG = 0 and
M = 0) and β0(diff) as the inestimable mean difference for the dropout pattern. The β1(0)
coefficient represents the medication group mean difference for completers, and β1(diff) is
the additional medication effect among participants who dropped out.
I used Equation 9.29 to obtain β0(diff) coefficients for standardized effect size values
between d = –0.50 and +0.50 in increments of 0.10, and I used Equation 9.31 to obtain
β1(diff) coefficients for the same range of effects. This process created 121 unique param-
eter combinations that varied the strength and direction of the MNAR process across a
broad range of values, not all of which may be plausible. I estimated a pattern mixture
model for each combination of coefficients and computed focal model parameters that
average over the missing data patterns (see Equation 9.27). The empty probit model
from Equation 9.34 and the corresponding probability function in Equation 9.35 again
provided the pattern proportions and standard errors for pooling. Analysis scripts are
available on the companion website.
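The grid of effect size combinations is easy to construct; each (d, d_delta) pair defines one pattern mixture model to fit:

```python
from itertools import product

# 11 effect sizes for each inestimable parameter -> 11 x 11 = 121 models
effect_sizes = [round(-0.5 + 0.1 * k, 1) for k in range(11)]
combos = list(product(effect_sizes, repeat=2))
```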
Figure 9.9 is a heat map that summarizes changes to the focal model’s intercept coef-
ficient (the placebo group mean, β0) across the 121 effect size combinations. The circle
in the middle of the plot represents a conditionally MAR mechanism where both effect
sizes equal zero, and the colors get progressively darker as the change to the pooled esti-
mate increases. The white squares denote estimates that differ by less than one-fourth of
a standard error from the MAR analysis, whereas black squares are estimates that differ
by more than one standard error. Broadly speaking, the greatest distortions occur when
the dropout group’s distribution differs by at least ±0.40 standard deviation units along
[Figure 9.9 appears here: heat map with the standardized pattern mean difference on the horizontal axis and the standardized pattern slope difference on the vertical axis, each ranging from –0.6 to 0.6.]

FIGURE 9.9. Heat map summarizing changes to the focal model’s intercept coefficient across
the 121 effect size combinations. The circle in the middle of the plot represents a conditionally
MAR mechanism, and the colors get progressively darker as the difference between the pattern
mixture model and MAR-based estimates gets larger. The white squares denote estimates that
differ by less than one-fourth of a standard error, whereas black squares are estimates that dif-
fer by more than one standard error. Broadly speaking, the greatest distortions occur when the
dropout group’s distribution differs by at least ±0.40 standard deviation units along the horizon-
tal axis.

TABLE 9.5. Pattern Mixture Model Estimates Assuming a Diffuse MNAR Process
d d∆ β0(0) β0(1) β0 SE β1(0) β1(1) β1 SE
0.3 –0.5 4.43 3.96 4.32 0.17 –1.42 –2.20 –1.60 0.18
0.3 –0.4 4.44 3.97 4.33 0.17 –1.43 –2.06 –1.58 0.18
0.3 –0.3 4.45 3.98 4.34 0.17 –1.45 –1.92 –1.56 0.18
0.3 –0.2 4.47 4.00 4.36 0.17 –1.46 –1.78 –1.54 0.18
0.3 –0.1 4.48 4.01 4.37 0.17 –1.48 –1.64 –1.52 0.18
0.3 0 4.49 4.02 4.38 0.17 –1.50 –1.50 –1.50 0.18
0.3 0.1 4.50 4.03 4.39 0.17 –1.51 –1.36 –1.48 0.18
0.3 0.2 4.52 4.05 4.41 0.17 –1.53 –1.22 –1.46 0.18
0.3 0.3 4.53 4.06 4.42 0.17 –1.54 –1.08 –1.44 0.18
0.3 0.4 4.54 4.07 4.43 0.17 –1.56 –0.93 –1.41 0.18
0.3 0.5 4.55 4.08 4.44 0.17 –1.58 –0.79 –1.39 0.18

the horizontal axis, in which case β0 differs from the MAR-based estimate by at least
half a standard error unit (i.e., most of the boxes are dark gray to black). A similar heat
map for the β1 coefficient revealed that except for the dΔ = ±0.50 conditions, the MNAR
slope coefficient always differed from its MAR-based counterpart by less than half a
standard error unit. Moreover, the medication group difference was always statistically
significant, indicating that even a very strong MNAR process was incapable of altering
the substantive conclusion that the medication was beneficial.
Figure 9.9 also provides information about specific processes that might be plausi-
ble for this application. For example, consider the possibility that placebo group partici-
pants with the most acute symptoms (and highest severity scores) leave the study to seek
treatment elsewhere, whereas medication group participants with mildest symptoms
(and lowest severity scores) leave the study, because they no longer feel treatment is nec-
essary. This scenario corresponds to positive values along the horizontal axis (placebo
group participants with missing data have a higher mean) and negative values along the
vertical axis (medication group participants with missing data have a lower mean than
those with complete data). To illustrate this scenario in more detail, Table 9.5 shows
the pattern-­specific and marginal coefficients from the vertical slice of squares located
at +0.30 on the horizontal axis (i.e., the placebo group mean for the dropout pattern is
higher by three-­tenths of a standard deviation). For comparison, the MAR analysis pro-
duced intercept and slope estimates of β̂0 = 4.49 (SE = 0.17) and β̂1 = –1.50 (SE = 0.18),
and MNAR estimates that differ by more than half a standard error are highlighted with
bold typeface.

Summary
Compared to selection models, pattern mixtures provide a very different vehicle for con-
ducting sensitivity analyses. The examples illustrated a process that varied the type,
direction, and strength of the missingness process across a broad range of “what if” sce-
narios. I used these results to answer two useful practical questions: “How much does the
missing data distribution need to differ to meaningfully change MAR-based estimates?”
and “How big a difference is needed to change significance test results?” With respect to
the latter, the treatment group difference was always statistically significant, even with
a very strong MNAR process. The impact of the missing data on the placebo group mean
was somewhat variable, but the dropout pattern’s distribution generally needed to differ
by more than ±0.30 standard units to see large distortions in this parameter.

9.9 LONGITUDINAL DATA ANALYSES

A substantial amount of methodological work has focused on longitudinal analyses for
MNAR mechanisms, especially in the biostatistics and clinical studies literatures where
such processes are particularly germane. The selection and pattern mixture modeling
frameworks readily extend to longitudinal data, and a hybrid approach called the shared
parameter model shares features with both. The remainder of the chapter describes these
models in the context of longitudinal growth curve analyses. Several resources provide
accessible and detailed overviews of these modeling frameworks for longitudinal data
(Albert & Follmann, 2009; Enders, 2011; Feldman & Rabe-­Hesketh, 2012; Hedeker &
Gibbons, 1997; Little, 2009; Molenberghs & Verbeke, 2001; Muthén et al., 2011; Xu &
Blozis, 2011).
A longitudinal growth curve model is a regression where repeated measurements are
a function of a temporal predictor that captures the passage of time at each measurement
occasion (e.g., days, weeks, months). Building on the chronic pain data example, sup-
pose that researchers collect a pretest measure of depression and two monthly follow-­up
assessments during an intervention period. A growth curve model for such data describes
the expected monthly change in depression, as well as individual differences around that
average change trajectory. To facilitate the interpretation, researchers usually code one of
the measurement occasions as 0 and set the others relative to that fixed point. One com-
mon option is to code measurement occasions relative to the baseline assessment (e.g.,
MONTH = 0, 1, 2), and another reflects these “time scores” relative to the final measure-
ment (e.g., MONTH = –2, –1, 0). I use the former definition for the ensuing examples.
Using generic notation, the multilevel linear growth curve model is

$Y_{ti} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})(\text{MONTH}_{ti}) + \varepsilon_{ti}$  (9.38)

where Yti is the outcome (e.g., depression) score at occasion t for individual i, β0 is the
predicted baseline average when MONTH = 0, and β1 is the average monthly change
rate. Everyone has a unique linear trajectory, and the b0i and b1i terms are latent vari-
ables or random effects that reflect deviations between the individual intercepts and
slopes and the corresponding average coefficients. Finally, εti is a time-­specific residual
that captures the difference between an individual’s own linear trajectory and his or
her repeated measurements. By assumption, the individual intercepts and slopes are
bivariate normal with a covariance matrix Σb, and the within-­person residuals are nor-
mally distributed around the individual trajectories with constant variance σε2. Adding a
between-­person (time-­invariant) predictor of the individual intercepts and slopes (e.g., a
treatment assignment indicator) gives the following model with a so-­called “cross-level
interaction effect”:

$Y_{ti} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})(\text{MONTH}_{ti}) + \beta_2(X_i) + \beta_3(X_i)(\text{MONTH}_{ti}) + \varepsilon_{ti}$  (9.39)

In this model, β0 and β1 give the predicted baseline score and average monthly change rate
for a person with X = 0 (e.g., the control group if X is binary), and β2 and β3 are intercept
and slope differences.
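To make the model concrete, the following sketch simulates data from Equation 9.39 using only standard library tools; all population values are hypothetical, and the random intercepts and slopes are generated as uncorrelated for simplicity (the model itself allows them to covary):

```python
import random

def simulate_growth(n, beta, sd_b0=2.0, sd_b1=0.5, sd_e=1.0, seed=1):
    # Equation 9.39: Y_ti = (b0 + b0i) + (b1 + b1i) * month
    #                       + b2 * x + b3 * x * month + e_ti
    b0, b1, b2, b3 = beta
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        x = rng.randint(0, 1)        # person-level predictor (e.g., treatment)
        b0i = rng.gauss(0.0, sd_b0)  # individual intercept deviation
        b1i = rng.gauss(0.0, sd_b1)  # individual slope deviation
        for month in (0, 1, 2):
            y = (b0 + b0i) + (b1 + b1i) * month + b2 * x \
                + b3 * x * month + rng.gauss(0.0, sd_e)
            rows.append((i, month, x, y))
    return rows

# Hypothetical population values loosely patterned on the depression example
data = simulate_growth(n=100, beta=(16.0, -1.0, 0.0, -0.5))
# 100 people x 3 occasions = 300 records of (id, month, x, y)
```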
The same longitudinal analysis can be specified as a latent curve model in the struc-
tural equation modeling framework. This specification views repeated measurements
as manifest indicators of an intercept and slope latent factor with fixed loadings that
encode the passage of time. To illustrate, Figure 9.10 shows a path diagram of the linear
growth model with a person-­level predictor of growth. Consistent with standard path
diagram conventions, ellipses represent latent variables, rectangles denote manifest (i.e.,
measured) variables, single-­headed straight arrows symbolize regression coefficients,
and double-­headed curved arrows are correlations. The latent variables represent the
individual intercepts and slopes, and the factor means (not shown in the diagram) cor-
respond to the β0 and β1 coefficients from the multilevel model. The unit factor loadings
connecting the intercept latent factor to the repeated measurements reflect the constant
influence of this term at each time point, and the fixed slope factor loadings encode
the monthly time scores (i.e., MONTH = 0, 1, 2). While the latent curve version of the
analysis requires a different data structure and model specification, it is equivalent to
the multilevel model and gives identical estimates. Interested readers can consult work
by Mehta and colleagues (Mehta & Neale, 2005; Mehta & West, 2000) for a synopsis of
these linkages.

[Figure 9.10 appears here: path diagram with Intercept and Slope latent factors predicting the repeated measurements Y1, Y2, and Y3; the intercept loadings are fixed at 1, and the slope loadings are fixed at 0, 1, and 2.]

FIGURE 9.10. Path diagram depicting a linear growth model with a person-level predictor of growth. Ellipses represent latent variables, rectangles denote manifest (i.e., measured) variables, single-headed straight arrows symbolize regression coefficients, and double-headed curved arrows are correlations.

Types of Missingness
Little (1995) described two distinct types of MNAR data in longitudinal settings.
Outcome-­dependent missingness occurs when the unseen value of the dependent vari-
able at a particular measurement occasion predicts concurrent nonresponse. Applied
to the depression example, this type of missing data could occur if a sudden spike in
depressive symptoms at the 1-month (or 2-month) follow-­up prompts some partici-
pants to quit the study and seek treatment elsewhere. In contrast, random coefficient-­
dependent missingness occurs when one’s underlying growth trajectory is responsible
for missing data rather than time-­specific realizations of the dependent variable. For
example, participants experiencing the most rapid declines in depressive symptoms
might quit the study, because they judge that treatment is no longer necessary, whereas
individuals with elevated and flatter trajectories might drop out to seek treatment else-
where. This type of missingness leverages one’s entire score history, including unseen
future measurements that could potentially relate to missingness if the outcome is mea-
sured with error (Demirtas & Schafer, 2003). These processes call for different models,
and both could be plausible for the same analysis.

Coding Missing Data Indicators


The classic growth models for MNAR data assume a monotone missingness pattern
where individuals with missing data at a particular occasion are always missing subse-
quent measurements. This missing data pattern requires binary indicators that code for
dropout or attrition rather than the absence of data. The depression example has up to
three dropout patterns with the following coding scheme: individuals who complete the
study have indicator scores equal to M = (M1, M2, M3) = (0, 0, 0), individuals who quit the
study after the baseline assessment are coded M = (0, 1, missing), and participants who
drop out prior to the final wave have codes of M = (0, 0, 1). An individual who missed the
1-month follow-­up but returned for the final assessment would have the same codes as
a participant with complete data. For those who quit the study after the baseline assess-
ment, the missing M3 code removes these individuals from further consideration, such
that the probability of missing data at the final assessment reflects just the people who
have yet to quit. When used as dependent variables in a selection model, these dropout
indicators define the conditional probability that a participant will drop out of the study
at occasion t, given that she or he was still in the study at the prior measurement occasion.
This interpretation is consistent with a hazard probability from the discrete-­time sur-
vival modeling framework (Muthén & Masyn, 2005; Singer & Willett, 2003).
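A small helper makes the coding scheme explicit for the three-wave example; the function below is a hypothetical illustration, not part of any package:

```python
def dropout_codes(observed):
    # observed = (baseline, 1-month, 2-month) booleans. Intermittent
    # missingness counts as completion; once a participant drops out,
    # later indicators are coded as missing (None).
    if observed[2]:            # final wave observed -> completer
        return (0, 0, 0)
    if observed[1]:            # left the study before the final wave
        return (0, 0, 1)
    return (0, 1, None)        # quit after the baseline assessment

assert dropout_codes((True, True, True)) == (0, 0, 0)    # completer
assert dropout_codes((True, False, True)) == (0, 0, 0)   # intermittent
assert dropout_codes((True, True, False)) == (0, 0, 1)
assert dropout_codes((True, False, False)) == (0, 1, None)
```

Used as dependent variables in a selection model, these codes yield the discrete-time hazard interpretation described above.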
Dropout indicators work well for longitudinal studies with a predominantly
monotone missing data pattern, but growth models readily accommodate other cod-
ing schemes. For example, in lieu of multiple dropout indicators, Feldman and Rabe-­
Hesketh (2012) proposed a single-­indicator approach that codes whether a participant
ever dropped out, and Schluchter (1992) used a single continuous measure of time to
dropout. For data with a preponderance of intermittent missing values, you can use a
conventional set of time-­varying missing data indicators that code participants as Mt =
0 or Mt = 1 at each occasion t (Albert & Follmann, 2009), and multicategorical indica-
tor variables that encode different types of missingness are also a possibility (Albert,
­Follmann, Wang, & Suh, 2002). Little (1995) discusses some of these alternatives.

9.10 DIGGLE–KENWARD SELECTION MODEL

The Diggle–­Kenward selection model (Diggle & Kenward, 1994) for outcome-­dependent
missing data links missingness at an occasion t to the concurrent unseen score values
and observed data from prior occasions. Like its siblings from earlier in the chapter, the
model augments the focal analysis (a growth curve model) with additional regression
equations that explain missingness. To illustrate, Figure 9.11 shows a path diagram of
a three-wave structural equation model that includes a between-­person predictor of the
individual intercepts and slopes. The rectangles labeled M2 and M3 are binary discrete-­
time survival indicators that code dropout at the 1- and 2-month follow-­up assessments,
respectively (M1 is not needed, because baseline scores are complete). The dashed lines
are logistic or probit regressions that link the probability of dropout at wave t to the out-
come scores at that occasion, and the dotted paths add a MAR component that connects
dropout to scores at the prior occasion.
The diagram’s missingness model corresponds to the following regression equa-
tions:

$M_{2i}^* = \gamma_{02} + \gamma_1 Y_{2i} + \gamma_2 Y_{1i} + r_{2i}$  (9.40)
$M_{3i}^* = \gamma_{03} + \gamma_1 Y_{3i} + \gamma_2 Y_{2i} + r_{3i}$
$r_{ti} \sim N_1(0, 1)$

Diggle and Kenward (1994) used logistic selection models, but I use probit models for
consistency with earlier material. The equations feature occasion-­specific intercepts that
allow dropout probabilities to vary over time, but they impose equality constraints on the
concurrent and lagged effects (i.e., the γ1 and γ2 take on the same value in both equations).
Estimating time-­specific effects would introduce complex outcome-­by-­occasion interac-
tions that may be difficult to estimate. This model reflects a diffuse MNAR process where
the observed data from prior occasions uniquely predict missingness beyond the unseen
score values. The path diagram shows that X’s influence on missingness is transmitted
indirectly via the latent variables and repeated measurements, which effectively function
as mediators. An even more diffuse model would include direct pathways between the
predictor and dropout indicators, and a model for a focused process would omit the lagged
effects from the prior measurement occasion. Consistent with previous selection models,
significance tests of the γ coefficients are not trustworthy and do not provide a way to eval-
uate the missingness mechanism (Jansen et al., 2006; Molenberghs & Kenward, 2007).
Specifying the Diggle–­Kenward model in the multilevel framework requires soft-
ware that can estimate multilevel path models with categorical and continuous out-
comes (Keller & Enders, 2021; Muthén & Muthén, 1998–2017). The data structure for a
multilevel growth curve model features the repeated measurements stacked in a single
column. Table 9.6 shows the data setup for two hypothetical participants from a three-wave study.

Missing Not at Random Processes 383

FIGURE 9.11. Path diagram depicting the Diggle–Kenward growth model. The rectangles labeled M2 and M3 are binary missing data indicators, dashed lines are probit regressions that link the probability of dropout at wave t to the outcome scores at that occasion, and the dotted paths add an MAR component that connects dropout to scores at the prior occasion.

The corresponding selection equation features a column of dropout indicators regressed on time-specific dummy codes, the dependent variable, and a lagged version of the outcome, as follows:

$$M^{*}_{ti} = \gamma_{01} T_{1i} + \gamma_{02} T_{2i} + \gamma_{03} T_{3i} + \gamma_1 Y_{ti} + \gamma_2 Y_{li} + r_{ti} \qquad (9.41)$$
$$r_{ti} \sim N_1(0, 1)$$

The Tti terms are binary variables that code the three measurement occasions, such that
Tt equals 1 in rows that correspond to measurement occasion t, and 0 otherwise. These
variables essentially function like on–off switches that initiate occasion-specific intercept coefficients capturing time-related changes to the missing data rates.

TABLE 9.6. Hypothetical Data for a Multilevel Diggle–Kenward Model
i t MONTH T1 T2 T3 Yt Yl
1 1 0 1 0 0 15 —
1 2 1 0 1 0 12 15
1 3 2 0 0 1 11 12
2 1 0 1 0 0 18 —
2 2 1 0 1 0 19 18
2 3 2 0 0 1 16 19

384 Applied Missing Data Analysis

If the baseline
scores are complete or nearly so, fixing γ01 at a large negative z-value during estimation
induces a zero missingness probability. Finally, Yli is a lagged version of the dependent
variable that pairs each Yti with the participant’s score from the prior measurement occa-
sion. This variable is always missing at the baseline measurement.
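The stacked data layout in Table 9.6 is straightforward to construct programmatically. The sketch below (plain Python; `person_period_rows` is a hypothetical helper, not part of any package) reproduces participant 1's rows, with `None` standing in for the missing lagged score at baseline:

```python
def person_period_rows(person_id, months, scores):
    """Stack one participant's repeated measures with on-off time
    dummies T1-T3 and a lagged copy of the outcome (Table 9.6 layout)."""
    rows = []
    for t, (month, y) in enumerate(zip(months, scores), start=1):
        t1, t2, t3 = (1 if t == k else 0 for k in (1, 2, 3))
        y_lag = scores[t - 2] if t > 1 else None  # always missing at baseline
        rows.append((person_id, t, month, t1, t2, t3, y, y_lag))
    return rows

# Participant 1 from Table 9.6
rows = person_period_rows(1, months=[0, 1, 2], scores=[15, 12, 11])
# → [(1, 1, 0, 1, 0, 0, 15, None),
#    (1, 2, 1, 0, 1, 0, 12, 15),
#    (1, 3, 2, 0, 0, 1, 11, 12)]
```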

9.11 SHARED PARAMETER (RANDOM COEFFICIENT) SELECTION MODEL

The shared parameter model addresses random coefficient-dependent missingness by using individual random intercepts and slopes (the b0i and b1i terms in Equation 9.38)
as predictors of dropout (Albert & Follmann, 2009; Wu & Carroll, 1988). To illustrate,
Figure 9.12 shows a path diagram of the shared parameter structural equation model
for a hypothetical three-wave study. I omit the covariate from the diagram, but this
variable could predict the growth factors, missingness indicators, or both. The rect-
angles labeled M2 and M3 are again binary (discrete-­time survival) dropout indicators.
The dashed lines are logistic or probit regressions that link the probability of drop-
out to the random intercepts, and the dotted paths connect dropout to the individual
slopes. The diagram’s missingness model corresponds to the following probit regres-
sion equations:

$$M^{*}_{2i} = \gamma_{02} + \gamma_1 b_{0i} + \gamma_2 b_{1i} + r_{2i} \qquad (9.42)$$
$$M^{*}_{3i} = \gamma_{03} + \gamma_1 b_{0i} + \gamma_2 b_{1i} + r_{3i}$$
$$r_{ti} \sim N_1(0, 1)$$

where γ02 and γ03 are occasion-­specific intercepts that allow the missingness rates to vary
across time, and γ1 and γ2 are shared slope coefficients (i.e., parameters with equality
constraints). The equivalent multilevel selection equation is as follows:

$$M^{*}_{ti} = \gamma_{01} T_{1i} + \gamma_{02} T_{2i} + \gamma_{03} T_{3i} + \gamma_1 b_{0i} + \gamma_2 b_{1i} + r_{ti} \qquad (9.43)$$
$$r_{ti} \sim N_1(0, 1)$$

Finally, note that random coefficient-dependent missingness is an inherently diffuse process that depends on one's entire response history, including observed data from earlier measurement occasions and hypothetical scores from future assessments (Demirtas
& Schafer, 2003; Little, 1995). An even more diffuse missingness model could include
time-­varying or time-­invariant predictors.
The phrase “shared parameter” owes to the fact that a shared set of predictor variables, in this case the latent variables, links the repeated measurements to the missing data indicators. Formally, the model factorizes the joint distribution of the repeated measurements, missing data indicators, and random effects into the product of three univariate distributions:

FIGURE 9.12. Path diagram depicting the shared parameter growth model. The rectangles
labeled M2 and M3 are binary missing data indicators, dashed lines are probit regressions that
link the probability of dropout to the random intercepts, and the dotted paths connect dropout
to the individual slopes.

$$f(Y_i, M_i, b_i) = f(Y_i \mid b_i, M_i) \times f(M_i \mid b_i) \times f(b_i) \qquad (9.44)$$

where Y represents the repeated measurements, M contains the missing data indica-
tors, and b denotes the intercepts and slopes (the b0i and b1i terms in Equation 9.38). By
assumption, the repeated measurements and missing data indicators are conditionally
independent after controlling for the random intercepts and slopes, which simplifies the
factorization as follows:

$$f(Y_i \mid b_i) \times f(M_i \mid b_i) \times f(b_i) \qquad (9.45)$$

The shared parameter b plays the key role by absorbing the MNAR linkage between the
outcomes and indicators. This feature is evident in Figure 9.12, where the repeated mea-
surements and indicators are connected (correlated), because they share the intercept
and slope latent factors as common predictors. Methodologists have described several
variants of this approach that instead use latent class membership as a shared parameter
(Beunckens, Molenberghs, Verbeke, & Mallinckrodt, 2008; Gottfredson, Bauer, & Bald-
win, 2014; Muthén et al., 2011; Roy, 2003).
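Equation 9.45 also suggests how a person's likelihood contribution is obtained: integrate the product f(Y | b) × f(M | b) over the random-effect distribution. The sketch below does this by brute-force simulation (plain Python; every parameter value is illustrative, and real software would use adaptive quadrature or MCMC rather than raw Monte Carlo):

```python
import random
from math import erf, sqrt, exp, pi

def phi_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def shared_parameter_likelihood(y, m, times, beta, gamma, tau, sigma,
                                draws=5000, seed=1):
    """Monte Carlo version of Eq. 9.45: average f(Y|b) * f(M|b) over
    simulated b = (b0, b1). y holds observed scores (None = missing),
    m holds dropout indicators, gamma = (g0, g_b0, g_b1) per indicator."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        b0, b1 = rng.gauss(0, tau[0]), rng.gauss(0, tau[1])
        contrib = 1.0
        for t, yt in zip(times, y):              # f(Y | b): observed scores only
            if yt is not None:
                contrib *= normal_pdf(yt, beta[0] + b0 + (beta[1] + b1) * t, sigma)
        for (g0, gb0, gb1), mt in zip(gamma, m):  # f(M | b): probit indicators
            p = phi_cdf(g0 + gb0 * b0 + gb1 * b1)
            contrib *= p if mt == 1 else 1.0 - p
        total += contrib
    return total / draws

# One hypothetical dropout: observed at waves 1-2, missing at wave 3
lik = shared_parameter_likelihood(
    y=[15.0, 12.0, None], m=[0, 1], times=[0, 1, 2],
    beta=(14.0, -2.0), gamma=[(-2.0, 0.1, -0.4), (-1.5, 0.1, -0.4)],
    tau=(1.0, 0.5), sigma=1.0)
```

The conditional independence assumption appears in the code as the simple product of the two inner loops: once b is drawn, the outcome and indicator contributions multiply without any direct link between them.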

9.12 RANDOM COEFFICIENT PATTERN MIXTURE MODELS

Hedeker and Gibbons (1997) describe a random coefficient pattern mixture model
where missing data patterns form qualitatively different subgroups with distinct growth
trajectories but common random effect parameters. Their model casts missing data indicators as dummy codes that moderate the influence of one or more explanatory variables on the outcome.
Returning to the depression scenario, there are three dropout patterns: partici-
pants who (1) completed the study, (2) quit following the baseline assessment, and (3)
dropped out prior to the final assessment. The simplest incarnation of the pattern mix-
ture model uses a single binary indicator that classifies participants simply as “com-
pleters” or “dropouts” based on the presence or absence of data at the final measurement
occasion. Using generic notation, the fitted growth curve model is

$$Y_{ti} = \beta_0^{(0)} + \beta_1^{(0)}(\mathrm{MONTH}_{ti}) + \beta_0^{(\mathrm{diff})}(M_i) + \beta_1^{(\mathrm{diff})}(\mathrm{MONTH}_{ti})(M_i) + b_{1i}(\mathrm{MONTH}_{ti}) + b_{0i} + \varepsilon_{ti} \qquad (9.46)$$

where β0(0) and β1(0) are the intercept and slope (e.g., baseline average and monthly change
rate) for the completers in pattern M = 0, and β0(diff) and β1(diff) capture the difference in
the dropout group’s intercept and slope coefficients, respectively. The weighted averages
from Equation 9.27 give the overall (marginal) population estimates of β0 and β1.
Adding a between-­person (time-­invariant) predictor of the individual intercepts
and slopes (e.g., a treatment assignment indicator) gives the following model:

$$\begin{aligned} \mathrm{SEVERITY}_{ti} = {} & \beta_0^{(0)} + \beta_1^{(0)}(\mathrm{MONTH}_{ti}) + \beta_2^{(0)}(X_i) + \beta_3^{(0)}(\mathrm{MONTH}_{ti})(X_i) \\ & + \beta_0^{(\mathrm{diff})}(M_i) + \beta_1^{(\mathrm{diff})}(M_i)(\mathrm{MONTH}_{ti}) + \beta_2^{(\mathrm{diff})}(M_i)(X_i) \\ & + \beta_3^{(\mathrm{diff})}(M_i)(\mathrm{MONTH}_{ti})(X_i) + b_{0i} + b_{1i}(\mathrm{MONTH}_{ti}) + \varepsilon_{ti} \end{aligned} \qquad (9.47)$$


FIGURE 9.13. Path diagram depicting the random coefficient pattern mixture model, where
the rectangle labeled M is the binary dropout indicator. The structural equation model features
the missing data indicator, covariate, and the interaction of the two predicting the intercept and
slope growth factors.

The terms not involving M are the completer group’s parameters (these quantities have
the same interpretation as the βs from Equation 9.39), and terms that include M reflect
coefficient differences for the dropout group. Figure 9.13 shows a path diagram of the
corresponding structural equation model. Following Figure 9.2, the dashed line indi-
cates an interaction effect where the missing data indicator moderates the influence of
the predictor.

Specifying Inestimable Parameters


Pattern mixture models usually feature one or more inestimable parameters, but the simple linear models in Equations 9.46 and 9.47 are estimable from the data, because
combining the two dropout patterns creates a subgroup where some of its members
have two observations. This is enough information to estimate the linear slope, which
is effectively a change score that captures the difference between the baseline measure-
ment and 1-month follow-­up. Two measurements do not provide enough information to
estimate a full collection of pattern-­specific variances and covariances, but that part of
the model is shared among all groups.
While reducing the number of missing data patterns is often warranted and neces-
sary, representing missingness with just two categories runs the risk of mingling quali-
tatively different subgroups. Consider what happens in a more complex analysis that
treats the three missing data patterns as distinct. The pattern mixture model for the
linear growth curve is

$$\begin{aligned} Y_{ti} = {} & \beta_0^{(0)} + \beta_1^{(0)}(\mathrm{MONTH}_{ti}) + \beta_0^{(\mathrm{diff1})}(M_{2i}) + \beta_0^{(\mathrm{diff2})}(M_{3i}) \\ & + \beta_1^{(\mathrm{diff1})}(\mathrm{MONTH}_{ti})(M_{2i}) + \beta_1^{(\mathrm{diff2})}(\mathrm{MONTH}_{ti})(M_{3i}) + \text{residuals} \end{aligned} \qquad (9.48)$$
where M2 and M3 are dummy codes that indicate whether a participant quit prior to the
second or third assessment, respectively. The β0(0) and β1(0) coefficients again represent
the baseline depression level and average monthly growth rate for the completers; β0(diff1)
and β0(diff2) are the baseline mean differences for the M2 = 1 and M3 = 1 patterns, respec-
tively; and β1(diff1) and β1(diff2) capture group differences in the average change rates. The
population growth trajectory is now a weighted average over three missing data pat-
terns, as follows:

$$\begin{aligned} \beta_0 &= \pi^{(1)} \beta_0^{(0)} + \pi^{(2)} \big( \beta_0^{(0)} + \beta_0^{(\mathrm{diff1})} \big) + \pi^{(3)} \big( \beta_0^{(0)} + \beta_0^{(\mathrm{diff2})} \big) \\ &= \pi^{(1)} \beta_0^{(1)} + \pi^{(2)} \beta_0^{(2)} + \pi^{(3)} \beta_0^{(3)} \\ \beta_1 &= \pi^{(1)} \beta_1^{(0)} + \pi^{(2)} \big( \beta_1^{(0)} + \beta_1^{(\mathrm{diff1})} \big) + \pi^{(3)} \big( \beta_1^{(0)} + \beta_1^{(\mathrm{diff2})} \big) \\ &= \pi^{(1)} \beta_1^{(1)} + \pi^{(2)} \beta_1^{(2)} + \pi^{(3)} \beta_1^{(3)} \end{aligned} \qquad (9.49)$$

Creating and analyzing model-based multiple imputations is an alternative to explicitly pooling over the missing data patterns (Demirtas & Schafer, 2003).
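Given pattern proportions, the pooling in Equation 9.49 is a one-line weighted average. A minimal sketch with hypothetical proportions and pattern-specific slopes (none of these numbers come from the example data):

```python
def pooled_coefficient(proportions, pattern_estimates):
    """Marginal coefficient as a proportion-weighted average of
    pattern-specific coefficients (second line of Eq. 9.49)."""
    assert abs(sum(proportions) - 1.0) < 1e-9
    return sum(p * b for p, b in zip(proportions, pattern_estimates))

# Hypothetical three-pattern example: completers, early, and late dropouts
beta1 = pooled_coefficient([0.60, 0.15, 0.25], [-1.20, -0.60, -0.90])
# → 0.60(-1.20) + 0.15(-0.60) + 0.25(-0.90) = -1.035
```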
Both β0(diff1) and β0(diff2) are estimable, because all three groups have baseline scores, and two measurements again suffice for estimating the slope difference for people who quit prior to the final follow-up (the β1(diff2) coefficient). However, the growth rate difference for the early dropouts (M2 = 1) is not estimable from a single observation. Given a suitable estimate of the standard deviation (e.g., an MAR-based estimate of the baseline standard deviation or the within-cluster residual standard deviation), Equation 9.31 (or Equation 9.33) can provide a value for the inestimable β1(diff1) coefficient. In this context,
dΔ can be viewed as the standardized mean difference that results from a one-unit incre-
ment in the temporal predictor (e.g., one additional month in the study). Returning to
the hypothetical depression scenario, suppose that the baseline standard deviation from
a preliminary analysis was σ̂Y = 6. Furthermore, suppose that the completer group’s
depression scores improve (decrease) by one-fifth of a standard deviation per month,
on average (i.e., β1(0) = –1.20). Specifying dΔ = +0.10 means that every additional month
in the study changes the early dropout group’s mean by a value that is one-tenth of a
standard deviation higher (more positive) than that of the completers. This specification
induces a pattern-specific growth rate that halves the complete-case change rate (i.e., β1(2) = β1(0) + dΔσ̂Y = –1.20 + (0.10 × 6) = –0.60).
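That worked example amounts to the following arithmetic, sketched here so the sensitivity parameter is easy to vary (the values mirror the hypothetical depression scenario in the text):

```python
def restricted_slope(beta1_completers, d_delta, sd_y):
    """Pattern-specific slope implied by an effect-size specification:
    beta1^(pattern) = beta1^(0) + d_delta * sd_y."""
    return beta1_completers + d_delta * sd_y

# Completers improve 1.20 points per month; dDelta = +0.10 with sd = 6
slope_early_dropouts = restricted_slope(-1.20, d_delta=0.10, sd_y=6.0)
# → -0.60, half the completer group's monthly change rate
```

Re-running the focal analysis over a grid of `d_delta` values (e.g., 0 to 0.20) is one way to organize a sensitivity analysis around this single assumption.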
So-­called identifying restrictions that replace inestimable parameters with esti-
mates from another pattern are an alternative to an effect-­size-based specification. Three
such restrictions—­the complete-­case, neighboring-­case, and available-­case identifying
restrictions—­have received considerable attention in the literature (Demirtas & Schafer,
2003; Molenberghs, Michiels, Kenward, & Diggle, 1998; Thijs, Molenberghs, Michiels,
Verbeke, & Curran, 2002; Verbeke & Molenberghs, 2000). As its name implies, the
complete-­case restriction sets any inestimable parameters equal to those of the com-
pleter group. Applied to the depression example, this restriction sets β1(diff1) equal to
0, such that participants who quit after the baseline assessment follow the same linear
trajectory as the people who complete the study (albeit with a different baseline aver-
age). The neighboring-­case restriction instead borrows a coefficient from the nearest
group for which an effect is estimable. Participants who quit prior to the final assess-
ment are the early dropout group’s nearest neighbors, so this strategy sets β1(diff1) equal
to β1(diff2). Finally, the available-­case restriction sets the inestimable coefficient equal
to the weighted average across all patterns for which an effect is estimable. I illustrate
some of these restrictions in the next section, and Demirtas and Schafer (2003) describe
a detailed sensitivity analysis for the same psychiatric trial data.
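The three identifying restrictions reduce to simple substitution rules. A schematic sketch (function and argument names are mine, and the slopes and proportions are hypothetical):

```python
def identified_slope(restriction, slopes, proportions):
    """Fill in an inestimable pattern slope. slopes/proportions are dicts
    keyed by the patterns where the slope IS estimable."""
    if restriction == "complete-case":
        return slopes["completers"]
    if restriction == "neighboring-case":
        return slopes["late_dropouts"]
    if restriction == "available-case":
        # weighted average over the patterns with an estimable effect
        total = sum(proportions.values())
        return sum(proportions[k] * slopes[k] for k in slopes) / total
    raise ValueError(restriction)

slopes = {"completers": -1.20, "late_dropouts": -0.90}  # hypothetical values
props = {"completers": 0.70, "late_dropouts": 0.20}
cc = identified_slope("complete-case", slopes, props)     # → -1.20
nc = identified_slope("neighboring-case", slopes, props)  # → -0.90
ac = identified_slope("available-case", slopes, props)    # → about -1.13
```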

9.13 LONGITUDINAL DATA ANALYSIS EXAMPLES

I use the psychiatric trial data on the companion website to illustrate a sensitivity
analysis for a longitudinal growth curve model with an outcome that could be MNAR.
The data, which were collected as part of the National Institute of Mental Health
Schizophrenia Collaborative Study, comprise four illness severity ratings, measured
in half-point increments ranging from 1 (normal, not at all ill) to 7 (among the most
extremely ill). In the original study, the 437 participants were assigned to one of four
experimental conditions (a placebo condition and three drug regimens), but the data
collapse these categories into a dichotomous treatment indicator (DRUG = 0 for the
placebo group, and DRUG = 1 for the combined medication group). The researchers

collected a baseline measure of illness severity prior to randomizing participants to conditions, and they obtained follow-up measurements 1 week, 3 weeks, and 6 weeks
later. The overall missing data rates for the repeated measurements were 1, 3, 14, and
23%, and these percentages differ by treatment condition; 19 and 35% of the placebo
group scores were missing at the 3-week and 6-week assessments, respectively, versus
13 and 19% for the medication group.
As explained previously, researchers usually code one of the measurement occa-
sions as 0 and set the others relative to that fixed point. I adopt the common practice
of coding time scores relative to the baseline assessment (i.e., WEEK = 0, 1, 3, 6). The
observed means follow a nonlinear trend with a pronounced reduction at the 1-week
follow-­up and more gradual changes after that. Following other published illustrations
with these data (Demirtas & Schafer, 2003; Hedeker & Gibbons, 1997), I linearized
the trajectory by modeling changes in severity as a function of the square root of weeks
since the baseline assessment. Taking the square root of the time scores creates a vari-
able SQRTWEEK that codes the measurement occasions as 0 = 0, 1 = 1, 3 = 1.73,
and 6 = 2.45. To illustrate the effect of this transformation, Figure 9.14 shows the ill-
ness severity means for the placebo and medication condition. The white squares and
circles reflect time in weeks, and the black squares and circles show the means on the
transformed time scale. As you can see, the total change for each group is the same in
both cases, but the transformation compresses elapsed time after the 1-week follow-­up.
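The SQRTWEEK codes can be verified in a couple of lines of Python:

```python
from math import sqrt

weeks = [0, 1, 3, 6]
sqrtweek = [round(sqrt(w), 2) for w in weeks]
# → [0.0, 1.0, 1.73, 2.45]
```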
The focal analysis is a linear growth curve model that expresses illness severity
ratings as a function of the temporal predictor, treatment assignment indicator, and the
group-by-time interaction.

$$\begin{aligned} \mathrm{SEVERITY}_{ti} = {} & (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})(\mathrm{SQRTWEEK}_{ti}) + \beta_2(\mathrm{DRUG}_i) \\ & + \beta_3(\mathrm{SQRTWEEK}_{ti})(\mathrm{DRUG}_i) + \varepsilon_{ti} \end{aligned} \qquad (9.50)$$

The β0 and β1 coefficients are the placebo group’s baseline average and linear change
rate, respectively, β2 is the medication group’s baseline difference, and β3 is the slope dif-
ference. By assumption, the individual intercepts and slopes are bivariate normal with a
covariance matrix Σb, and the within-­person residuals are normally distributed around
the individual trajectories with constant variance σε2.
I used maximum likelihood estimation for the examples, and the companion web-
site also provides scripts for Bayesian estimation and model-based multiple imputa-
tion. Longitudinal selection and shared parameter models can be difficult to estimate,
because the observed data often contain very little information about the missingness
model. Monitoring convergence is especially important in this context, as invalid solu-
tions are common. For the maximum likelihood analyses, I refit the model with several
sets of random starting values and examined the final log-­likelihood values to verify
that different runs achieved the same solution. For a Bayesian analysis, specifying mul-
tiple MCMC chains with different starting values provides analogous information along
with convergence diagnostics such as trace plots and the potential scale reduction factor
diagnostic (Gelman & Rubin, 1992). It is also important to check whether the MNAR
results are unduly influenced by unusual data records, and I use individual influence
diagnostics for this purpose (Sterba & Gottfredson, 2014).
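For the multiple-random-starts check described above, the bookkeeping is simple: collect the final log-likelihoods from each run and flag any that fall short of the best value (a sketch; the log-likelihoods and the tolerance are illustrative choices):

```python
def check_random_starts(logliks, tol=1e-3):
    """Return the best final log-likelihood and the indices of runs that
    apparently converged to a different (worse) solution."""
    best = max(logliks)
    suspect = [i for i, ll in enumerate(logliks) if best - ll > tol]
    return best, suspect

# Hypothetical final log-likelihoods from four runs with random starts
best, suspect = check_random_starts([-2598.56, -2598.56, -2612.31, -2598.56])
# → best = -2598.56, suspect = [2]
```

Agreement across runs does not prove the global maximum was found, but disagreement is a clear warning sign that the solution should not be trusted.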

FIGURE 9.14. Illness severity means for the placebo and medication condition. The black
squares and circles show the means on the transformed time scale, and the white squares and
circles reflect time in weeks. The total change for both groups is the same, but the transformation
compresses elapsed time after the 1-week follow-up.

Analysis 1: Diggle–Kenward Selection Model


Diggle and Kenward’s (1994) selection model addresses an outcome-dependent missing-
ness process where time-specific realizations of the dependent variable predict missing-
ness. Returning to Table 9.3, the first four rows follow a monotone missing data pattern,
and the remaining patterns (about 5.5% of the total sample) have intermittent miss-
ing values. Dropout (discrete-time survival) indicators are well suited for this predomi-
nantly monotone pattern. The group of individuals who quit after the baseline assess-
ment (about .7% of the total sample) is too small to form a dropout indicator for the
1-week assessment, so I use a pair of indicators that code attrition prior to the 3-week
and 6-week measurements. The binary variables had the following coding scheme: Indi-
viduals who quit prior to the 3-week follow-up were coded M = (M3, M6) = (1, missing),
participants who dropped out prior to the final measurement had indicator scores equal to
M = (0, 1), and all other patterns were coded M = (0, 0). This coding scheme effectively
views the small percentage of participants with intermittent missing values as being like
those who complete the study.
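That coding scheme is mechanical enough to express as a rule (a sketch; `dropout_indicators` is a hypothetical helper, and `observed` flags which of the four assessments have data):

```python
def dropout_indicators(observed):
    """(M3, M6) discrete-time dropout codes for the four-wave design.
    observed = [baseline, week1, week3, week6] availability flags;
    None marks the undefined second indicator after early dropout."""
    if not observed[2]:
        return (1, None)   # quit before the 3-week follow-up
    if not observed[3]:
        return (0, 1)      # quit before the 6-week follow-up
    return (0, 0)          # completers and intermittent patterns

dropout_indicators([True, True, False, False])  # → (1, None)
dropout_indicators([True, True, True, False])   # → (0, 1)
dropout_indicators([True, False, True, True])   # → (0, 0), intermittent
```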
Diggle and Kenward (1994) used logistic regression, but I again use probit missing-
ness models for consistency with earlier material. Following the path diagram in Figure

9.11, these regressions link the latent response variable at occasion t to the unseen score
values at that measurement and the observed data from prior occasions.

$$M^{*}_{3i} = \gamma_{02} + \gamma_1(\mathrm{SEVERITY}_{3i}) + \gamma_2(\mathrm{SEVERITY}_{1i}) + r_{3i} \qquad (9.51)$$
$$M^{*}_{6i} = \gamma_{03} + \gamma_1(\mathrm{SEVERITY}_{6i}) + \gamma_2(\mathrm{SEVERITY}_{3i}) + r_{6i}$$
$$r_{ti} \sim N_1(0, 1)$$

The regressions feature occasion-­specific intercepts that allow the conditional probabil-
ity of missing data to vary over time, but they place equality constraints on the concur-
rent and lagged effects (i.e., the γ1 and γ2 pairs share the same value). As noted earlier,
estimating time-­specific effects introduces complex outcome-­by-­occasion interactions
that may be difficult to estimate.
Equation 9.51 is consistent with a diffuse MNAR process where observed scores
from the prior occasion uniquely predict missingness above and beyond the unseen
score values (i.e., M depends on both Y(mis) and Y(obs)). The AIC and BIC indicated sub-
stantial support for this model relative to a focused one that omitted the lagged effect
(e.g., BIC = 5223.09 vs. 5242.88). I also considered a more diffuse process that included
treatment assignment as a predictor of missingness, but adding that regressor had a neg-
ligible impact on the estimates (the AIC and BIC offered modest support for the simpler
model).
Table 9.7 shows parameter estimates and standard errors from the Diggle–­Kenward
model along with those of an analysis that assumes a conditionally MAR process. The
MAR-based analysis featured an empty selection model as a device for equating the
metric of its log-­likelihood values (and thus AIC and BIC) to that of the MNAR analysis
(Muthén et al., 2011; Sterba & Gottfredson, 2014). The Diggle–­Kenward model pre-
dicted a substantially slower (less negative) growth rate for the placebo group (β̂1 = –0.15
vs. –0.35), and it also showed a larger (more negative) slope difference for the medica-
tion condition (β̂3 = –0.70 vs. –0.63). I judge these changes to be quite large, as the β1
coefficients differ by nearly three standard error units, and the β3 coefficients differ by
approximately one standard error. To further illustrate the analysis results, Figure 9.15
shows the average growth curves from the MAR analysis as solid lines, and it depicts the
Diggle–­Kenward trajectories as dashed lines. The figure highlights that the two analyses
made different predictions for both groups, although the direction of the effects was
consistent (as were significance tests of β̂1 and β̂3).
Looking at the fit information near the bottom of Table 9.7, both the AIC and BIC
favored the Diggle–Kenward model over the MAR analysis (ΔAIC = AICMAR – AICMNAR = 21.13 and ΔBIC = BICMAR – BICMNAR = 12.97). Conditional on the validity of the Diggle–Kenward model, this ΔBIC represents “very strong” evidence (Raftery, 1995) of MNAR
dropout. I further used the influence diagnostics from Equation 9.15 to identify indi-
vidual data records that unduly impact this conclusion (Sterba & Gottfredson, 2014). An
index plot like the one in Figure 9.6 revealed no such outliers, thus lending credence to
the conclusion that the Diggle–­Kenward model is plausible for these data.
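The ΔAIC and ΔBIC computations, with Raftery's (1995) conventional 2/6/10 cutpoints for grading ΔBIC, look like this in code; the inputs are the fit values from Table 9.7:

```python
def delta_ic(ic_mar, ic_mnar):
    """Positive values favor the MNAR model."""
    return round(ic_mar - ic_mnar, 2)

def raftery_grade(delta_bic):
    # Conventional labels: 0-2 weak, 2-6 positive, 6-10 strong, >10 very strong
    for bound, label in [(2, "weak"), (6, "positive"), (10, "strong")]:
        if delta_bic <= bound:
            return label
    return "very strong"

delta_aic = delta_ic(5197.24, 5176.11)   # → 21.13
delta_bic = delta_ic(5238.04, 5225.07)   # → 12.97
grade = raftery_grade(delta_bic)         # → "very strong"
```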

TABLE 9.7. Growth Curve Estimates from Diggle–Kenward and Shared Parameter Models
MAR Diggle–Kenward Shared parameter
Effect Est. SE Est. SE Est. SE
Intercept (β0) 5.35 0.09 5.28 0.09 5.36 0.09
SQRTWEEK (β1) –0.35 0.07 –0.15 0.07 –0.38 0.07
DRUG (β2) 0.05 0.10 0.07 0.10 0.04 0.10
SQRTWEEK × DRUG (β3) –0.63 0.08 –0.70 0.08 –0.62 0.08
Intercept variance (σb20) 0.36 0.06 0.30 0.06 0.35 0.06
Slope variance (σb21) 0.23 0.03 0.23 0.04 0.23 0.03

Missingness model
Intercept Week 3 –1.23 0.08 –2.58 0.56 –1.72 1.01
Intercept Week 6 –1.09 0.08 –2.16 0.49 –1.56 0.99
SEVERITYt — — 1.18 0.23 — —
SEVERITYt–1 — — –0.98 0.18 — —
Random intercepts (b0) — — — — 0.11 0.17
Random slopes (b1) — — — — –0.44 0.24
DRUG — — — — –0.70 0.21

Model fit
AIC (no. of parameters) 5197.24 (10) 5176.11 (12) 5189.31 (13)
BIC (no. of parameters) 5238.04 (10) 5225.07 (12) 5242.35 (13)

Analysis 2: Shared Parameter (Random Coefficient) Selection Model


The shared parameter model considers an MNAR process where the progression of ill-
ness severity rather than occasion-­specific realizations of the dependent variable deter-
mines dropout (e.g., participants who experienced a rapid decline quit, because they
achieved adequate relief, individuals with high stable trajectories quit to seek treatment
elsewhere). Following the path diagram in Figure 9.12, I started by fitting a model with
the random intercepts and slopes predicting the dropout indicators, and then added the
treatment assignment indicator as an additional regressor in the missingness model, as
follows:

$$M^{*}_{3i} = \gamma_{02} + \gamma_1 b_{0i} + \gamma_2 b_{1i} + \gamma_3(\mathrm{DRUG}_i) + r_{3i} \qquad (9.52)$$
$$M^{*}_{6i} = \gamma_{03} + \gamma_1 b_{0i} + \gamma_2 b_{1i} + \gamma_3(\mathrm{DRUG}_i) + r_{6i}$$
$$r_{ti} \sim N_1(0, 1)$$

The AIC and BIC supported the more diffuse model with treatment condition predict-
ing missingness (e.g., BIC = 5242.35 vs. 5248.74 for the simpler model with only random
effects predicting dropout).

FIGURE 9.15. Solid lines depict the average growth curves from the MAR analysis, and the
dashed lines are the Diggle–­Kenward trajectories. Both analyses show that the medication condi-
tion achieved greater gains than the placebo group, but the models make different predictions
about the means.

The rightmost columns in Table 9.7 show the parameter estimates and standard
errors. The estimates are effectively equivalent to those of the conditionally MAR model,
and the average growth trajectories are indistinguishable from the solid lines in Fig-
ure 9.15. The negative γ2 and γ3 coefficients suggest that individuals who experience
the steepest declines (i.e., lowest, or most negative random slopes) and control group participants are more likely to quit the study. Turning to the fit information near the bottom of the table, the AIC and BIC preferred the Diggle–Kenward model, but they disagreed about the shared parameter model and MAR analysis; the AIC favored the former
(ΔAIC = AICMAR – AICMNAR = 7.93), whereas the BIC selected the latter (ΔBIC = BICMAR – BICMNAR = –4.31). Discrepant information criteria are not a problem in this case, because
the analyses in question produced equivalent estimates. The individual influence diag-
nostics revealed no outliers that unduly affected these model comparisons.

Analysis 3: Random Coefficient Pattern Mixture Model


Hedeker and Gibbons (1997) describe a random coefficient pattern mixture model where
missing data patterns form qualitatively different subgroups with distinct growth tra-
jectories but common random effect parameters. I began by fitting a simple two-­pattern
version of their model that classifies participants as “completers” or “dropouts” based on

the presence or absence of data at the final measurement occasion. Returning to Table 9.3,
this setup collapses the four patterns with missing 6-week follow-up scores (Patterns 2, 3, 4, and 8) into a single group coded M = 1, and it combines participants with complete data and intermittent missing values (Patterns 1, 5, 6, 7, and 9) into a group coded M = 0.
This coding scheme is a good starting point, because there are no inestimable parameters.
The fitted growth curve model casts the missing data indicator as a dummy code
that moderates the influence of one or more explanatory variables on the outcome.

$$\begin{aligned} \mathrm{SEVERITY}_{ti} = {} & \beta_0^{(0)} + \beta_1^{(0)}(\mathrm{SQRTWEEK}_{ti}) + \beta_2^{(0)}(\mathrm{DRUG}_i) \\ & + \beta_3^{(0)}(\mathrm{SQRTWEEK}_{ti})(\mathrm{DRUG}_i) + \beta_0^{(\mathrm{diff})}(M_i) \\ & + \beta_1^{(\mathrm{diff})}(M_i)(\mathrm{SQRTWEEK}_{ti}) + \beta_2^{(\mathrm{diff})}(M_i)(\mathrm{DRUG}_i) \\ & + \beta_3^{(\mathrm{diff})}(M_i)(\mathrm{SQRTWEEK}_{ti})(\mathrm{DRUG}_i) + b_{0i} + b_{1i}(\mathrm{SQRTWEEK}_{ti}) + \varepsilon_{ti} \end{aligned} \qquad (9.53)$$

The β0(0) to β3(0) coefficients are the growth model parameters for the completer pattern
with M = 0, and β0(diff) to β3(diff) give the amount by which these coefficients differ in the
dropout group. The model resembles the path diagram in Figure 9.13 but has different
time scores (slope factor loadings). The overall population-­level parameters are again
weighted averages over the missing data patterns as follows:

$$\begin{aligned} \beta_0 &= \pi^{(0)} \beta_0^{(0)} + \pi^{(1)} \big( \beta_0^{(0)} + \beta_0^{(\mathrm{diff})} \big) = \pi^{(0)} \beta_0^{(0)} + \pi^{(1)} \beta_0^{(1)} \\ \beta_1 &= \pi^{(0)} \beta_1^{(0)} + \pi^{(1)} \big( \beta_1^{(0)} + \beta_1^{(\mathrm{diff})} \big) = \pi^{(0)} \beta_1^{(0)} + \pi^{(1)} \beta_1^{(1)} \\ \beta_2 &= \pi^{(0)} \beta_2^{(0)} + \pi^{(1)} \big( \beta_2^{(0)} + \beta_2^{(\mathrm{diff})} \big) = \pi^{(0)} \beta_2^{(0)} + \pi^{(1)} \beta_2^{(1)} \\ \beta_3 &= \pi^{(0)} \beta_3^{(0)} + \pi^{(1)} \big( \beta_3^{(0)} + \beta_3^{(\mathrm{diff})} \big) = \pi^{(0)} \beta_3^{(0)} + \pi^{(1)} \beta_3^{(1)} \end{aligned} \qquad (9.54)$$

Simultaneously estimating an empty probit or logit model for the missing data indicator
provides pattern proportions and standard errors for pooling, and creating and analyz-
ing model-based multiple imputations is an alternative to explicitly pooling over the
missing data patterns (Demirtas & Schafer, 2003).
Table 9.8 shows pattern-­specific and population-­level estimates for the Hedeker–­
Gibbons model, and it also shows the estimates from a corresponding MAR analysis that
fixed all pattern difference coefficients (β0(diff) to β3(diff)) equal to 0. Specifying an MAR
analysis in this framework gives the same population-­level estimates as a conventional
analysis that ignores missingness, but the AIC and BIC values are comparable to those
of the MNAR model. Looking at the pattern-­specific estimates, dropouts and completers
had very different growth trajectories; among the participants who quit, the placebo
group had a much higher baseline mean and a slower (less negative) growth rate, and
the treatment group had a much steeper decline in symptoms. The overall (marginal)
estimates differed somewhat from those of the MAR analysis, but the discrepancies were
not as large as those for the Diggle–­Kenward model (e.g., the treatment group growth
rate estimates differed by about three-­fourths of a standard error unit, and other differ-
ences were roughly equal to half a standard error). Figure 9.16 shows average growth

TABLE 9.8. Growth Curve Estimates from Two-Pattern Random Coefficient Pattern Mixture
MAR Hedeker–Gibbons Pattern-specific
Effect Est. SE Est. SE M=0 M=1
Intercept (β0) 5.35 0.09 5.30 0.09 5.22 5.55
SQRTWEEK (β1) –0.35 0.07 –0.34 0.07 –0.39 –0.17
DRUG (β2) 0.05 0.10 0.11 0.10 0.21 0.19
SQRTWEEK × DRUG (β3) –0.63 0.08 –0.69 0.08 –0.54 –1.18
Intercept variance (σb20) 0.36 0.06 0.35 0.06 — —
Slope variance (σb21) 0.23 0.03 0.22 0.03 — —

Model fit
AIC (no. of parameters) 5054.26 (9) 5036.43 (13) —
BIC (no. of parameters) 5090.98 (9) 5089.47 (13) —

FIGURE 9.16. Solid lines depict the average growth curves from the MAR analysis, and the
dashed lines are the Hedeker–­Gibbons pattern mixture model trajectories. Both analyses show
that the medication condition achieved greater gains than the placebo group, and the models
made similar predictions about the means.

curves from the MAR analysis as solid lines, and it depicts the Hedeker–­Gibbons model
trajectories as dashed lines.
Looking at the information criteria near the bottom of Table 9.8, both the AIC
and BIC selected the MNAR analysis (ΔAIC = AICMAR – AICMNAR = 17.83 and ΔBIC =
BICMAR – BICMNAR = 1.52), although ΔBIC’s evidence is “weak” according to Raftery’s
(1995) effect size guidelines. Individual influence diagnostics revealed 15 participants
with positive indices larger than ΔBIC. Removing any one of these data records from
the analysis could switch the sign of ΔBIC from positive to negative, thereby favoring
the MAR model. Interestingly, most of these cases had response profiles with very large
score reductions (e.g., a change from 7 to 2), and all were medication recipients who
dropped out. Presumably, these individuals are mostly responsible for the very large
negative slope for the M = 1 pattern in Table 9.8. These influence diagnostics should be
reported as part of a broader sensitivity analysis, but finding influential participants is
not a prescription for removing data records from the analysis (Sterba & Gottfredson,
2014).
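The fit comparisons reduce to simple arithmetic on the values in Table 9.8; the sketch below copies those values and notes Raftery's (1995) descriptive bins in a comment:

```python
# Fit values copied from Table 9.8 (MAR vs. Hedeker–Gibbons pattern mixture)
aic_mar, aic_mnar = 5054.26, 5036.43
bic_mar, bic_mnar = 5090.98, 5089.47

delta_aic = aic_mar - aic_mnar   # positive values favor the MNAR model
delta_bic = bic_mar - bic_mnar   # 1.51 from the rounded table values (the text's
                                 # 1.52 reflects unrounded fit statistics)

# Raftery's (1995) descriptive bins for a BIC difference:
# 0-2 "weak", 2-6 "positive", 6-10 "strong", >10 "very strong"
```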
Next, consider a more refined analysis that collapsed the missing data patterns in
Table 9.3 into three groups: early dropouts who quit prior to the 3-week follow-up (Patterns 2 and 3), later dropouts who left prior to the 6-week follow-up (Patterns 4 and 8), and a completer group that includes participants with intermittent missing values
(Patterns 1, 5, 6, 7, and 9). Figure 9.17 shows the observed means for each pattern broken
down by treatment group. The fitted model includes two dummy codes that indicate
early and late dropout (EDROP and LDROP, respectively):

SEVERITYti = β0(0) + β1(0)(SQRTWEEKti) + β2(0)(DRUGi)
             + β3(0)(SQRTWEEKti)(DRUGi)
             + β0(diff1)(EDROPi) + β1(diff1)(EDROPi)(SQRTWEEKti)
             + β2(diff1)(EDROPi)(DRUGi) + β3(diff1)(EDROPi)(SQRTWEEKti)(DRUGi)     (9.55)
             + β0(diff2)(LDROPi) + β1(diff2)(LDROPi)(SQRTWEEKti)
             + β2(diff2)(LDROPi)(DRUGi) + β3(diff2)(LDROPi)(SQRTWEEKti)(DRUGi)
             + b0i + b1i(SQRTWEEKti) + εti


The β0(0) to β3(0) coefficients represent the growth model parameters for the completers, β0(diff1) to β3(diff1) give the amount by which these coefficients differ among the early dropouts with EDROP = 1, and β0(diff2) to β3(diff2) are the corresponding coefficient differences for the later dropout group with LDROP = 1.
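The coding scheme behind Equation 9.55 can be sketched in a few lines. The data frame below is hypothetical (variable names mirror the equation; PATTERN is an invented bookkeeping column), but the dummy codes and product terms are constructed exactly as the model requires:

```python
import pandas as pd

# Hypothetical long-format records; variable names mirror Equation 9.55,
# and PATTERN is an invented bookkeeping column
df = pd.DataFrame({
    "SEVERITY": [5.1, 4.2, 6.0, 5.5],
    "SQRTWEEK": [0.0, 1.0, 0.0, 1.0],
    "DRUG":     [0, 0, 1, 1],
    "PATTERN":  ["complete", "early", "late", "complete"],
})

# Dummy codes with the completer group as the reference category
df["EDROP"] = (df["PATTERN"] == "early").astype(int)
df["LDROP"] = (df["PATTERN"] == "late").astype(int)

# Product terms that let each growth parameter shift by dropout pattern
for dummy in ("EDROP", "LDROP"):
    df[f"{dummy}_WEEK"] = df[dummy] * df["SQRTWEEK"]
    df[f"{dummy}_DRUG"] = df[dummy] * df["DRUG"]
    df[f"{dummy}_WEEK_DRUG"] = df[dummy] * df["SQRTWEEK"] * df["DRUG"]
```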
This model is estimable, because each group has at least two data points and all groups share common random effect parameters, but measurement error alone might give you pause about extrapolating a linear growth rate from just two observations. Instead, I use this opportunity to illustrate complete-case and neighboring-case identifying restrictions that define the early dropout group's trajectories as a function of other patterns. As its name implies, the complete-case restriction sets the placebo group's
Missing Not at Random Processes 397

[Figure 9.17 image: illness severity (y-axis, scale 1–7) plotted against the square root of weeks since baseline (x-axis, 0–3), with trajectories for the placebo and medication groups within each missing data pattern.]

FIGURE 9.17. Observed means for three missing data patterns broken down by treatment group. The early dropout group has two observations, the later dropout group has three observations, and the completers have four observations. Dashed lines are the placebo condition, and solid lines are the medication group.

growth rate and the medication group's slope difference equal to the corresponding parameters in the completer group (i.e., β1(diff1) = β3(diff1) = 0). Figure 9.17 suggests that this strategy isn't as unreasonable as it might sound, as the two groups show similar changes between the baseline and 1-week follow-up (albeit with different baseline averages). The neighboring-case restriction instead sets the early dropout group's growth parameters equal to those of the later dropouts (i.e., β1(diff1) = β1(diff2) and β3(diff1) = β3(diff2)). Although I don't illustrate the procedure here, the available-case restriction would fix the early dropout pattern's growth rate parameters equal to a weighted average across the other patterns.
Table 9.9 shows the parameter estimates and standard errors from the two models. The corresponding MAR analysis produced the same estimates as the two-pattern model in Table 9.8, albeit with different fit values (because the empty model for the indicators differs). The complete-case restriction produced estimates that were like those of the MAR analysis; the placebo group average and baseline mean difference coefficients changed by about half a standard error unit, but the growth rate parameters were effectively equivalent. The neighboring-case restriction estimates were effectively identical to those of the two-pattern model in Table 9.8. This isn't necessarily surprising given that the equality constraints are a more elaborate way of combining the two dropout

TABLE 9.9. Growth Curve Estimates from Three-Pattern Random Coefficient Pattern Mixture Models

                             Complete-case     Neighboring-case    Effect-size-based
Effect                       Est.     SE       Est.     SE         Est.     SE
Intercept (β0)               5.30     0.09     5.30     0.09       5.29     0.09
SQRTWEEK (β1)               –0.35     0.07    –0.34     0.07      –0.34     0.07
DRUG (β2)                    0.10     0.10     0.11     0.10       0.11     0.10
SQRTWEEK × DRUG (β3)        –0.65     0.08    –0.69     0.08      –0.70     0.08
Intercept variance (σ²b0)    0.34     0.06     0.35     0.06       0.35     0.06
Slope variance (σ²b1)        0.22     0.03     0.22     0.03       0.22     0.03

Model fit
AIC (no. of parameters)    5182.55 (16)      5181.63 (16)        5181.65 (16)
BIC (no. of parameters)    5247.83 (16)      5246.91 (16)        5246.93 (16)

patterns. The AIC and BIC disagreed, as the former favored the MNAR analyses (ΔAIC values were positive), and the latter supported the MAR analysis (ΔBIC values were negative). These discrepancies don't pose much of a dilemma given the relative stability of the estimates.
As a final example, I used the effect-size-based strategy from Section 9.7 to specify the early dropout group's growth rate parameters. This procedure allows you to examine the stability of the results across a wide range of plausible parameter values, but I used the method to induce a more extreme MNAR process than the one implied by the neighboring-case restriction. Returning to Figure 9.17, the two dropout groups exhibited relatively similar change rates during the first week of the study, but they could have diverged after the 1-week follow-up. To examine this possibility, I selected growth rate parameters for the early dropout pattern that induced a flatter (more positive) trajectory among placebo group participants and an even steeper decline for people who received medication. Using the residual standard deviation and a standardized effect size of dΔ = +0.10, I specified the placebo group's relative growth rate as dΔ × σ̂Y = 0.10 × 0.77 = 0.077. In this context, setting β1(diff1) equal to β1(diff2) + 0.077 implies that, relative to the later dropouts, placebo group participants who quit the study early follow a trajectory that increases the mean by one-tenth of a standard deviation more during the first week of the study (or about one-fourth of a standard deviation over the duration of the study). Setting β3(diff1) equal to β3(diff2) – 0.077 induces a comparable relative decline for early dropouts in the treatment group.
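In code, the offset arithmetic is trivial; the residual standard deviation and effect size come from the text, while the two β(diff2) values below are hypothetical placeholders for the estimable later-dropout parameters:

```python
# Effect-size-based offset for the early dropout pattern (values from the text)
resid_sd = 0.77          # residual standard deviation of illness severity
d_delta = 0.10           # chosen standardized effect size

offset = d_delta * resid_sd   # relative growth rate per unit of SQRTWEEK

# Hypothetical later-dropout coefficients; the constraints mirror the text
beta1_diff2, beta3_diff2 = -0.05, -0.49
beta1_diff1 = beta1_diff2 + offset   # flatter placebo trajectory
beta3_diff1 = beta3_diff2 - offset   # steeper medication decline
```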
The rightmost panel in Table 9.9 shows the parameter estimates from the analysis. As expected, the analysis produced a flatter (less negative) placebo group growth rate and a steeper (more negative) trajectory for the medication condition. The effect-size-based identifying restriction can be viewed as a worst-case scenario among the pattern mixture models, because it extrapolates the group means in a way that exaggerates the MNAR process.

Summary
The real data examples comprise a sensitivity analysis that examined the stability of the growth model parameters across four different missingness processes. The Diggle–Kenward selection model and the Hedeker–Gibbons pattern mixture model with an effect-size-based identifying restriction produced nontrivial differences in some key parameters. Both analyses suggested a flatter (less negative) trajectory for the placebo group and a steeper decline for the medication condition, with changes to the parameters as large as one standard error unit in some cases. Moreover, the ΔBIC offered "strong" evidence favoring these models (with all the usual caveats about such model comparisons), and individual influence diagnostics identified a subgroup of medication recipients who left the study after experiencing very large score reductions (e.g., a change from 7 to 2).
Considered as a whole, the sensitivity analysis results suggest that an MNAR process is quite plausible for these data. As is often the case, fitting different models induced relatively large changes to some key parameters, albeit with no changes to their statistical significance tests. Although such differences might seem troubling, the results simply reflect different, plausible assumptions about the missing data. To reiterate, there is no way of knowing whether the MNAR analyses are better than a simpler analysis that assumes a conditionally MAR mechanism. Both sets of results are defensible and could (and should) be presented in a research report. Chapter 11 offers recommendations for reporting sensitivity analysis results.

9.14 SUMMARY AND RECOMMENDED READINGS

Analyses that assume a conditionally MAR process have been our go-tos throughout the book. This mechanism stipulates that unseen score values carry no unique information about missingness beyond that contained in the observed data. This assumption is convenient, because there is no need to specify and estimate a model for the missing data process. Although the MAR assumption is quite reasonable for a broad range of applications, an MNAR process where the unseen score values carry unique information about missingness may also be plausible in some settings. This chapter has outlined two major modeling frameworks for such processes: selection models and pattern mixture models. Both approaches introduce an additional model that describes the occurrence of missing data, but they do so in very different ways: A typical selection model features a regression equation with a missing data indicator as a dependent variable, whereas a pattern mixture model uses the indicator as a moderator variable. I have described both approaches in the context of regression models and longitudinal growth models.
Fitting selection models requires a researcher to proactively search for a model that has support from the data while looking for subtle clues that signal a misspecification or identification problem. In contrast, pattern mixture models require the researcher to specify values for one or more inestimable parameters (or impose comparable constraints). Although their implementation details are very different, both modeling frameworks require strict, untestable assumptions, and model misspecifications could

produce estimates that contain more bias than those from an MAR analysis. Accordingly, the literature often recommends sensitivity analyses that examine the stability of one's substantive conclusions across different assumptions, and the analysis examples demonstrated that process. The examples highlighted that invoking different assumptions about missingness can have relatively large impacts on key model parameters. Although such differences might seem troubling, they simply reflect different, plausible assumptions about the missing data. Viewed through that lens, discrepant results are still defensible and should be presented in a research report. Finally, I recommend the following articles for readers who want additional details on topics from this chapter.

Diggle, P., & Kenward, M. G. (1994). Informative drop-out in longitudinal data analysis. Journal of the Royal Statistical Society C: Applied Statistics, 43, 49–93.

Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychological Methods, 16, 1–16.

Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64–78.

Kenward, M. G. (1998). Selection models for repeated measurements with non-random dropout: An illustration of sensitivity. Statistics in Medicine, 17, 2723–2732.

Little, R. (2009). Selection and pattern-mixture models. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 409–431). Boca Raton, FL: Chapman & Hall.

Muthén, B., Asparouhov, T., Hunter, A. M., & Leuchter, A. F. (2011). Growth modeling with nonignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16, 17–33.

Sterba, S. K., & Gottfredson, N. C. (2014). Diagnosing global case influence on MAR versus MNAR model comparisons. Structural Equation Modeling: A Multidisciplinary Journal, 22, 294–307.
10

Special Topics and Applications

10.1 CHAPTER OVERVIEW

This chapter uses a series of data analysis examples to illustrate a collection of odds and ends that include specialized topics, advanced applications, and practical issues. Earlier data analysis examples demonstrated that given the same data and similar assumptions, maximum likelihood, Bayesian estimation, and multiple imputation generally produce the same numerical results. Several examples in this section highlight use cases that differentiate the three methods or favor one approach over another. Analysis scripts for all examples are available on the companion website.

10.2 DESCRIPTIVE SUMMARIES, CORRELATIONS, AND SUBGROUPS

Nearly every data analysis project begins with descriptive summaries of sample demographics and key study variables. Maximum likelihood and Bayesian analyses are not well suited for bread-and-butter descriptive summaries (e.g., cross-tabulation tables for categorical variables, means and standard deviations of continuous variables), because they are designed around one specific analysis model. In contrast, agnostic multiple imputation procedures such as the joint model and fully conditional specification are ideal for this task, because they apply a flexible model that can preserve associations among a diverse collection of variables with different metrics.
The literature offers surprisingly little guidance on applying multiple imputation to basic descriptive quantities, cross-tabulation tables, and the like. A quick scan of online question-and-answer communities suggests that there is disagreement about the use of imputation for generating descriptive summaries, with numerous authors stating they are meaningless or invalid. These objections often stem from the legitimate concern that estimands such as standard deviations and percentages may not have normal distributions, and some people also argue that Rubin's (1987) pooling rules should be reserved for inferential analyses. Yet others suggest that descriptive summaries of background and demographic variables should be based on the observed data. I take the view that you can and should use multiple imputation for descriptive analyses, because doing so provides a logical consistency across analyses within a given project. For example, a cross-tabulation table of imputed demographic variables describes respondent characteristics most likely associated with the main analysis results.
I use the chronic pain data on the companion website to illustrate multiple imputation for descriptive summaries and correlations. The data include psychological correlates of pain severity from a sample of N = 275 individuals with chronic pain (e.g., depression, pain interference with daily life, perceived control). The illustration piggybacks on earlier moderated regression examples where the influence of depression on psychosocial disability (a construct capturing pain's impact on emotional behaviors such as psychological autonomy and communication, emotional stability, etc.) differed by gender. A research paper with this focal analysis would likely report descriptive statistics and correlations by gender, and this example shows how to obtain such summaries from multiple imputation.

Imputation Models
While it is ideally suited for the moderated regression, model-based imputation is narrow in scope and tailors imputations around that one analysis. Reporting descriptive statistics and correlations for males and females requires an imputation scheme capable of preserving several interaction effects at once (e.g., a correlation that differs by gender implies a two-way interaction). Model-based imputation accommodates more than one interaction effect, but using product terms to preserve gender-specific correlations is cumbersome. A simple alternative is to impute the male and female data separately. This multiple-group imputation strategy (Enders & Gottschall, 2011; Graham, 2009) generates imputations that preserve all possible mean differences and two-way interactions with gender. The imputation phase can employ either the joint model or fully conditional specification, and there is generally no reason to prefer one to the other. The main requirements are that the grouping variable must be complete and group sizes must be large enough to support imputation (in the limit, the number of variables can't exceed the number of observations).
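A minimal sketch of the multiple-group idea, using scikit-learn's chained-equations imputer as a stand-in for the latent-variable fully conditional specification described in this chapter (the data, variable names, and missingness pattern are fabricated for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)

# Hypothetical data: a complete grouping variable and two numerical outcomes
df = pd.DataFrame({
    "MALE": rng.integers(0, 2, 200),
    "DEPRESS": rng.normal(15, 6, 200),
    "DISABILITY": rng.normal(22, 5, 200),
})
df.loc[rng.choice(200, 30, replace=False), "DEPRESS"] = np.nan

# Multiple-group imputation: fill in each gender's data separately so every
# mean and association is free to differ by group
imputed_parts = []
for g, block in df.groupby("MALE"):
    imp = IterativeImputer(sample_posterior=True, random_state=int(g))
    filled = block.copy()
    filled[["DEPRESS", "DISABILITY"]] = imp.fit_transform(
        block[["DEPRESS", "DISABILITY"]]
    )
    imputed_parts.append(filled)
imputed = pd.concat(imputed_parts).sort_index()
```

Repeating the loop M times (with different random states) would yield the multiple data sets needed for pooling.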
To illustrate the procedure, I applied fully conditional specification with latent variables to data sets that comprised n = 166 women and n = 109 men. Each group's imputation model includes four complete variables with mixed response types (age, stress, perceived control over pain, educational attainment categories), six incomplete numerical variables (work hours per week, exercise frequency, anxiety, pain interference with daily life, depression, psychosocial disability), and one incomplete nominal variable (a three-category pain severity rating). The stress and exercise frequency variables are 7- and 8-point ordinal scales, respectively. After stratifying by gender, exercise frequency had relatively few responses in some of the higher bins. The sparse data would not support group-specific threshold estimates, so I simplified imputation by treating this variable as continuous. Other options include combining categories into a smaller number of ordered bins or treating the categories as nominal groups (the multinomial probit model is often easier to estimate, because it doesn't require threshold parameters). I also treated the complete stress rating scale as continuous, because this variable has a sufficient number of scale points and is relatively symmetric (Rhemtulla et al., 2012).
Fully conditional specification imputes variables one at a time by stringing together
a series of regression models, one per incomplete variable. This example requires seven
regressions per group, six of which feature a continuous target variable. To illustrate, the
anxiety scale’s imputation model is
ANXIETYi(g) = γ0(g) + γ1(g)(INTERFEREi(g)) + γ2(g)(DEPRESSi(g))
              + γ3(g)(DISABILITYi(g)) + γ4(g)(WORKHRSi(g)) + γ5(g)(EXERCISEi(g))
              + γ6(g)(MODERATE*i(g)) + γ7(g)(SEVERE*i(g)) + γ8(g)(AGEi(g))          (10.1)
              + γ9(g)(STRESSi(g)) + γ10(g)(CONTROLi(g)) + γ11(g)(COLLEGEi(g))
              + γ12(g)(POSTBAi(g)) + ri(g)

where the g superscript indicates that each group has unique parameter values. Notice
that the incomplete nominal pain ratings appear as a pair of latent response difference
scores on the right side of the equation (MODERATE * and SEVERE *), whereas a pair of
dummy codes represent the educational attainment groups (COLLEGE and POSTBA);
the complete variables function like known constants, because I did not assign them a
distribution.
As a second example, the imputation model for the nominal pain rating requires the
multivariate regression of the latent difference scores on the remaining variables. The
probit regression model is
[ MODERATE*i(g) ]
[ SEVERE*i(g)   ] = γ0(g) + γ1(g)(INTERFEREi(g)) + γ2(g)(DEPRESSi(g))
              + γ3(g)(DISABILITYi(g)) + γ4(g)(WORKHRSi(g))
              + γ5(g)(EXERCISEi(g)) + γ6(g)(ANXIETYi(g)) + γ7(g)(AGEi(g))           (10.2)
              + γ8(g)(STRESSi(g)) + γ9(g)(CONTROLi(g)) + γ10(g)(COLLEGEi(g))
              + γ11(g)(POSTBAi(g)) + ri(g)

where each vector γ includes one coefficient per difference score, and r contains a pair of correlated residuals. Aside from stratifying the sample by gender and imputing within each group, all other aspects of imputation are the same as in Chapter 7.

Descriptive and Summary Statistics


Consistent with earlier multiple imputation examples, I created M = 100 data sets, as this value is likely large enough to maximize power and minimize the impact of Monte Carlo error (Bodner, 2008; Graham et al., 2007; Harel, 2007; von Hippel, 2020). Prior to creating imputations, I performed an exploratory analysis and used trace plots and potential scale reduction factor diagnostics (Gelman & Rubin, 1992) to evaluate convergence. Based on this diagnostic run, I specified 100 parallel imputation chains with 2,000 iterations each, and I saved a data set at the final iteration of each chain. After applying fully conditional specification to each group, I generated basic summaries that you might see in a published manuscript, including descriptive statistics and correlations by gender and cross-tabulation tables. As mentioned previously, these simple estimands seem to pose the most ambiguity for applying Rubin's (1987) pooling rules.
To begin, consider the categorical pain rating variable, which had about a 7% missing data rate in both groups. Summarizing the chronic pain ratings for men and women is an important preliminary step, because this variable is a defining feature of the target population. However, Rubin's pooling rule assumes that estimands follow a normal distribution. Row or column percentages from a cross-tabulation table probably do not satisfy this requirement, but averaging the percentages is nevertheless a viable strategy (U.S. Census Bureau, 2019). To illustrate, Table 10.1 shows cross-tabulation tables of educational attainment and chronic pain ratings by gender. Educational attainment is complete, but I include it here to illustrate what a table with multiple categorical variables might look like. I computed the cell sizes by multiplying the pooled column percentages by the male and female sample sizes (n = 109 and 166, respectively). Although reporting fractional cell sizes might seem odd, they are routine in analyses with latent categorical variables (e.g., a latent class analysis where class sizes are computed by multiplying group probabilities by the sample size). I would argue that this application isn't so different and that fractional cell sizes emphasize the uncertainty in the descriptive summaries.
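The fractional cell sizes come from averaging percentages and rescaling. In the sketch below, the per-imputation percentages are hypothetical values chosen so that their average matches the female column of Table 10.1:

```python
import numpy as np

# Hypothetical column percentages for the female chronic pain ratings across
# M = 5 imputed data sets (columns: no/little, moderate, severe)
pct = np.array([
    [23.5, 54.0, 22.5],
    [22.9, 54.8, 22.3],
    [23.0, 54.5, 22.5],
    [23.4, 54.1, 22.5],
    [23.2, 54.6, 22.2],
])

pooled_pct = pct.mean(axis=0)         # average the M sets of percentages
n_female = 166
cell_n = pooled_pct / 100 * n_female  # fractional cell sizes, as in Table 10.1
```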
Taking the cross-tabulation table one step further, suppose it is of interest to determine whether the pain distributions differ by gender. Software packages routinely augment contingency tables with Pearson chi-square and likelihood ratio chi-square statistics, among others. Although the inputs for computing Wald or likelihood ratio tests

TABLE 10.1. Cross-Tabulation Tables from Imputed Data Sets

                           Female              Male
Variable                   n        %          n        %
Educational attainment
  Less than BA             79       47.6       51       46.8
  College degree           55       33.1       36       33.0
  Post-Bachelor's          32       19.3       22       20.2
Chronic pain rating
  No to little pain        38.5     23.2       19.0     17.4
  Moderate pain            90.3     54.4       39.5     36.2
  Severe pain              37.2     22.4       50.5     46.3

probably aren’t available, the TD2 (or D2) statistic from Section 7.12 (Li, Meng, et al.,
1991) provides a straightforward tool for pooling virtually any chi-­square statistic. Con-
sistent with the usual test of independence, the significant test statistic indicates that
males and females have different pain rating distributions, TD2 = χ2(2) = 16.48, p < .001.
Li, Raghunathan, et al. (1991) also describe an F distribution for the test statistic, but
the denominator degrees of freedom for this example was so large (dfD2 = 30,678.57) that
the chi-­square and F versions are effectively identical tests. I use a chi-­square reference
distribution, because this is the norm for complete-­data contingency tables.
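A sketch of the D2 pooling procedure, following the form of the Li, Meng, et al. (1991) statistic as commonly presented; the five chi-square values are hypothetical, and the function should be checked against dedicated software before serious use:

```python
import numpy as np
from scipy import stats

def pool_chi_square(d, k):
    """D2 pooling of M chi-square statistics with k degrees of freedom
    (Li, Meng, Raghunathan, & Rubin, 1991). Transcribed from standard
    accounts of the procedure; a sketch, not validated software."""
    d = np.asarray(d, dtype=float)
    M = len(d)
    # relative increase in variance, computed from the square-root statistics
    r = (1 + 1 / M) * np.var(np.sqrt(d), ddof=1)
    D2 = (d.mean() / k - ((M + 1) / (M - 1)) * r) / (1 + r)
    df2 = k ** (-3 / M) * (M - 1) * (1 + 1 / r) ** 2  # denominator df
    return D2, df2, stats.f.sf(D2, k, df2)

# Hypothetical chi-square statistics from M = 5 imputed contingency tables
D2, df2, p = pool_chi_square([15.9, 16.8, 17.1, 16.2, 16.5], k=2)
td2 = 2 * D2   # chi-square-referenced version, TD2 = k * D2
```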
Table 10.2 gives means, standard deviations, and correlations by gender. Rubin's rule is absolutely appropriate for means, but there is not universal agreement about standard deviations; some authors recommend transforming standard deviations prior to

TABLE 10.2. Descriptive Statistics and Correlations by Gender from Imputed Data

Variable        1.     2.     3.     4.     5.     6.     7.     8.     9.
FEMALES (n = 166)
1. AGE 1.00
2. WORKHRS .14 1.00
3. EXERCISE .09 –.15 1.00
4. ANXIETY –.09 .10 .00 1.00
5. STRESS –.11 –.02 .04 .70 1.00
6. CONTROL .24 .05 .21 –.31 –.35 1.00
7. INTERFERE –.05 –.06 –.39 .21 .24 –.40 1.00
8. DEPRESS –.16 –.10 –.18 .54 .49 –.35 .36 1.00
9. DISABILITY –.11 –.13 –.11 .23 .26 –.35 .25 .51 1.00
Means 43.67 30.55 2.96 11.34 3.77 21.01 27.00 14.33 21.92
SD 10.58 18.58 1.82 4.43 1.82 5.11 8.77 5.83 4.76

MALES (n = 109)
1. AGE 1.00
2. WORKHRS –.24 1.00
3. EXERCISE .05 –.13 1.00
4. ANXIETY –.19 .20 –.18 1.00
5. STRESS –.16 .23 –.10 .69 1.00
6. CONTROL .25 –.12 .17 –.28 –.11 1.00
7. INTERFERE .06 .12 –.23 .25 .12 –.44 1.00
8. DEPRESS –.26 –.01 –.33 .54 .51 –.35 .25 1.00
9. DISABILITY .00 –.02 .05 .31 .24 –.28 .37 .29 1.00
Means 48.64 34.59 2.50 12.16 4.10 20.39 27.96 15.32 21.83
SD 11.67 19.10 1.43 4.90 1.76 5.46 8.64 6.74 4.52

Note. Bold typeface denotes significant at p < .05.



pooling (White, Royston, & Wood, 2011, p. 389), and others suggest a normal approximation is appropriate (Marshall, Altman, Holder, & Royston, 2009; van Buuren, 2012, p. 155). Limited computer simulation evidence suggests that pooling standard deviations without transformation works just fine unless the sample size is very small and the missing data rate is very large (e.g., less than N = 50 and 30% missing data), in which case pooling after an inverse transformation is preferable (Hubbard & Enders, 2022). Although the transformation makes no difference here, the pooling equations for an inverse transformation are
ϑ̂ = (1/M) Σ_{m=1}^{M} (1/θ̂m)                      (10.3)
θ̂ = 1/ϑ̂
where θ̂m is the untransformed estimate from data set m, ϑ̂ is the average inverse estimate
(e.g., the average reciprocal of the standard deviation), and θ̂ is the back-­transformed
point estimate.
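Equation 10.3 amounts to a harmonic-mean-style pooling of the M estimates; a minimal sketch:

```python
import numpy as np

def pool_inverse(estimates):
    """Pool estimates (e.g., standard deviations) after an inverse
    transformation, per Equation 10.3: average the reciprocals, then
    back-transform."""
    inv_mean = np.mean(1.0 / np.asarray(estimates, dtype=float))
    return 1.0 / inv_mean

# Hypothetical standard deviation estimates from M = 3 imputed data sets
pooled_sd = pool_inverse([4.40, 4.45, 4.41])
```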
There is widespread agreement that Rubin's rule is not appropriate for a correlation, because the sampling distribution of r becomes increasingly skewed as the population correlation moves away from zero. Here, again, simulation results suggest that transformation only helps at small sample sizes (Hubbard & Enders, 2022), but applying Fisher's (1915) r-to-z transformation facilitates significance testing. Schafer (1997) and others recommend the following procedure: (1) Apply the r-to-z transformation to each of the M correlation estimates, (2) average the transformed estimates, then (3) back-transform the pooled z-statistic to the correlation metric. The expression for the pooled z-statistic is

z̄ = (1/M) Σ_{m=1}^{M} (1/2) ln[(1 + rm)/(1 − rm)]                      (10.4)

where the collection of terms to the right of the summation is the r-to-z transformation.
The back-transformation to the correlation metric is as follows:

r = [exp(2z̄) − 1]/[exp(2z̄) + 1]                      (10.5)
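Equations 10.4 and 10.5 are the hyperbolic arctangent and tangent, so the three-step recipe collapses to two lines (the correlations below are hypothetical):

```python
import numpy as np

def pool_correlation(rs):
    """Pool correlations via Fisher's r-to-z transform (Equations 10.4
    and 10.5): transform, average, back-transform."""
    z = np.mean(np.arctanh(rs))   # Eq. 10.4: 0.5 * ln((1 + r)/(1 - r)), averaged
    return np.tanh(z)             # Eq. 10.5: (exp(2z) - 1)/(exp(2z) + 1)

pooled_r = pool_correlation([0.52, 0.55, 0.50])
```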
Between-group mean comparisons would likely be standard fare in a project concerned with gender differences. Casting independent-samples t-tests as simple regression models is useful when using general-use statistical software to analyze multiply imputed data sets, because most programs offer this functionality. The analysis model features background or outcome variables regressed on a gender dummy code. For example, the following simple regression gives the depression group comparison:

DEPRESSi = β0 + β1 ( MALEi ) + ε i (10.6)

where β0 is the female average, and β1 is the mean difference for males. Table 10.3 gives

TABLE 10.3. Multiple Imputation Gender Means and t-Tests

Variable      MFemale    MMale      t       df        p       FMI
AGE 43.67 48.64 3.65 271.02 < .001 .01
WORKHRS 30.55 34.59 1.62 223.37 0.16 .15
EXERCISE 2.96 2.50 –2.20 266.42   .02 .02
ANXIETY 11.34 12.16 1.42 262.42   .03 .04
STRESS 3.77 4.10 1.49 271.02 < .001 .01
CONTROL 21.01 20.39 –0.97 271.02 < .001 .01
INTERFERE 27.00 27.96 0.87 251.39   .07 .07
DEPRESS 14.33 15.32 1.23 243.51   .09 .09
DISABILITY 21.92 21.83 –0.14 252.23   .06 .07

Note. FMI, fraction of missing information.

the group means, t-statistics with the Barnard and Rubin (1999) degrees of freedom
adjustment, and fractions of missing information (i.e., the proportion of the squared
standard errors due to missing data). Unlike the classic expression from Rubin (1987,
Eq. 3.1.6), the adjusted degrees of freedom values never exceed the sample size and
decrease as the fractions of missing data increase.
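The pooling behind each row of Table 10.3 can be sketched as follows. The formulas for Rubin's (1987) rules and the Barnard–Rubin (1999) degrees of freedom are transcribed from standard accounts, and the estimates, standard errors, and complete-data df below are hypothetical:

```python
import numpy as np

def rubin_pool(q, se, nu_com=273):
    """Rubin's (1987) rules with the Barnard-Rubin (1999) degrees of
    freedom. A sketch: assumes nonzero between-imputation variance, and
    nu_com is a hypothetical complete-data df (e.g., N - parameters)."""
    q, se = np.asarray(q, float), np.asarray(se, float)
    M = len(q)
    qbar = q.mean()                    # pooled point estimate
    W = np.mean(se ** 2)               # within-imputation variance
    B = np.var(q, ddof=1)              # between-imputation variance
    T = W + (1 + 1 / M) * B            # total sampling variance
    lam = (1 + 1 / M) * B / T          # proportion of variance due to missing data
    nu_rubin = (M - 1) / lam ** 2      # classic (1987) df; can exceed N
    nu_obs = (nu_com + 1) / (nu_com + 3) * nu_com * (1 - lam)
    nu_br = 1 / (1 / nu_rubin + 1 / nu_obs)  # adjusted df; never exceeds nu_com
    return qbar, np.sqrt(T), nu_br

# Hypothetical mean-difference estimates and standard errors from M = 3 data sets
qbar, se_pooled, df_br = rubin_pool([1.0, 1.2, 0.9], [0.40, 0.42, 0.41])
```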
Finally, I previously noted that multiple-group imputation generates imputations that preserve all possible two-way interactions with gender, because it allows every pair of means and correlations to differ for men and women. At least in this example, where the moderator variable is complete, multiple-group imputation is a flexible alternative to model-based multiple imputation. Fitting the moderated regression model (see Equation 7.26) to the imputed data sets gave similar (but not identical) estimates to the model-based imputation procedure described in Section 7.11. These minor differences likely owe to the fact that the multiple-group procedure is more complex and estimates far more parameters.

10.3 NON‑NORMAL PREDICTOR VARIABLES

Most methods in this book leverage the normal distribution in important ways; Bayesian estimation and multiple imputation make this dependence explicit by sampling imputations from normal curves, and maximum likelihood estimation similarly intuits the location of missing values by assuming they are normal. Of course, the normal distribution is often a rough approximation for real data where variables are asymmetrical and/or kurtotic. Using the normal curve for missing data handling is fine in many situations, but misspecifications can introduce bias if the data diverge too much from this ideal (some estimands are more robust than others). Not surprisingly, the impact of applying a normal curve to non-normal data depends on the missingness rate, as misspecifications are unlikely to cause problems if the non-normal variable has relatively few missing data points.

Bayesian estimation and multiple imputation are particularly useful for evaluating the impact of non-normality, because they produce estimates of the missing values. Graphing imputations next to the observed data can provide a window into an estimator's inner machinery, as severe misspecifications can produce large numbers of out-of-range or implausible values (e.g., negative imputes for a strictly positive variable). Maximum likelihood estimation is a bit more of a black box in this regard, because it does the same thing—intuits that missing values extend to a range of implausible score values—without producing explicit evidence of its assumptions.
When applying maximum likelihood estimation, researchers routinely use corrective procedures such as robust (sandwich estimator) standard errors and the bootstrap to counteract the influence of non-normal data (see Chapters 2 and 3). This section focuses on data transformations as an alternative strategy for treating non-normal missing values. In particular, I focus on the Yeo–Johnson power transformation (Yeo & Johnson, 2000), because it subsumes a broad range of transformations used in applied practice (e.g., logarithmic, inverse, Box–Cox). The procedure effectively estimates the shape of the data as MCMC iterates, such that the distribution of missing values matches the observed-data distribution. This approach has shown promise when paired with a factored regression specification (Lüdtke et al., 2020b). The interpretation of the regression model parameters depends on whether the non-normal variable is a regressor or outcome, so I illustrate each situation separately.
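The transformation itself is easy to experiment with. SciPy's utility estimates the Yeo–Johnson power parameter by maximum likelihood on the observed scores, which differs from the MCMC-based estimation described above but uses the same transformation family (the skewed variable below is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated positively skewed predictor, standing in for age at first use
x = 10 + rng.gamma(shape=2.0, scale=3.0, size=500)

# SciPy picks the Yeo-Johnson power parameter by maximizing the normal
# likelihood of the transformed scores; returns the transformed variable
# and the estimated power parameter
x_t, lam = stats.yeojohnson(x)
```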

Non‑Normal Regressor Variable


To begin, I use the substance use data on the companion website to illustrate normalizing transformations for a skewed predictor. The data set includes a subset of N = 1,500 respondents from a national survey on substance use patterns and health behaviors. The focal analysis is a logistic regression where age at first alcohol use, educational attainment, gender, and age predict a binary measure of drinking frequency that classifies respondents as intermittent or steady drinkers. The logistic regression model is

ln[Pr(DRINKERi = 1) / (1 − Pr(DRINKERi = 1))] = β0 + β1(AGETRYALCi) + β2(COLLEGEi) + β3(AGEi) + β4(MALEi)    (10.7)
where the logit function on the left side of the equation is the log odds of steady drinking
(DRINKER = 0 if the respondent drinks less than once a week on average, and DRINKER
= 1 if the respondent consumes quantities of alcohol at least once per week), ­AGETRYALC
is the age at which the respondent first tried alcohol, COLLEGE is a dummy code indicat-
ing some college or a college degree, and MALE is a gender dummy code with females as
the reference group. Approximately 9.7% of the dependent variable scores are missing,
age at first use has a 32.9% missing data rate, and 27.9% of the educational attainment
values are unknown. The age at first use distribution is markedly peaked and positively
skewed, with scores ranging from 10 to 47.
The age at first alcohol use variable could be missing because the respondent
refused, didn't know the answer, or because the question wasn't applicable (e.g., the
question was skipped because a respondent reported no lifetime alcohol use). Some
researchers may choose to restrict the population of interest to persons who have tried
alcohol on the grounds that age at first use is not a relevant concept otherwise. This
approach would treat missing values that arise from a survey skip pattern as out of
scope, and it would exclude respondents with no lifetime alcohol use. I take the alterna-
tive tack of imputing missing responses regardless of origin.
Authoritative imputers argue that it is permissible to fill in “not applicable” or “don’t
know” responses like the ones here. For example, Rubin, Stern, and Vehovar (1995) sug-
gest that a “don’t know” response could be viewed as concealing some intention or future
behavior. Furthermore, imputing age at first use scores for respondents who report no
lifetime alcohol use may be justified on grounds that the answer to the lifetime use ques-
tion could be incorrect due to measurement or response error (Schafer & Graham, 2002,
p. 148). In these data, a logical skip may reflect a respondent’s uncertainty about a past
behavior, obscuring a score that truly exists. Schafer and Graham also argue that treat-
ing “not applicable” responses as missing values can provide a convenience feature that
facilitates missing data handling. For respondents who report no lifetime alcohol use, I
am effectively imputing the hypothetical age at first use, had the participant ever tried or
if he or she eventually will try alcohol.

Factored Regression Specification


The Yeo–­Johnson transformation described later in this section integrates with the
familiar factored regression specification (Lüdtke et al., 2020a, 2020b). As you know,
the factored regression (or sequential) specification expresses the multivariate distri-
bution of the analysis variables as a sequence of univariate distributions, each of which
corresponds to a regression model. The factorization for the logistic analysis model in
Equation 10.7 is as follows:

f(DRINKER | AGETRYALC, COLLEGE, AGE, MALE) ×
f(COLLEGE | AGETRYALC, AGE, MALE) × f(AGETRYALC | AGE, MALE) ×    (10.8)
f(AGE | MALE) × f(MALE)
The first term is the binomial distribution associated with the logistic model, the next
two terms are supporting models for the incomplete regressors, and the two terms in
the bottom row are unnecessary distributions for the complete predictors (I ignore these
terms going forward).
The f(AGETRYALC|AGE, MALE) term is the focus of this example, because the age
at first use variable is substantially skewed and kurtotic (skewness = 1.82 and excess
kurtosis = 8.12). I start by applying a linear regression model with a normal residual
distribution to this variable.

AGETRYALCi = γ01 + γ11(AGEi) + γ21(MALEi) + r1i    (10.9)

AGETRYALCi ~ N1(E(AGETRYALCi | AGEi, MALEi), σ²1)

Following established notation, the bottom row says that age scores are normally dis-
tributed around predicted values on the regression line and have constant variation.
Either a probit or logistic regression could be used to model the college indicator’s dis-
tribution, and the specification of the dependent variable’s model has no bearing on this
choice. I use probit regression and a latent response variable formulation for consistency
with earlier material.

COLLEGEi* = γ02 + γ12(AGETRYALCi) + γ22(AGEi) + γ32(MALEi) + r2i    (10.10)

COLLEGEi* ~ N1(E(COLLEGEi* | AGETRYALCi, AGEi, MALEi), 1)
Following procedures from Chapter 6, the residual variance is fixed at one to establish
a metric, and the model includes a fixed threshold parameter that divides the latent
response distribution into two segments.
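A single MCMC step for this latent response formulation can be sketched as follows: given the observed binary category and the current linear predictor, the latent score is drawn from a normal distribution (residual SD fixed at one) truncated at the threshold. The function and data below are illustrative assumptions, not the book's software.

```python
# Draw latent probit responses from a truncated normal, given the observed
# binary category and the current linear predictor. The threshold tau divides
# the latent distribution into two segments; the residual SD is fixed at 1.
import numpy as np
from scipy.stats import truncnorm

def draw_latent_response(linear_pred, observed, tau=0.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    linear_pred = np.asarray(linear_pred, dtype=float)
    observed = np.asarray(observed)
    # truncnorm takes bounds standardized by (loc, scale) of the untruncated normal
    lo = np.where(observed == 1, tau - linear_pred, -np.inf)
    hi = np.where(observed == 1, np.inf, tau - linear_pred)
    return truncnorm.rvs(lo, hi, loc=linear_pred, scale=1.0, random_state=rng)

eta = np.array([-0.5, 0.3, 1.2])   # hypothetical linear predictor values
y = np.array([0, 1, 1])            # observed binary categories
latent = draw_latent_response(eta, y)
```

Cases observed as 1 are drawn from above the threshold, and cases observed as 0 from below it, which is what keeps the latent scores consistent with the observed categories.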
The factored regression specification can be deployed with either maximum likeli-
hood or Bayesian estimation. I focus on the latter, because the resulting imputations
make the normality assumption fully transparent. As you know from earlier chapters,
the MCMC algorithm updates the regression model parameters conditional on the filled-
­in data, after which it samples new imputations from posterior predictive distributions
based on the newly minted model parameters. Following ideas established in Chapters
5 and 6, the conditional distributions of the incomplete regressors are complex, multi-
part functions that depend on every model in which a variable appears. In practice, the
Metropolis–­Hastings algorithm does the heavy lifting of sampling latent imputations
from these complicated distributions. As always, the distribution of the missing depen-
dent variable scores depends solely on the focal model’s parameters.
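The update-parameters-then-reimpute rhythm described above can be sketched for a simple normal-theory regression. The data, noninformative-prior draws, and single cycle below are illustrative assumptions; real software iterates these steps many times.

```python
# One cycle of an MCMC imputation step for a linear regression with normal
# residuals: (1) draw the residual variance and coefficients given the observed
# data, then (2) draw imputations from the posterior predictive distribution.
import numpy as np

rng = np.random.default_rng(1)
n, n_mis = 200, 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])       # observed cases
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.8, size=n)
X_mis = np.column_stack([np.ones(n_mis), rng.normal(size=n_mis)])  # missing cases

# Step 1: parameter draws (noninformative priors)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
df = n - X.shape[1]
sigma2 = resid @ resid / rng.chisquare(df)        # scaled inverse-chi-square draw
cov = sigma2 * np.linalg.inv(X.T @ X)
beta = rng.multivariate_normal(beta_hat, cov)

# Step 2: posterior predictive imputations (predicted value plus residual noise)
imputes = X_mis @ beta + rng.normal(scale=np.sqrt(sigma2), size=n_mis)
```

The key point is that the imputations inherit whatever residual distribution the model assumes; here that is a normal curve, which is exactly what makes the assumption visible in the graphs discussed in this section.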

Normal Distribution Imputation


The first analysis used Bayesian estimation and normally distributed imputations to
fill in the missing age at first alcohol use values. After inspecting trace plots and poten-
tial scale reduction factor diagnostics (Gelman & Rubin, 1992), I specified an MCMC
process with 10,000 iterations following a 1,000-iteration burn-in period, and I created
100 filled-­in data sets by saving the imputations from the final iteration of 100 paral-
lel chains. Figure 10.1 shows overlaid histograms with the observed data as gray bars
and the missing values as white bars with a kernel density function. As you can see,
the observed data are markedly peaked and skewed, with scores ranging from 10 to
47, whereas the imputations follow a symmetric distribution that extends from 0.30 to
34.93. Considered as a whole, the imputed data are a weighted mixture of a normal dis-
tribution and a skewed distribution, and about 2% of the missing values fell below the
lowest reported age.
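For readers who want to compute the potential scale reduction factor diagnostic by hand, a minimal sketch for a single parameter follows the Gelman and Rubin (1992) formula; the simulated chains below are purely illustrative.

```python
# Potential scale reduction factor: compares between-chain variance (B) to
# within-chain variance (W) for draws stored as a (chains x iterations) array.
import numpy as np

def psrf(draws):
    m, t = draws.shape                       # number of chains, iterations per chain
    chain_means = draws.mean(axis=1)
    b = t * chain_means.var(ddof=1)          # between-chain variance
    w = draws.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (t - 1) / t * w + b / t        # pooled posterior variance estimate
    return np.sqrt(var_hat / w)

rng = np.random.default_rng(7)
converged = rng.normal(size=(4, 5000))       # four chains from the same distribution
rhat = psrf(converged)                       # close to 1.0 when chains agree

diverged = converged + np.arange(4)[:, None] # chains with shifted means
rhat_bad = psrf(diverged)                    # well above 1.0
```

Values near 1.0 indicate that the parallel chains are sampling the same distribution; values much larger than 1.0 flag nonconvergence.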
The middle panel of Table 10.4 shows the multiple imputation estimates with
robust standard errors (not surprisingly, the Bayesian posterior medians and standard
deviations were numerically equivalent to the frequentist point estimates and standard
errors), and the top panel shows full information maximum likelihood (FIML) estimates
from the same factored regression. Perhaps not surprisingly, the two sets of results are
virtually indistinguishable, because they apply the same assumptions to the same data.

FIGURE 10.1. Overlaid histograms with the observed data as gray bars and the missing values
as white bars with a kernel density function. The observed data are markedly peaked and skewed
with scores ranging from 10 to 47, whereas the imputations follow a symmetrical distribution
that extends from 0.30 to 34.93.

Substantively, the results show that the probability of steady drinking increased for indi-
viduals who tried alcohol at an earlier age, attended at least some college, are older, and
are males. The implausible score values in Figure 10.1 clearly offend our aesthetic sensi-
bilities, but out-of-range imputations don’t necessarily invalidate the results and trans-
late into biased parameter estimates; computer simulation studies show that a normal
imputation model can work surprisingly well when estimating means and regression
coefficients (Demirtas et al., 2008; Lee & Carlin, 2017; von Hippel, 2013; Yuan et al.,
2012), although other studies suggest that applying a normal curve to a heavily skewed
predictor like the one from this example can introduce bias and distort significance tests
(Lüdtke et al., 2020a, 2020b).
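As a quick arithmetic check, the odds ratios in Table 10.4 are the exponentiated logistic slopes, and predicted probabilities invert the logit in Equation 10.7. The sketch below uses the rounded FIML estimates, so the results differ slightly from the tabled values (which derive from unrounded estimates), and the covariate profile is hypothetical.

```python
# Converting logistic slopes to odds ratios and a predicted probability.
# Coefficients are the rounded FIML estimates from Table 10.4.
import math

b0, b_agetry, b_college, b_age, b_male = -2.72, -0.07, 0.42, 0.02, 0.84

odds_ratio_male = math.exp(b_male)   # roughly the tabled odds ratio for MALE

# Hypothetical profile: a 30-year-old male with some college who first tried
# alcohol at age 16
logit = b0 + b_agetry * 16 + b_college * 1 + b_age * 30 + b_male * 1
prob = 1 / (1 + math.exp(-logit))    # predicted probability of steady drinking
```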

Transformations and Other Solutions


The literature describes several approaches to treating non-normal missing data that
don’t involve the normal curve. One strategy is to sample imputations directly from a
non-normal distribution, including the t distribution (Liu, 1995), beta and Weibull dis-
tributions (Demirtas & Hedeker, 2008a), Tukey’s gh distribution (He & Raghunathan,
2009), and distributions based on power polynomials (Demirtas & Hedeker, 2008b),
among others. At present, most of these options are not widely available in statistical
software packages, and some are limited to very simple applications. A variant of
fully conditional specification known as predictive mean matching (Kleinke, 2017; Lee
& Carlin, 2017; van Buuren, 2012; Vink et al., 2014) achieves a similar end by sampling
imputations from a donor pool of observed scores taken from participants whose pre-
dicted values are similar to that of the person with missing data. Van Buuren (2012)
provides a detailed discussion of predictive mean matching, and the procedure is avail-
able in his popular R package MICE (van Buuren et al., 2021; van Buuren & Groothuis-
Oudshoorn, 2011).

TABLE 10.4. Logistic Regression Analysis with a Non-Normal Predictor

Effect Est. SE z p OR FMI

Full information maximum likelihood
β0 –2.72 0.18 –14.80 < .001 — —
β1 (AGETRYALC) –0.07 0.02 –3.49 < .001 0.93 —
β2 (COLLEGE) 0.42 0.16   2.71   .01 1.53 —
β3 (AGE) 0.02 0.00   5.90 < .001 1.03 —
β4 (MALE) 0.84 0.15   5.67 < .001 2.31 —

Normal distribution imputation
β0 –2.71 0.19 –14.52 < .001 — .15
β1 (AGETRYALC) –0.07 0.02 –3.35 < .001 0.93 .25
β2 (COLLEGE) 0.43 0.15   2.75   .01 1.53 .23
β3 (AGE) 0.02 0.004   5.83 < .001 1.02 .11
β4 (MALE) 0.84 0.15   5.72 < .001 2.31 .09

Yeo–Johnson imputation
β0 –2.73 0.19 –14.75 < .001 — .12
β1 (AGETRYALC) –0.08 0.02 –3.54 < .001 0.93 .23
β2 (COLLEGE) 0.42 0.16   2.73   .01 1.53 .23
β3 (AGE) 0.03 0.004   5.92 < .001 1.03 .11
β4 (MALE) 0.83 0.15   5.66 < .001 2.31 .10

Note. OR, odds ratio; FMI, fraction of missing information.
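The donor-pool logic of predictive mean matching can be sketched in a few lines. The k-nearest-donor rule and simulated data below are illustrative assumptions, not the MICE implementation.

```python
# Predictive mean matching sketch: for each case with missing Y, find the k
# observed cases whose predicted values are closest to its own predicted value,
# then impute by sampling one of those donors' observed scores.
import numpy as np

def pmm_impute(y_obs, pred_obs, pred_mis, k=5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    imputes = np.empty(len(pred_mis))
    for i, p in enumerate(pred_mis):
        donors = np.argsort(np.abs(pred_obs - p))[:k]   # k nearest predicted values
        imputes[i] = y_obs[donors[rng.integers(k)]]     # sample one donor's score
    return imputes

rng = np.random.default_rng(3)
y_obs = rng.gamma(2.0, 2.0, size=300)            # skewed observed outcome
pred_obs = 0.5 * y_obs + rng.normal(size=300)    # stand-in predicted values
pred_mis = rng.normal(2.0, 1.0, size=50)         # predictions for missing cases
imputes = pmm_impute(y_obs, pred_obs, pred_mis, rng=rng)
```

Because every imputation is a real observed score, the filled-in values automatically respect the shape and range of the observed data, which is the procedure's main appeal for non-normal variables.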
A second option is to apply a normalizing transformation to the skewed variable
prior to imputation and then back-­transform the filled-­in data to the original metric
prior to analysis. Given the right transformation, this procedure can produce imputa-
tions that have approximately the same shape as the observed scores. Common recom-
mendations for positively skewed variables include Box–Cox transformations (Box &
Cox, 1964; Goldstein et al., 2009) and simple logarithmic, square root, or inverse trans-
formations (Schafer & Olsen, 1998; Su et al., 2011; van Buuren, 2012; von ­Hippel, 2013).
These transformations can be applied to a negatively skewed variable after first reflect-
ing its distribution by subtracting scores from the maximum value plus one. Other
options include a fourth-root transformation for variables that mimic an exponential
distribution (von Hippel, 2013) and a logit transformation for proportion variables (Su
et al., 2011).
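The transform, impute, and back-transform recipe can be sketched as follows. scipy supplies the forward Yeo–Johnson transformation; the inverse is written out by hand because scipy does not export one, and the single normal draw stands in for a real imputation model.

```python
# Normalize a skewed variable, draw "imputations" on the transformed metric,
# and invert the transformation to return to the original metric.
import numpy as np
from scipy import stats

def yeojohnson_inverse(xt, lam):
    """Invert the Yeo-Johnson transformation (hand-written; not in scipy)."""
    xt = np.asarray(xt, dtype=float)
    pos = xt >= 0
    out = np.empty_like(xt)
    if abs(lam) > 1e-8:
        out[pos] = (xt[pos] * lam + 1) ** (1 / lam) - 1
    else:
        out[pos] = np.expm1(xt[pos])
    if abs(lam - 2) > 1e-8:
        out[~pos] = 1 - (1 - (2 - lam) * xt[~pos]) ** (1 / (2 - lam))
    else:
        out[~pos] = -np.expm1(-xt[~pos])
    return out

rng = np.random.default_rng(4)
x = rng.gamma(2.0, 2.0, size=1000)               # positively skewed scores
xt, lam = stats.yeojohnson(x)                    # ML estimate of the shape parameter
draw = rng.normal(xt.mean(), xt.std(), size=10)  # stand-in imputes, normalized metric
imputes = yeojohnson_inverse(draw, lam)          # back to the original skewed metric
roundtrip = yeojohnson_inverse(xt, lam)          # recovers the original scores
```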
The Yeo–Johnson power transformation (Yeo & Johnson, 2000) is a flexible pro-
cedure that subsumes a number of common functions, including inverse, logarithmic,
square root, and Box–Cox transformations. Unlike the popular Box–Cox transforma-
tion, the Yeo–Johnson procedure accommodates negative score values, and the non-
normal variable can be positively or negatively skewed. The transformation has shown
promise when paired with a factored regression specification (Lüdtke et al., 2020b) and
is readily available in statistical software (Keller & Enders, 2021; Robitzsch & Lüdtke,
2021).
To illustrate the procedure, consider a skewed regressor X (e.g., age at first alcohol
use). Yeo and Johnson (2000) assume that a transformed version of X follows a normal
distribution with a mean and variance (or predicted score and residual variance in the
case of regression); that is

Xi† ~ N1(μ, σ²)    (10.11)

where X† is the transformed score. The function that returns the transformed scores is

Xi† = ((Xi + 1)^λ − 1) / λ                   if Xi ≥ 0 and λ ≠ 0
Xi† = ln(Xi + 1)                             if Xi ≥ 0 and λ = 0
Xi† = −((−Xi + 1)^(2−λ) − 1) / (2 − λ)       if Xi < 0 and λ ≠ 2     (10.12)
Xi† = −ln(−Xi + 1)                           if Xi < 0 and λ = 2

where λ is the shape parameter.
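Equation 10.12 translates directly into code. The sketch below checks a hand-written version of the function against scipy's implementation of the same transformation.

```python
# Equation 10.12 written out as a piecewise function, verified against scipy.
import numpy as np
from scipy import stats

def yeojohnson_transform(x, lam):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if abs(lam) > 1e-8:                         # Xi >= 0 and lambda != 0
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:                                       # Xi >= 0 and lambda == 0
        out[pos] = np.log1p(x[pos])
    if abs(lam - 2) > 1e-8:                     # Xi < 0 and lambda != 2
        out[~pos] = -((-x[~pos] + 1) ** (2 - lam) - 1) / (2 - lam)
    else:                                       # Xi < 0 and lambda == 2
        out[~pos] = -np.log1p(-x[~pos])
    return out

x = np.array([-2.5, -0.3, 0.0, 1.0, 4.7])       # mixed-sign score values
mine = yeojohnson_transform(x, 0.5)
check = stats.yeojohnson(x, lmbda=0.5)          # scipy's version, same lambda
```

Both branches of the function matter: the positive branch handles λ = 0 as a logarithm, and the negative branch handles λ = 2 the same way, which is what lets the transformation accommodate negative scores.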
The skewed variable, in turn, follows a Yeo–­Johnson normal distribution with the
mean, variance, and shape coefficient as parameters.

Xi ~ YJN(μ, σ², λ)    (10.13)

This distribution is essentially a two-part function based on a normal curve and a com-
ponent for the shape coefficient. To illustrate, the distribution’s log-­likelihood function
is

LL(μ, σ², λ | data) = {−(N/2) ln(2π) − (N/2) ln(σ²) − (1/(2σ²)) Σ (Xi† − μ)²}
+ (λ − 1) Σ sign(Xi) ln(|Xi| + 1)    (10.14)

where the terms in curly braces correspond to a normal distribution for the transformed
variable, and the second collection of terms owes to the shape parameter and its linkage
to the raw scores (Yeo & Johnson, 2000, Eq. 3.1). To illustrate, Figure 10.2 shows the
distributions that result from applying an inverse Yeo–Johnson transformation with
shape parameters of λ = 0.50, 1.00 (no transformation), and 1.50 to a standard normal
variable (i.e., the raw score distributions that the transformation in Equation 10.12
would normalize). The shape parameter doesn't have a clear interpretation, because it
works differently depending on whether the raw score is positive or negative (see Equa-
tion 10.12). Nevertheless, the figure highlights that the transformation can map highly
skewed distributions to the normal curve.
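In practice, the shape parameter is chosen to maximize the log-likelihood in Equation 10.14. scipy profiles out the mean and variance and returns the maximizing λ; the simulated skewed scores below illustrate how the transformation pulls a distribution toward normality.

```python
# Estimate the Yeo-Johnson shape parameter by maximum likelihood and compare
# skewness before and after the transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.lognormal(mean=2.5, sigma=0.4, size=2000)   # positively skewed scores
xt, lam_hat = stats.yeojohnson(x)                   # transformed scores, ML lambda

skew_before = stats.skew(x)
skew_after = stats.skew(xt)                         # much closer to zero

# The profile log-likelihood is highest at the estimated shape parameter
ll_hat = stats.yeojohnson_llf(lam_hat, x)
ll_one = stats.yeojohnson_llf(1.0, x)               # lambda = 1: no transformation
```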
To illustrate the application of this distribution to the factored regression specifica-
tion, consider a linear regression model with a skewed regressor:

Yi = β0 + β1 X i + ε i (10.15)

Yi ~ N1(E(Yi | Xi), σ²ε)
The factored regression specification for this analysis expresses the joint distribution of
the two variables as the product of two univariate distributions.

f(Y, X) = f(Y | X) × f(X)    (10.16)

FIGURE 10.2. The distributions that result from applying an inverse Yeo–Johnson transfor-
mation with shape parameters of λ = 0.50, 1.00 (no transformation), and 1.50 to a standard
normal variable (i.e., the raw score distributions that are normalized, because of applying the
transformation).

The first term after the equal sign corresponds to the focal analysis model, and the
second term corresponds to the Yeo–­Johnson normal distribution in Equation 10.13.
As with any incomplete predictor, the conditional distribution of the missing values
depends on all distributions or models in which the variable appears. In this case, the
distribution of missing values is a two-part function that depends on a normal curve for
Y and a Yeo–­Johnson normal curve for X. Importantly, the parameters of f(X) are on the
transformed metric, but X and its imputations are on the raw score metric. From a prac-
tical perspective, this means that the interpretation of the focal parameters is unaffected
by the transformation; conceptually, the procedure samples skewed imputations that,
when transformed using λ and Equation 10.12, approximate a normal curve. As you will
see, the imputations closely mimic the distribution of the observed data.
Implementing the Yeo–­ Johnson transformation requires a value for the shape
parameter λ. It is convenient to embed this parameter into the iterative estimation pro-
cess. For example, the MCMC recipe for the factored regression specification from Equa-
tion 10.16 has the following major steps: (1) Estimate the focal model parameters, condi-
tional on the current values of Y and X; (2) estimate the Yeo–­Johnson model parameters,
conditional on the transformed X scores; (3) estimate the shape parameter λ, conditional
on the current data; (4) impute Y conditional on the focal model parameters and the
current values of X; and (5) impute X conditional on two sets of model parameters, the
shape parameter, and the new values of Y. As always, the Metropolis–­Hastings algorithm
can draw imputations from complex, nonstandard distributions.
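Step (3) of this recipe can be sketched as a random-walk Metropolis update for λ. The profile log-likelihood below stands in for the full conditional distribution (an implicit flat prior; real software also conditions on the current regression parameters), and the data and tuning constants are illustrative.

```python
# Random-walk Metropolis update for the Yeo-Johnson shape parameter: propose
# a new lambda near the current one and accept with the likelihood ratio.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.gamma(2.0, 3.0, size=1000)      # skewed "current" data at this iteration

lam = 1.0                                # starting value (no transformation)
chain = []
for _ in range(500):
    proposal = lam + rng.normal(scale=0.05)
    log_ratio = stats.yeojohnson_llf(proposal, x) - stats.yeojohnson_llf(lam, x)
    if np.log(rng.uniform()) < log_ratio:
        lam = proposal                   # accept; otherwise keep the current value
    chain.append(lam)
```

With skewed data like these, the chain drifts away from λ = 1 toward values that normalize the distribution, mirroring how the shape parameter settles in during the burn-in period.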

Imputation with a Yeo–Johnson Predictor Distribution


I used Bayesian estimation and model-based multiple imputation to illustrate the Yeo–­
Johnson transformation for the missing age at first alcohol use values. Returning to the
factored regression specification in Equation 10.8, the f(AGETRYALC|AGE, MALE) term
now corresponds to a Yeo–­Johnson normal distribution for the age at first alcohol use
variable.

AGETRYALCi† = γ01 + γ11(AGEi) + γ21(MALEi) + r1i    (10.17)

AGETRYALCi ~ YJN(E(AGETRYALCi† | AGEi, MALEi), σ²r1, λ)
The model is a linear regression linking the transformed age scores to other predictors,
and the predicted score and residual variance from this regression define the center and
spread of the Yeo–­Johnson normal distribution for the skewed age variable. All other
aspects of the factorization are the same as the previous example.
The Yeo–­Johnson transformation can be finicky to implement, and MCMC can be
very slow to converge if the skewed variable’s mean is far from zero. To facilitate conver-
gence, I centered the age scores at their median value of 16 prior to fitting the model. After
inspecting trace plots and potential scale reduction factor diagnostics (­Gelman & Rubin,
1992), I specified an MCMC process with 10,000 iterations following a 2,000-­iteration
burn-in period, and I created 100 filled-­in data sets by saving the imputations from the

final iteration of 100 parallel chains. Figure 10.3 shows overlaid histograms with the
observed data as gray bars and the missing values as white bars with a kernel density
function. As you can see, the imputations follow a positively skewed distribution that
mimics the shape of the observed data, ranging from 5.89 to 42.83. Any continuous dis-
tribution is likely to produce out-of-range values, and the Yeo–Johnson procedure is no
different. However, only 0.60% of the missing values now fall below the lowest reported
age (there is no need to round or truncate these values or any other fractional imputes).
As mentioned previously, the multiple imputations are on the original metric, so
you can simply fit the analysis model to the filled-in data sets without regard to the
transformation (unless you want to normalize the variable first, in which case you use
the estimated shape parameter and Equation 10.12). The bottom panel of Table 10.4
shows the multiple imputation estimates with robust standard errors. Following the
earlier example, the results suggest that the probability of steady drinking increased
for individuals who tried alcohol at an earlier age, attended at least some college, are
older, and are males. Applying the Yeo–Johnson transformation changed the age at first
alcohol use slope coefficient by nearly half of a standard error unit. I judge this to be a
nontrivial difference, because it could potentially alter the inference about this variable
if the sample size or effect size was smaller. Viewed through the lens of a sensitivity
analysis, the tabled results simply reflect two different assumptions about the missing
data distribution, and there is no way to know for sure which analysis is more correct.
Reporting results for both sets of assumptions is an appropriate option.

FIGURE 10.3. Overlaid histograms with the observed data as gray bars and the missing val-
ues as white bars with a kernel density function. The imputations follow a skewed distribution
that mimics the shape of the observed data and ranges from 5.89 to 42.83.

10.4 NON‑NORMAL OUTCOME VARIABLES

The Yeo–Johnson transformation is also applicable to a non-normal dependent variable
with missing data. Returning to an earlier example from Section 3.6, I use the smoking
data from the companion website to illustrate a multiple regression analysis with a
skewed outcome. The data set includes sociodemographic correlates of smoking inten-
sity from a survey of N = 2,000 young adults (e.g., age, whether a parent smoked, gender,
income). The focal model uses a parental smoking indicator (0 = parents did not smoke, 1
= parent smoked), age, and income to predict smoking intensity, defined as the number
of cigarettes smoked per day. The linear regression with normally distributed residuals
is shown below:

INTENSITYi = β0 + β1 ( PARSMOKEi ) + β2 ( INCOMEi ) + β3 ( AGEi ) + ε i (10.18)

The smoking intensity variable has 21.2% missing data, 3.6% of the parental smoking
indicator scores are missing, and 11.4% of the income values are unknown. The smoking
intensity distribution is markedly peaked and positively skewed, with scores ranging
from 2 to 29 (skewness = 1.46 and excess kurtosis = 2.88). Although a count or negative
binomial imputation model might be more appropriate for these data (see Section 10.10),
I use the Yeo–­Johnson transformation as a continuous approximation to the discrete
distribution.

Factored Regression Specification


The factored regression specification for the analysis model in Equation 10.18 is as fol-
lows:

f(INTENSITY | PARSMOKE, INCOME, AGE) ×
f(PARSMOKE | INCOME, AGE) × f(INCOME | AGE) × f(AGE)    (10.19)

The first term is the focal linear regression model, the next two terms are supporting
models for the incomplete regressors, and the final term is an unnecessary distribution
for the complete predictor. I use a probit model for the parental smoking indicator and a
linear regression model for the income variable. The composition of the regressor mod-
els is familiar by now, so I omit these equations in the interest of space.

Normal Distribution Imputation


The first analysis used Bayesian estimation and normally distributed imputations to fill
in the missing smoking intensity values. After inspecting trace plots and potential scale
reduction factor diagnostics (Gelman & Rubin, 1992), I specified an MCMC process with
10,000 iterations following a 1,000-iteration burn-in period, and I created 100 filled-­in
data sets by saving the imputations from the final iteration of 100 parallel chains. Figure
10.4 shows overlaid histograms with the observed data as gray bars and the missing
values as white bars with a kernel density function.

FIGURE 10.4. Overlaid histograms with the observed data as gray bars and the missing val-
ues as white bars with a kernel density function. The observed data are markedly peaked and
skewed with scores ranging from 2 to 29, whereas the imputations follow a symmetrical distribu-
tion that extends from –5.52 to 27.12.

As you can see, the observed data are markedly peaked and skewed, with scores ranging
from 2 to 29, whereas the imputations follow a symmetric distribution that extends from
–5.52 to 27.12. Although they are very small in number, negative imputations are clearly
illogical.
The middle panel of Table 10.5 shows the multiple imputation estimates with robust
standard errors (the Bayesian posterior medians and standard deviations were numeri-
cally equivalent to the frequentist point estimates and standard errors), and the top
panel shows FIML estimates as a comparison. The similarity of the two sets of point
estimates highlights that maximum likelihood assumes the same distribution for the
missing values as Figure 10.4 without producing explicit evidence of that assumption.
The one noticeable difference was the standard error of the residual variance, which
was smaller in the multiple imputation analysis. Not surprisingly, the sandwich estima-
tor produced different results when applied to just the observed data versus imputed
data sets that are a mixture of a normal and skewed distributions. Substantively, the
results show that smoking intensity increased for respondents whose parents smoked,
decreased for people with higher incomes, and increased as age increased.

Imputation with a Yeo–Johnson Outcome Distribution


I next used Bayesian estimation and model-based multiple imputation to illustrate the
Yeo–Johnson transformation for the missing smoking intensity scores. Returning to the

factored regression specification in Equation 10.19, the focal model now corresponds to
a Yeo–Johnson normal distribution for the smoking intensity variable.

INTENSITYi† = γ0 + γ1(PARSMOKEi) + γ2(INCOMEi) + γ3(AGEi) + ri    (10.20)

INTENSITYi ~ YJN(E(INTENSITYi† | PARSMOKEi, INCOMEi, AGEi), σ²r, λ)
Importantly, the linear regression model links the transformed outcome to the regres-
sors, and the predicted score and residual variance from this regression define the center
and spread of the Yeo–Johnson normal distribution for the skewed smoking intensity
variable. All other aspects of the factorization are the same as the previous example. I
use γ’s to emphasize that the parameters have a different interpretation than those in
Equation 10.18. To reiterate, the imputed scores are on the original skewed metric.
As noted earlier, the MCMC algorithm can be very slow to converge if the skewed
variable’s mean is far from zero. To facilitate convergence, I centered the smoking inten-
sity scores at the median value of 9 prior to fitting the model. After inspecting trace plots
and potential scale reduction factor diagnostics (Gelman & Rubin, 1992), I specified an
MCMC process with 10,000 iterations following a 2,000-iteration burn-in period, and I
created 100 filled-in data sets by saving the imputations from the final iteration of 100
parallel chains. Figure 10.5 shows overlaid histograms with the observed data as gray
bars and the missing values as white bars with a kernel density function.

FIGURE 10.5. Overlaid histograms with the observed data as gray bars and the missing val-
ues as white bars and a kernel density function. The imputations follow a skewed distribution
that mimics the shape of the observed data and ranges from 2.33 to 39.03.

As you can see, the imputations follow a positively skewed distribution that mimics the shape of the
observed data, ranging from 2.33 to 39.03. The Yeo–­Johnson procedure produced a very
small number of out-of-range values in the stacked data set with 200,000 observations
(100 data sets × 2,000 observations). There is no need to round or truncate these values
prior to analysis. As mentioned previously, the multiple imputations are on the original
metric, so you can simply fit the analysis model to the filled-­in data sets without regard
to the transformation. The bottom panel of Table 10.5 shows the multiple imputation
estimates with robust standard errors. Overall, the point estimates were like those of
the normal-­theory analyses, and the sandwich estimator standard errors were a better
match to maximum likelihood.
A final option is to analyze the normalized or transformed dependent variable, as
is common practice in many disciplines. To illustrate, I saved the transformed outcome
scores for each of the 100 imputed data sets alongside their skewed counterparts. Figure
10.6 shows overlaid histograms with the observed data as gray bars and the missing
values as white bars with a kernel density function.

TABLE 10.5. Linear Regression Analysis with a Non-Normal Outcome

Effect Est. SE z p FMI

Full information maximum likelihood
β0 –2.71 0.84 –3.23 < .001 —
β1 (PARSMOKE) 2.66 0.18 14.98 < .001 —
β2 (INCOME) –0.13 0.03 –4.73 < .001 —
β3 (AGE) 0.58 0.04 15.56 < .001 —
σ²ε 11.23 0.74 15.12 < .001 —
R²   .25 .02 11.33 < .001 —

Normal distribution imputation
β0 –2.72 0.84 –3.24 < .001 .24
β1 (PARSMOKE) 2.65 0.17 15.25 < .001 .22
β2 (INCOME) –0.13 0.03 –4.93 < .001 .26
β3 (AGE) 0.58 0.04 15.64 < .001 .22
σ²ε 11.21 0.63 17.81 < .001 .08
R²   .25 .02 12.24 < .001 .15

Yeo–Johnson imputation
β0 –2.94 0.86 –3.44 < .001 .28
β1 (PARSMOKE) 2.73 0.18 15.06 < .001 .27
β2 (INCOME) –0.13 0.03 –4.92 < .001 .29
β3 (AGE) 0.60 0.04 15.52 < .001 .25
σ²ε 11.37 0.72 15.70 < .001 .20
R²   .26 .02 12.73 < .001 .14

Note. FMI, fraction of missing information.

FIGURE 10.6. Overlaid histograms with the transformed observed data as gray bars and the
transformed missing values as white bars and a kernel density function.

The Yeo–Johnson transformation
maintains the sign of the original scores, and the spread of the imputes around zero is a
result of centering the outcome prior to the analysis (doing so facilitated convergence).
Table 10.6 shows the multiple imputation results (the Bayesian posterior medians and
standard deviations were numerically equivalent to the frequentist point estimates and
standard errors). Substantively, the results show that smoking intensity increased for
respondents whose parents smoked, decreased for people with higher incomes, and
increased as age increased. Although the results are on a different metric, the signs and
interpretations of the coefficients are the same as the untransformed results.

TABLE 10.6. Linear Regression Analysis with a Transformed Outcome


Effect Est. SE z p FMI
γ00 –11.05 0.65 –16.99 < .001 .28
γ10 (PARSMOKE) 2.33 0.13 17.38 < .001 .26
γ20 (INCOME) –0.12 0.02 –5.63 < .001 .31
γ30 (AGE) 0.51 0.03 18.12 < .001 .23
σ²r0 6.53 — — — .23
R² .31 — — — .23

Note. FMI, fraction of missing information.



10.5 MEDIATION AND INDIRECT EFFECTS

A mediation analysis attempts to clarify the mechanism through which two variables
are related. A typical model features an explanatory variable affecting an intervening
variable (the mediator) that, in turn, transmits the predictor’s influence to the outcome.
Seminal mediation articles include Baron and Kenny (1986) and Judd and Kenny (1981),
and a number of excellent books are devoted to the topic (Hayes, 2013; Jose, 2013;
MacKinnon, 2008; Muthén et al., 2016). I use the chronic pain data on the companion
website to illustrate a mediation analysis with missing data. The data set includes psy-
chological correlates of pain severity (e.g., depression, pain interference with daily life,
perceived control) for a sample of N = 275 individuals with chronic pain. The single-­
mediator model for the illustration features a binary severe pain indicator (0 = no, little,
or moderate pain, 1 = severe pain) that influences depression indirectly via an intervening
or mediating variable, pain interference with daily life activities. Approximately 7.3% of
the binary pain ratings are missing, and the missing data rates for the depression and
pain interference scales are 13.5 and 10.6%, respectively.
Figure 10.7 depicts the mediation model as a path diagram, with straight arrows
denoting regression coefficients and double-­headed curved arrows representing vari-
ances or residual variances. To reduce visual clutter, I omit triangle symbols that
researchers sometimes use to denote grand means or intercepts. The model decomposes
the bivariate association between severe pain and depression into a direct pathway and
an indirect pathway via the mediator variable, pain interference with daily life. The path
diagram can alternatively be written as a pair of regression equations. Modifying my
established notation to align with the mediation literature, the regression models are

INTERFEREi = I1 + α(PAINi) + ε1i    (10.21)

DEPRESSi = I2 + β(INTERFEREi) + τ′(PAINi) + ε2i

where I1 and I2 are regression intercepts, α and β are slope coefficients that define the
indirect effect, τ′ is the direct effect of severe pain on depression, and ε1 and ε2 are nor-


FIGURE 10.7. Path diagram of a single-­mediator model. A binary pain severity indicator
exerts a direct influence on depression, and it also exerts an indirect effect via an intervening or
mediating variable, pain interference with daily life.
Special Topics and Applications 423

mally distributed residuals. The two regressions align perfectly with the factored regres-
sion (sequential) specification we’ve used throughout the book, and they also integrate
with a multivariate (structural equation model) specification. I use the former for the
Bayesian analysis and the latter for maximum likelihood and multiple imputation.
Multiplying the α and β slopes (i.e., the indirect pathways) defines the so-­called
“product of coefficients” estimator of the mediated effect, αβ = α × β. Mediation infer-
ence is challenging, because the sampling distribution of the product of two coefficients
can be markedly asymmetric and kurtotic, even when estimates of α and β follow a nor-
mal distribution (MacKinnon, 2008; MacKinnon, Lockwood, Hoffman, West, & Sheets,
2002; MacKinnon, Lockwood, & Williams, 2004; Shrout & Bolger, 2002). I introduced
bootstrap resampling in Section 2.8 as a method for generating standard errors that are
robust to normality violations (Efron, 1987; Efron & Gong, 1983; Efron & Tibshirani,
1993), and this approach is also the preferred method for testing indirect effects in the
frequentist framework (MacKinnon, 2008; MacKinnon et al., 2004; Shrout & Bolger,
2002). In a Bayesian analysis, the MCMC algorithm iteratively estimates α and β, and
multiplying each pair of estimates creates a posterior distribution and credible intervals
that reflect the estimand’s natural shape.
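The MCMC post-processing step described above can be sketched in a few lines of NumPy. The draw values below are hypothetical stand-ins for real sampler output; only the pairwise multiplication and the quantile-based interval are the point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for 10,000 retained MCMC draws of the two
# slopes; real draws would come from the Bayesian sampler.
alpha_draws = rng.normal(8.4, 1.1, size=10_000)
beta_draws = rng.normal(0.18, 0.05, size=10_000)

# Multiplying each pair of draws gives the posterior distribution of
# the indirect effect; its quantiles define an equal-tailed 95%
# credible interval that honors the product's asymmetric shape.
ab_draws = alpha_draws * beta_draws
ab_median = float(np.median(ab_draws))
lcl, ucl = np.quantile(ab_draws, [0.025, 0.975])
```

Because the interval comes from the quantiles of the product's own distribution, no normality assumption is imposed on the indirect effect.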

Factored Regression Specification


The factored regression specification for the single-­mediator model in Equation 10.21 is
as follows:

f(DEPRESS | INTERFERE, PAIN) × f(INTERFERE | PAIN) × f(PAIN*)    (10.22)

All three components are needed for this example, because the variables have miss-
ing data. The analysis also included perceived control over pain, stress, and anxiety as
auxiliary variables, as all have salient bivariate associations with the analysis variables.
Sequencing the models such that the analysis variables predict the auxiliary variables
and not vice versa maintains the desired interpretation of the focal model parameters.
The factored regression specification is as follows:

f(CONTROL | STRESS, ANXIETY, DEPRESS, INTERFERE, PAIN) ×
f(STRESS | ANXIETY, DEPRESS, INTERFERE, PAIN) ×
f(ANXIETY | DEPRESS, INTERFERE, PAIN) ×    (10.23)
f(DEPRESS | INTERFERE, PAIN) × f(INTERFERE | PAIN) × f(PAIN*)
All models correspond to linear regressions with normal distributions except the final
term, which is an empty probit (or logit) model for the binary explanatory variable.

Bayesian Estimation
Yuan and MacKinnon (2009) describe complete-­data Bayesian estimation and infer-
ence for the model specification in Equation 10.22, and their approach readily accom-

TABLE 10.7. Posterior Summary of a Mediation Analysis with Auxiliary Variables
Parameter Mdn SD LCL UCL
PAIN → INTERFERE (α) 8.39 1.07 6.26 10.44
INTERFERE → DEPRESS (β) 0.18 0.05 0.09 0.28
PAIN → DEPRESS (τ′) 1.92 0.94 0.10 3.78
Indirect Effect (αβ) 1.52 0.47 0.68 2.52

Note. LCL, lower credible limit; UCL, upper credible limit.

modates missing data. The MCMC algorithm follows a familiar two-step recipe that
involves estimating multiple sets of regression model parameters conditional on the
filled-­in data, then sampling new imputations from distributions based on the updated
model parameters. The missing values follow complex, multipart functions that depend
on every model in which a variable appears. For example, the distribution of depression
scores depends on the focal model parameters (e.g., the β and τ′ paths) and three auxil-
iary variable models. Similarly, the conditional distribution of the severe pain indicator
involves the product of six distributions. In practice, the Metropolis–­Hastings algo-
rithm does the heavy lifting of sampling imputations from these complex, multipart
functions.

FIGURE 10.8. Posterior distribution of 10,000 indirect effect (i.e., product of coefficients esti-
mator) estimates from MCMC estimation.

The potential scale reduction factor (Gelman & Rubin, 1992) diagnostic indicated
that the MCMC algorithm converged in fewer than 200 iterations, so I used 11,000 total
iterations with a conservative 1,000-iteration burn-in period. Table 10.7 summarizes the
posterior distributions of the mediation model parameters. In the interest of space, I omit
the auxiliary variable and covariate model parameters from the table, because they are
not the substantive focus. The product of coefficients estimator is a deterministic aux-
iliary parameter obtained by multiplying the α and β coefficients from each iteration.
Figure 10.8 shows the posterior distribution of the 10,000 indirect effect estimates, with
a solid line denoting the median estimate and dashed lines indicating the 95% credible
interval boundaries. The posterior median was Mdnαβ = 1.52, meaning that the change
from mild or moderate pain to severe pain increased depression by 1.52 points via pain
interference with daily life. The 95% credible intervals were asymmetrical around the
distribution’s center and spanned from 0.68 to 2.52. Applying null hypothesis-­like logic,
we can conclude that the parameter value is unlikely to equal zero, because the null
value falls outside the credible interval.

Maximum Likelihood
Next, I used maximum likelihood and structural equation modeling software to esti-
mate the mediation model in Figure 10.7. This approach incorrectly assumes that the
binary pain severity indicator is normally distributed, but computer simulations suggest
that this misspecification is often benign (Muthén et al., 2016); earlier analysis examples
support this conclusion. Finally, I used Graham’s (2003) saturated correlates approach
to incorporate the three additional auxiliary variables into the model. Recall that this
specification uses correlated residuals to connect the auxiliary variables to the analysis
variables and to each other (see Section 3.10).
The bootstrap is the preferred method for testing indirect effects in the frequentist
framework. As a quick review, the basic idea is to treat the sample data as a surrogate
for the population and draw B samples of size N with replacement. The sampling with
replacement scheme ensures that some data records—­and thus missing data patterns—­
appear more than once in each sample, whereas others do not appear at all. Drawing
many bootstrap samples (e.g., B > 1,000) and fitting the mediation model to each data set
produces an empirical sampling distribution of the product of coefficients estimator. The
percentile bootstrap defines the 95% confidence interval as the 2.5 and 97.5% quantiles
of the empirical sampling distribution, and the bias-­corrected bootstrap adjusts these
quantiles to compensate for any difference between the pooled point estimate and the
center of the bootstrap sampling distribution (Efron, 1987; MacKinnon, 2008, p. 334).
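The resampling scheme can be sketched as follows. The data are simulated stand-ins for the chronic pain variables, and simple OLS fits replace the full structural equation model; only the resample-refit-quantile logic carries over:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-ins for the chronic pain variables; the real analysis
# would read the incomplete data and fit the full mediation model.
n = 275
pain = rng.binomial(1, 0.4, n).astype(float)
interfere = 8.4 * pain + rng.normal(0, 9, n)
depress = 0.18 * interfere + 1.9 * pain + rng.normal(0, 7, n)

def indirect_effect(pain, interfere, depress):
    """Product of coefficients from the two mediation regressions."""
    alpha = np.polyfit(pain, interfere, 1)[0]  # INTERFERE on PAIN
    X = np.column_stack([np.ones_like(pain), interfere, pain])
    beta = np.linalg.lstsq(X, depress, rcond=None)[0][1]  # partial slope
    return alpha * beta

# Percentile bootstrap: draw B samples of size N with replacement,
# re-estimate, and take the 2.5 and 97.5% quantiles of the estimates.
B = 1_000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    boot[b] = indirect_effect(pain[idx], interfere[idx], depress[idx])
lcl, ucl = np.quantile(boot, [0.025, 0.975])
```

Sampling row indices with replacement is what allows some records (and their missing data patterns, in a real analysis) to appear more than once per sample while others drop out.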
The top panel of Table 10.8 gives the path coefficients and 95% confidence interval limits from the percentile and bias-corrected bootstraps. Based on its Type I error rates, the literature seems to favor the percentile bootstrap (Chen, 2018; Fritz, Taylor, & MacKinnon, 2012), but I report both for completeness. The bias correction induced a

TABLE 10.8. Product of Coefficients Estimator with Bootstrap Confidence Intervals
Percentile Bias-corrected
Parameter Est. LCL UCL Est. LCL UCL
Maximum likelihood
PAIN → INTERFERE (α) 8.50 6.73 10.34 8.50 6.69 10.30
INTERFERE → DEPRESS (β) 0.19 0.09 0.28 0.19 0.09 0.28
PAIN → DEPRESS (τ′) 1.92 –0.07 3.88 1.92 –0.09 3.86
Indirect Effect (αβ) 1.57 0.73 2.45 1.57 0.73 2.45

Multiple imputation
PAIN → INTERFERE (α) 8.49 6.70 10.27 8.49 6.70 10.28
INTERFERE → DEPRESS (β) 0.18 0.09 0.28 0.18 0.09 0.28
PAIN → DEPRESS (τ′) 1.92 –0.03 3.87 1.92 –0.03 3.88
Indirect Effect (αβ) 1.56 0.73 2.45 1.56 0.76 2.48

Note. LCL, lower confidence limit; UCL, upper confidence limit.

slight adjustment to the indirect effect’s confidence limits, but it had virtually no impact
on other intervals.
Focusing on the mediated effect, the product of coefficients estimate was αβ  = 1.57,
meaning that the change from mild or moderate pain to severe pain increased depres-
sion by 1.57 points via pain interference with daily life. The 95% confidence interval
limits, which spanned from 0.73 to 2.45, were asymmetrical around the point estimate,
because the bootstrap sampling distribution was asymmetrical (its shape was similar
to the posterior distribution in Figure 10.8). The interval indicates that the indirect
effect is significantly different from zero, because it does not include the null value. The
component path coefficients were also statistically significant, but the direct effect of
severe pain on depression was not significant after accounting for the indirect pathway.
Despite attacking the problem very differently, maximum likelihood estimation and the
bootstrap produced results that were numerically equivalent to the Bayesian analysis,
albeit with different interpretations.

Multiple Imputation
I used fully conditional specification with latent variables to create multiple imputations
for the mediation analysis. The imputation model included the three focal variables
and the three auxiliary variables (control, stress, and anxiety). I used a latent response
formulation for the incomplete binary variable, and I treated the 7-point stress rating (a
complete variable) as continuous (Rhemtulla et al., 2012).
Following the procedure from Section 7.4, fully conditional specification imputes
variables one at a time by stringing together a series of regression models, one per incom-
plete variable. This example requires four regressions, three of which are linear and one
of which is a probit model. To illustrate, the depression scale's imputation regression equation is as follows:

DEPRESSi = γ01 + γ11(INTERFEREi) + γ21(PAINi*) + γ31(ANXIETYi)    (10.24)
           + γ41(CONTROLi) + γ51(STRESSi) + r1i

Notice that the latent response variable appears as a regressor on the right side of the
equation (the classic formulation of fully conditional specification in MICE would
instead use the binary indicator). As a second example, the probit imputation model for
the latent pain severity scores is shown below:

PAINi* = γ03 + γ13(ANXIETYi) + γ23(DEPRESSi) + γ33(INTERFEREi)    (10.25)
         + γ43(CONTROLi) + γ53(STRESSi) + r3i

The residual variance is fixed at one to establish a metric, and the model also includes a
threshold parameter that divides the latent distribution into two regions.
After estimating various sets of regression model parameters, MCMC samples new
imputations from posterior predictive distributions based on the updated model param-
eters. For example, the missing depression scores are sampled from a normal distribu-
tion with center and spread equal to a predicted score and residual variance, respectively
(i.e., imputation = predicted value + noise). MCMC generates latent variable imputations
for the binary pain indicator, and the location of the continuous imputes relative to the
threshold parameter induces corresponding discrete values (e.g., a latent score below the
threshold implies little or moderate pain, and a continuous score above the threshold
implies a severe pain rating).
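A single imputation step might be sketched as follows. All parameter values are illustrative stand-ins, not estimates from the example; the point is the "predicted value + noise" draw for the continuous variable and the threshold rule for the binary one:

```python
import numpy as np

rng = np.random.default_rng(5)

# One imputation step for a missing depression score, under illustrative
# (not estimated) parameter values: imputation = predicted value + noise.
predicted_depress = 14.2   # from the current iteration's coefficients
residual_var = 45.0
depress_imp = rng.normal(predicted_depress, np.sqrt(residual_var))

# For the binary pain indicator, MCMC imputes the latent response score;
# its position relative to the threshold induces the discrete value.
threshold = 0.8
latent_pain = rng.normal(0.4, 1.0)  # probit residual SD fixed at 1
pain_imp = int(latent_pain > threshold)  # 1 = severe pain
```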
There are at least four ways to apply the bootstrap to multiply imputed data (Scho-
maker & Heumann, 2018). I focus on the two procedures that have the best support from
the literature. A multiple imputation nested within bootstrapping approach performs
resampling first and imputation second (Zhang & Wang, 2013; Zhang, Wang, & Tong,
2015). This procedure first creates B incomplete data sets by drawing bootstrap samples
with replacement from the original data, and it then applies multiple imputation to
create M complete data sets from each bootstrap sample. Reversing the process gives a
bootstrapping nested within multiple imputation procedure that performs imputation
first and resampling second (Wu & Jia, 2013). This approach first applies multiple impu-
tation to the data, and it then draws B bootstrap samples with replacement from each of
the M complete data sets. The analysis phase fits a model to each of the B × M data sets,
and the resulting estimates mix to form empirical sampling distributions.
I illustrate bootstrapping within multiple imputation, because it is more convenient
to implement. After creating M = 100 imputations, I fit the path model from Figure 10.7
to each imputed data set and used Rubin’s (1987) pooling rules to combine parameter
estimates. The pooled indirect effect is the average of the product of coefficient estimates
from the M filled-­in data sets (not the product of the average coefficients, α̂ and β̂).
αβ = (1/M) Σ_{m=1}^{M} α̂m β̂m    (10.26)
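A small numeric check makes the distinction concrete. The per-imputation estimates below are hypothetical; the identity holds for any set of them:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical estimates of alpha and beta from M = 100 imputed data sets.
M = 100
alpha_m = rng.normal(8.5, 0.3, M)
beta_m = rng.normal(0.18, 0.02, M)

# Rubin-style pooling averages the M products of coefficients...
pooled_ab = np.mean(alpha_m * beta_m)

# ...which differs from the product of the averaged coefficients by
# exactly the (population) covariance of the per-imputation estimates.
naive_ab = np.mean(alpha_m) * np.mean(beta_m)
gap = np.cov(alpha_m, beta_m, bias=True)[0, 1]
```

The two quantities coincide only when the per-imputation α̂ and β̂ estimates are uncorrelated, which is why the pooling rule averages the products.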

Next, I used sampling with replacement to create B = 500 nested bootstrap samples for
each of the M = 100 imputed data sets, and I fit the mediation model to each bootstrap
sample. The B × M = 50,000 estimates mix to form an empirical sampling distribution
that reflects within- and between-­imputation variation.
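The nesting structure can be sketched as follows, with simulated stand-ins for the imputed data sets and smaller B and M than the example (to keep the sketch fast), and OLS fits in place of the path model:

```python
import numpy as np

rng = np.random.default_rng(4)

n, M, B = 275, 20, 50  # smaller than the example's M = 100, B = 500

def indirect_effect(data):
    """alpha * beta from two OLS fits (stand-in for the path model)."""
    pain, interfere, depress = data.T
    alpha = np.polyfit(pain, interfere, 1)[0]
    X = np.column_stack([np.ones(len(pain)), interfere, pain])
    beta = np.linalg.lstsq(X, depress, rcond=None)[0][1]
    return alpha * beta

# Simulated stand-ins for M complete (imputed) data sets.
imputed_sets = []
for _ in range(M):
    pain = rng.binomial(1, 0.4, n).astype(float)
    interfere = 8.4 * pain + rng.normal(0, 9, n)
    depress = 0.18 * interfere + 1.9 * pain + rng.normal(0, 7, n)
    imputed_sets.append(np.column_stack([pain, interfere, depress]))

# Bootstrapping nested within multiple imputation: B resamples from
# each imputed data set; the B * M estimates mix into one empirical
# sampling distribution for the indirect effect.
mixed = np.empty(B * M)
for m, data in enumerate(imputed_sets):
    for b in range(B):
        mixed[m * B + b] = indirect_effect(data[rng.integers(0, n, n)])
lcl, ucl = np.quantile(mixed, [0.025, 0.975])
```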
The bottom panel of Table 10.8 gives the path coefficients and 95% confidence
interval limits from the percentile bootstrap and bias-­corrected bootstrap. To reiterate,
the 2.5 and 97.5% quantiles of the empirical bootstrap distribution define the percentile
bootstrap confidence interval, and the bias-corrected method shifts these quantiles
to account for the discrepancy between the pooled point estimate from Equation 10.26
and the center of the empirical sampling distribution (see MacKinnon, 2008, p. 334; Wu
& Jia, 2013). Focusing on the mediated effect, the product of coefficients estimate was αβ = 1.56, meaning that the change from mild or moderate pain to severe pain increased depression by 1.56 points via pain interference with daily life. The 95% confidence inter-
val limits, which spanned from 0.73 to 2.45, were asymmetrical around the point esti-
mate, because the bootstrap sampling distribution was skewed (its shape was similar to
the posterior distribution in Figure 10.8). The interval indicates that the indirect effect
is significantly different from zero, because it does not include the null value. The com-
ponent path coefficients were also statistically significant, but the direct effect of severe
pain on depression was not significant after accounting for the indirect pathway. The
multiple imputation and maximum likelihood results were effectively equivalent, and
the numeric estimates closely matched the Bayes analysis (albeit with different perspec-
tives on inference). The close correspondence of these methods has been a recurring
theme throughout the book.

10.6 STRUCTURAL EQUATION MODELS

Structural equation modeling analyses introduce unique challenges for missing data
handling, because they often involve large numbers of categorical variables (e.g., ques-
tionnaire or test items as indicators of a latent factor) and specialized analytic tasks
related to global and local fit assessments. I use the eating disorder risk data from the
companion website to illustrate a confirmatory factor analysis with item-level missing-
ness. The data comprise body mass index scores and 12 Eating Attitudes Test question-
naire items (Garner, Olmsted, Bohr, & Garfinkel, 1982) from a sample of N = 200 female
college athletes. Seven items were intended to measure a drive for thinness construct
that reflects excessive concern or preoccupation with weight gain, and five items mea-
sured dieting behaviors. Figure 10.9 shows a path diagram of the two-factor model. All
items used 6-point rating scales, and the stems are found in the Appendix and in Table
10.9.
Researchers have multiple options for fitting factor models with ordinal indicators
to complete data sets (e.g., see Jöreskog & Moustaki, 2001; Wirth & Edwards, 2007).
Perhaps the most common approach is to simply treat questionnaire items as continu-
ous and normally distributed. A second option is FIML estimation with a probit link
function for the discrete indicators. Following ideas from Chapter 6, the probit model
views each questionnaire item as arising from a normally distributed latent response


FIGURE 10.9. Two-­factor structure for 12 questionnaire items. Seven items measure a drive
for thinness construct that reflects excessive concern or preoccupation with weight gain, and
five items measure dieting behaviors.

variable, the distribution of which is separated into discrete segments by a set of thresh-
old parameters. The resulting factor analysis model describes the correlation structure
of these latent response variables (i.e., polychoric correlations) rather than the vari-
ances and covariances of the discrete indicators. Weighted (or diagonally weighted) least
squares (Finney & DiStefano, 2013; Muthén, 1984; Muthén, du Toit, & Spisic, 1997) is
a third option that also targets latent variable associations.
All things being equal, you might expect that estimators for categorical data are
preferable, because they are theoretically more correct, but that isn’t necessarily the
case. With complete data, full information estimation for item-level factor analysis is
restricted to simple models with few factors and indicators, and missing data analy-
ses are no different. Although the two-­factor model in Figure 10.9 is relatively simple,
the corresponding saturated or unrestricted model—a multivariate contingency table
rather than the usual sample means and variance–­covariance matrix—­is too complex
to estimate. For example, a saturated model for just two of the 6-point items is a 6 × 6
contingency table with 36 cells, a model for three of the items has 6 × 6 × 6 = 216 cells,
and so on. As you can imagine, the multivariate contingency table for 12 ordinal items

TABLE 10.9. FIML Standardized Loadings with Robust Standard Errors

Normality assumed    Probit link
Item stem Est. SE Est. SE
Drive for thinness
Am terrified about being overweight. 0.67 0.05 0.71 0.05
Avoid eating when I am hungry. 0.61 0.08 0.69 0.06
Feel extremely guilty after eating. 0.71 0.05 0.76 0.05
Am preoccupied with a desire to be thinner. 0.81 0.04 0.86 0.03
Think about burning up calories when I exercise. 0.77 0.04 0.79 0.04
Am preoccupied with . . . fat on my body. 0.78 0.04 0.82 0.04
Like my stomach to be empty. 0.70 0.05 0.74 0.05

Dieting behavior
Aware of the calorie content of foods that I eat. 0.75 0.04 0.76 0.04
Particularly avoid food with a high carbohydrate. 0.58 0.06 0.65 0.06
Avoid foods with sugar in them. 0.59 0.06 0.67 0.06
Eat diet foods. 0.72 0.05 0.76 0.05
Engage in dieting behavior. 0.83 0.04 0.88 0.03

Note. FIML, full information maximum likelihood.

is intractably large. Unfortunately, the absence of a saturated model rules out global fit
assessments.
Although conceptually similar, weighted least squares is a limited information
estimator that works from bivariate contingency tables (and associations) rather than
high-­dimensional multivariate data. Estimation happens in two steps. The first stage
estimates threshold parameters and the polychoric correlation for each pair of latent
variables, and the second stage engages an iterative optimization routine that minimizes
the sum of squared standardized differences between the first stage (saturated model)
estimates and the thresholds and correlations predicted by the factor analysis model.
Simulation studies show that weighted least squares estimators tend to require relatively
large sample sizes to achieve their optimal properties (Rhemtulla et al., 2012; Satorra &
Bentler, 1988, 1994; Savalei, 2014), perhaps much larger than this data set. Importantly,
this estimator assumes an MCAR mechanism, because the first stage uses pairwise dele-
tion to estimate the polychoric correlations.
I focus on maximum likelihood and multiple imputation for this example, because
they readily connect with familiar complete-­data structural equation modeling proce-
dures. Maximum likelihood estimation provides two routes: Treat the ordered-­categorical
indicators as normally distributed variables or use full information estimation with a
probit link function to model the latent response variables. Multiple imputation creates
complete sets of item responses that are amenable to normal-­theory maximum likeli-
hood estimation or weighted least squares. A third option is to save the latest response
variables with the imputed data sets and use the continuous variables as indicators.

Maximum Likelihood Estimation


Maximum likelihood estimation is well suited for confirmatory factor analyses, and
structural equation modeling software packages have long offered full information
missing data-­handling routines. To begin, I incorrectly treated the questionnaire items
as normally distributed variables and applied robust corrections to standard errors and
model fit statistics (see Sections 2.8 and 2.12) to counteract the impact of non-­normality.
This method has support from the literature if the indicators have five or more response
options and symmetrical distributions (Rhemtulla et al., 2012). The Eating Attitudes
Test items use 6-point rating scales, but the distributions are mostly asymmetrical, with
fewer high ratings.
The goal of maximum likelihood estimation is to identify the factor model param-
eters that minimize the standardized distances between the observed data and model-­
implied mean vector and covariance matrix. Equation 3.26 shows observed-­data log-­
likelihood (the function to be maximized), and the predicted means and variance–­
covariance matrix are a function of the factor model parameters as follows:

μ(θ) = Λα + υ    (10.27)

Σ(θ) = ΛΨΛ′ + Θ

where Λ is a 12 × 2 factor loading matrix, α is the vector of factor means, υ contains the 12 regression intercepts, Ψ is the 2 × 2 factor covariance matrix, and Θ is a 12 × 12
residual covariance matrix with variances on the diagonal and zeros elsewhere. I scaled
the latent variables as z-scores by fixing the factor means and variances to 0 and 1,
respectively. This specification defines the intercepts as the item means.
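Equation 10.27 can be evaluated directly from the matrices it names. The parameter values below are illustrative rather than estimated, and the scaling matches the text (factor means 0, factor variances 1):

```python
import numpy as np

# Model-implied moments from Equation 10.27, using illustrative (not
# estimated) parameter values for the two-factor structure.
loadings = np.zeros((12, 2))     # Lambda: 12 items, 2 factors
loadings[:7, 0] = 0.7            # drive-for-thinness items
loadings[7:, 1] = 0.7            # dieting items
psi = np.array([[1.0, 0.83],
                [0.83, 1.0]])    # factor covariances, variances fixed at 1
theta = np.diag(np.full(12, 0.5))  # residual variances on the diagonal
factor_means = np.zeros(2)       # fixed at 0 to scale the factors
intercepts = np.full(12, 3.0)    # equal the item means under this scaling

mu = loadings @ factor_means + intercepts    # mu(theta)
sigma = loadings @ psi @ loadings.T + theta  # Sigma(theta)
```

With the factor means fixed at zero, the model-implied means reduce to the intercepts, which is why the text equates intercepts with item means.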
Identifying impactful auxiliary variables for a factor model is usually difficult,
because the strongest correlates of the missing item responses are other items from
the same scale. To illustrate the process, I used Graham’s (2003) saturated correlates
approach to incorporate body mass index into the model, as this variable is a correlate
of the questionnaire items and potential determinant of nonresponse. This specification
uses correlated residuals to link the auxiliary variable to the 12 indicators, but it leaves
the extra variable correlated with the latent factors (see Section 3.10).
Table 10.9 gives the standardized factor loadings and robust (sandwich estimator)
standard errors from the analysis. The standardized loadings are essentially correlations
between the items and the latent factors. All items had strong positive associations with
their respective constructs, and the factors were highly correlated at .83. The right panel
of Table 10.9 also shows the standardized loadings from a full information analysis with
a probit link. As explained earlier, the probit model describes the correlation structure
of the latent response variables. The two analyses produced noticeable discrepancies in
several cases, with loadings that differed by nearly one standard error unit. I focus on
the normal-­theory results for fit assessments, because the probit model does not provide
indices beyond the AIC and BIC (the unstructured or saturated model is too complex to
estimate).
The Satorra–­Bentler chi-­square (Satorra & Bentler, 1994) uses higher moments
from the data to compute a correction term that rescales the likelihood ratio test sta-
tistic to have the same expected value or mean as an optimal statistic computed from
multivariate normal data (see Section 2.11). Simulation studies suggest that robust test
statistics may counteract the biasing effects of non-­normal missing data in some situ-
ations (Enders, 2001; Rhemtulla et al., 2012; Savalei & Bentler, 2005, 2009; Savalei &
Falk, 2014; Yuan & Bentler, 2000; Yuan & Zhang, 2012). The rescaled chi-­square from
the analysis was statistically significant, TSB(53) = 106.48, p < .001, indicating that the
two-­factor model did not adequately explain the sample variances and covariances.
Researchers routinely supplement the model fit statistic with other absolute or rel-
ative fit indices (McDonald & Ho, 2002). Popular options include the Tucker–­Lewis
Index or non-­normed fit index (TLI or NNFI; Bentler & Bonett, 1980; Tucker & Lewis,
1973), the comparative fit index (CFI; Bentler, 1990), and the root mean square error of
approximation (RMSEA; Browne & Cudeck, 1992; Steiger, 1989, 1990; Steiger & Lind,
1980). Incremental fit indices such as the TLI and CFI compare the relative fit of two
nested models, the first of which is the hypothesized model (e.g., the confirmatory fac-
tor analysis model), and the second of which is a more restrictive null or baseline model.
With certain exceptions (e.g., longitudinal growth curves; Widaman & T ­ hompson,
2003), the usual baseline model includes means and variances but fixes all correlations
to zero.
The TLI and CFI give the proportional improvement of the hypothesized model rel-
ative to that of the baseline model (e.g., TLI = .95 means that the hypothesized model’s
fit is a 95% improvement over that of the baseline model). These indices are

TLI = [(T0 ÷ df0) − (TLR ÷ df)] ÷ [(T0 ÷ df0) − 1]    (10.28)

CFI = [max(T0 − df0, 0) − max(TLR − df, 0)] ÷ max(T0 − df0, 0)

where T0 and TLR are the chi-­square statistics for the null (baseline) and hypothesized
models, respectively, and df0 and df are their corresponding degrees of freedom. In con-
trast, the RMSEA is an absolute index that estimates population misfit of the hypoth-
esized model per degree of freedom.
RMSEA = √[max(TLR − df, 0) ÷ df(N − 1)]    (10.29)
Robust versions of these indices replace normal-­theory test statistics with their rescaled
counterparts.
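These formulas translate directly into code. The model chi-square (106.48 on 53 df) and N = 200 come from the example, but the baseline chi-square is a hypothetical stand-in, because the text does not report it:

```python
import math

def fit_indices(t_lr, df, t0, df0, n):
    """TLI, CFI, and RMSEA from Equations 10.28 and 10.29."""
    tli = ((t0 / df0) - (t_lr / df)) / ((t0 / df0) - 1)
    cfi = (max(t0 - df0, 0) - max(t_lr - df, 0)) / max(t0 - df0, 0)
    rmsea = math.sqrt(max(t_lr - df, 0) / (df * (n - 1)))
    return tli, cfi, rmsea

# Baseline values are hypothetical; df0 = 66 corresponds to fixing the
# 12 * 11 / 2 item correlations to zero, but t0 is simply assumed here.
tli, cfi, rmsea = fit_indices(t_lr=106.48, df=53, t0=900.0, df0=66, n=200)
```

Because the RMSEA depends only on the reported model chi-square, degrees of freedom, and sample size, the function reproduces the .071 value given in the text regardless of the assumed baseline.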
Consistent with the global chi-­square test, robust indices indicated that fit is inad-
equate by conventional standards: TLI = .921, CFI = .936, and RMSEA = .071 (Hu &
Bentler, 1999). Researchers routinely use modification indices (also known as score
tests) to identify specific sources of misfit in models such as this. The modification
index is a chi-­square statistic that reflects the predicted change in model fit that would
result from a single additional path (MacCallum, 1986; Sörbom, 1989). These tests have
a long history in the structural equation modeling literature but require caution, because
they capitalize on chance (Bollen, 1989; Byrne, Shavelson, & Muthén, 1989; Kaplan,
1990; MacCallum, 1986; MacCallum, Roznowski, & Necowitz, 1992; Whittaker, 2012).
The analysis produced three large modification indices. The first pointed to an omitted
cross-­loading from the dieting behavior factor to the drive for thinness item “Think
about burning up calories when I exercise.” The modification index was χ2(1) = 10.02
(p < .01), and the predicted value of the omitted standardized loading was .46. The other
two large indices involved residual covariances between pairs of dieting behavior items.
The first indicated that adding a covariance between “Particularly avoid food with a high
carbohydrate . . . ” and “Avoid foods with sugar in them” would significantly improve
model fit, χ2(1) = 18.92 (p < .001), and the second predicted a similar improvement for
the covariance between “Eat diet foods” and “Engage in dieting behavior,” χ2(1) = 19.68
(p < .001). The large projected values of the residual correlations (.39 and .55, respec-
tively) further point to these omitted paths as important sources of misfit.

Imputation Models
There are several compelling reasons to use multiple imputation with structural equa-
tion models, not the least of which is its flexibility with mixtures of categorical and con-
tinuous variables. A number of recent papers have extended procedures for fit assess-
ments to multiply imputed data (Chung & Cai, 2019; Enders & Mansolf, 2018; Lee
& Cai, 2012; Mansolf, Jorgensen, & Enders, 2020), and researchers now have the full
complement of tools necessary to carry out structural equation modeling analyses.
The joint model imputation and fully conditional specification procedures that I
apply to the example are agnostic in the sense that they do not impose a particular struc-
ture or pattern on the means and associations. In the parlance of the structural equa-
tion modeling literature, the imputation phase uses a saturated or just-­identified model
(Bollen, 1989; Kline, 2015) that spends all available degrees of freedom. For the joint
model, this is an unrestricted mean vector and covariance matrix, and fully conditional
specification uses an equivalent set of regressions. It is also possible to employ a model-
based approach that uses a factor analysis model for imputation (e.g., H0 imputation;
Asparouhov & Muthén, 2010c). Model-based imputation is attractive from a precision
perspective, because it requires far fewer parameters. For example, the factor model in
Figure 10.9 has 53 degrees of freedom, one for each omitted path or restriction placed
on the covariance matrix. Employing a restrictive imputation procedure to this analysis
would limit the range of models that could be estimated and compared, because the
resulting imputations presuppose perfect fit (e.g., global tests of model fit would be
suspect). Nevertheless, this approach warrants consideration when the number of indi-
cators is very large relative to the sample size, because it should converge more reliably
than an unrestricted imputation model.
The joint modeling framework invokes a multivariate normal distribution for the 12
latent response variables and numerical body mass index scores, and the corresponding
imputation model parameters are a mean vector and variance–­covariance matrix. The
13-dimensional normal distribution for the focal variables and auxiliary variables is as
follows:
434 Applied Missing Data Analysis

$$
\begin{pmatrix} \text{DRIVE1}^{*}_{i} \\ \vdots \\ \text{DRIVE7}^{*}_{i} \\ \text{DIETING1}^{*}_{i} \\ \vdots \\ \text{DIETING5}^{*}_{i} \\ \text{BMI}_{i} \end{pmatrix}
\sim N_{13}\left(
\begin{pmatrix} \mu_{1} \\ \vdots \\ \mu_{7} \\ \mu_{8} \\ \vdots \\ \mu_{12} \\ \mu_{13} \end{pmatrix},
\begin{pmatrix}
1 \\
\vdots & \ddots \\
\sigma_{7 \cdot 1} & \cdots & 1 \\
\sigma_{8 \cdot 1} & \cdots & \sigma_{8 \cdot 7} & 1 \\
\vdots & & \vdots & & \ddots \\
\sigma_{12 \cdot 1} & \cdots & \sigma_{12 \cdot 7} & \sigma_{12 \cdot 8} & \cdots & 1 \\
\sigma_{13 \cdot 1} & \cdots & \sigma_{13 \cdot 7} & \sigma_{13 \cdot 8} & \cdots & \sigma_{13 \cdot 12} & \sigma^{2}_{13}
\end{pmatrix}
\right) \quad (10.30)
$$

Following established notation, the asterisk superscripts denote latent response vari-
ables, the variances of which are fixed at one to establish a scale. The model also incor-
porates five threshold parameters per item that divide the latent continuum into discrete
segments (see Section 6.4).
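To make the threshold mechanism concrete, the sketch below (with made-up threshold values) shows how five z-score cutpoints carve a latent response distribution into six ordinal categories; `np.digitize` counts the thresholds that fall below each latent score.

```python
import numpy as np

# Hypothetical z-score thresholds; five cutpoints divide the latent
# continuum into six ordinal categories (1-6), as described in Section 6.4.
thresholds = np.array([-1.5, -0.5, 0.0, 0.6, 1.4])

rng = np.random.default_rng(1)
latent = rng.standard_normal(8)   # latent response variable scores

# np.digitize counts how many thresholds fall below each latent score,
# so adding 1 converts the count to a category code from 1 to 6.
categories = np.digitize(latent, thresholds) + 1
print(list(zip(latent.round(2), categories)))
```

The same mapping runs in reverse during imputation: a latent score landing between the third and fourth thresholds implies an observed response in the fourth category.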
Fully conditional specification uses a sequence of regression models to impute vari-
ables in a round-robin fashion. I use a fully latent version of the procedure that models
associations among continuous and latent response variables. This approach invokes a
linear regression for the body mass index and a probit regression for each categorical
variable. To illustrate, the probit imputation model for the first drive for thinness item
is as follows:

$$\text{DRIVE1}^{*}_{i} = \gamma_{0} + \gamma_{1}\left(\text{DRIVE2}^{*}_{i}\right) + \cdots + \gamma_{6}\left(\text{DRIVE7}^{*}_{i}\right) + \gamma_{7}\left(\text{DIETING1}^{*}_{i}\right) + \cdots + \gamma_{11}\left(\text{DIETING5}^{*}_{i}\right) + \gamma_{12}\left(\text{BMI}_{i}\right) + r_{i} \quad (10.31)$$

The latent response variable’s residual variance is fixed at one to establish its scale, and
the model also requires five threshold parameters (one of which is fixed) that divide
the underlying latent distribution into six discrete segments. As a second example, the
linear regression imputation model for body mass index is shown below:

$$\text{BMI}_{i} = \gamma_{0} + \gamma_{1}\left(\text{DRIVE1}^{*}_{i}\right) + \cdots + \gamma_{7}\left(\text{DRIVE7}^{*}_{i}\right) + \gamma_{8}\left(\text{DIETING1}^{*}_{i}\right) + \cdots + \gamma_{12}\left(\text{DIETING5}^{*}_{i}\right) + r_{i} \quad (10.32)$$

Unlike the conventional MICE specification, the fully latent version of fully conditional
specification features latent response variables as regressors on the right side of all
imputation models.
After sequentially estimating various sets of model parameters, MCMC samples
new imputations from posterior predictive distributions based on the updated param-
eter values. For example, the missing body mass index scores are sampled from a normal
distribution with center and spread equal to a predicted score and residual variance,
respectively (i.e., imputation = predicted value + noise). The probit regression models
produce latent response variable imputations for the entire sample (recall that latent
scores are restricted to a particular region of the distribution if the discrete response is
observed, and they are unrestricted otherwise). The location of the continuous imputes
relative to the estimated threshold parameters induces discrete imputes for the filled-­in
Special Topics and Applications
data sets. The estimated latent response scores that spawn the categorical responses can
also be saved alongside the discrete imputes.
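A minimal sketch of one such posterior predictive draw, assuming invented parameter values throughout (the predicted scores, residual standard deviation, thresholds, and observed category are all hypothetical): the continuous imputation adds normal noise to a predicted value, and the latent response draw uses inverse-CDF sampling to respect the interval implied by the observed category.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# --- Continuous variable: imputation = predicted value + noise ---
predicted_bmi = 24.3   # hypothetical predicted score from a model like Eq. 10.32
residual_sd = 3.1      # hypothetical residual standard deviation
bmi_impute = predicted_bmi + rng.normal(0.0, residual_sd)

# --- Latent response: draw restricted to the observed category's region ---
# Hypothetical thresholds for a 6-category item; an observed response of 4
# confines the latent score to the interval (tau[3], tau[4]).
tau = np.array([-np.inf, -1.5, -0.5, 0.0, 0.6, 1.4, np.inf])
observed_category = 4
lo, hi = tau[observed_category - 1], tau[observed_category]

mu = 0.35  # hypothetical predicted latent score (residual variance fixed at 1)
# Inverse-CDF sampling from a normal distribution truncated to (lo, hi)
u = rng.uniform(norm.cdf(lo, mu, 1.0), norm.cdf(hi, mu, 1.0))
latent_impute = norm.ppf(u, mu, 1.0)
```

When the discrete response is missing, the truncation bounds are simply dropped, and the latent draw is unrestricted.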

Multiple Imputation with Normal‑Theory Estimation


The first analysis applied normal-­theory maximum likelihood estimation to the cat-
egorical imputations, and I used robust corrections to standard errors and model fit
statistics to counteract normality violations (Rhemtulla et al., 2012; Satorra & Bentler,
1988, 1994; Savalei, 2014). Prior to creating imputations, I performed an exploratory
analysis and used trace plots and potential scale reduction factor diagnostics (Gelman &
Rubin, 1992) to evaluate convergence. This step is especially important when imputing
questionnaire items, because threshold parameters tend to converge slowly and require
long burn-in periods (Cowles, 1996; Nandram & Chen, 1996). Based on this diagnostic
run, I specified 100 parallel imputation chains with 10,000 iterations each, and I saved
the data at the final iteration of each chain. After creating the multiple imputations, I fit
the confirmatory factor analysis model in Figure 10.9 to each imputed data set and used
Rubin’s (1987) pooling rules to combine the estimates and standard errors.
The leftmost panel of Table 10.10 gives the pooled standardized factor loadings
and robust standard errors for fully conditional specification (the joint model estimates
were virtually identical). You might expect multiple imputation standard errors to be

TABLE 10.10. Multiple Imputation Standardized Loadings with Robust Standard Errors

                                                     Normality
                                                      assumed        WLS       Latent response
Item stem                                            Est.   SE   Est.   SE      Est.   SE

Drive for thinness
Am terrified about being overweight.                 0.68  0.05  0.71  0.04     0.73  0.04
Avoid eating when I am hungry.                       0.62  0.07  0.69  0.05     0.69  0.05
Feel extremely guilty after eating.                  0.71  0.05  0.76  0.04     0.77  0.04
Am preoccupied with a desire to be thinner.          0.82  0.03  0.86  0.02     0.87  0.03
Think about burning up calories when I exercise.     0.77  0.04  0.82  0.03     0.79  0.04
Am preoccupied with . . . fat on my body.            0.78  0.04  0.81  0.03     0.82  0.03
Like my stomach to be empty.                         0.70  0.05  0.77  0.04     0.73  0.04

Dieting behavior
Aware of the calorie content of foods that I eat.    0.74  0.04  0.79  0.04     0.76  0.04
Particularly avoid food with a high carbohydrate.    0.58  0.06  0.68  0.04     0.65  0.06
Avoid foods with sugar in them.                      0.59  0.06  0.73  0.05     0.67  0.06
Eat diet foods.                                      0.72  0.05  0.75  0.04     0.77  0.04
Engage in dieting behavior.                          0.82  0.04  0.88  0.03     0.88  0.03

Note. WLS, weighted least squares.


somewhat larger than those of the direct estimator in Table 10.9, because the initial
imputation stage employs an unrestricted model that spends all 53 of the factor model’s
degrees of freedom (Collins et al., 2001). However, this doesn’t appear to be the case, as
the two sets of standard errors were quite similar. For all intents and purposes, imputing
the data gave the same results as direct maximum likelihood estimation based on the
observed data. This has been a recurring theme throughout the book.
Evaluating global model fit is a standard step in any structural equation modeling
analysis. Fit assessments nearly always include a test statistic comparing the research-
er’s model (e.g., the two-­factor model) to an optimal saturated model that places no
restrictions on the data. Meng and Rubin’s (1992) pooled likelihood ratio statistic serves
this purpose for multiply imputed data (Enders & Mansolf, 2018; Lee & Cai, 2012), and
the so-­called D2 statistic (Li, Meng, et al., 1991; Rubin, 1987) is another option for pool-
ing goodness-­of-fit tests.
The naive Meng and Rubin (1992) statistic that assumes normality was statisti-
cally significant, χ2(53) = 130.10, p < .001, indicating that the two-­factor model did not
adequately explain the sample variances and covariances. The test can be robustified by
computing the Satorra and Bentler (1994) scaling factor from each data set and using
the arithmetic average to rescale the chi-­square statistic (Jorgensen, Pornprasertmanit,
Schoemann, & Rosseel, 2021). The rescaled chi-­square was also statistically significant,
TSB(53) = 107.36, p < .001, but it was a much closer match to that of the direct estimator.
A second option is to compute the rescaled test statistic for each data set and use the
D2 procedure to pool the chi-­square values. The pooled rescaled statistic was well cali-
brated to the other robust options, TD2(53) = 110.46, p < .001.
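As a rough sketch of the D2 computation (following the common presentation of the Li, Meng, et al., 1991, procedure; the chi-square values below are invented), the method averages the statistics, estimates the relative increase in variance from the square roots of the statistics, and refers the result to an F distribution.

```python
import numpy as np

def d2_pool(chi_squares, df):
    """Pool M chi-square statistics via the D2 procedure
    (Li, Meng, Raghunathan, & Rubin, 1991)."""
    d = np.asarray(chi_squares, dtype=float)
    m = len(d)
    # Relative increase in variance, computed from the square-root statistics
    r = (1 + 1 / m) * np.var(np.sqrt(d), ddof=1)
    d2 = (d.mean() / df - ((m + 1) / (m - 1)) * r) / (1 + r)
    # Denominator degrees of freedom for the F reference distribution
    nu = df ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2
    return d2, nu

# Hypothetical chi-square values from five imputed data sets (df = 53)
stat, nu = d2_pool([128.4, 131.9, 125.7, 133.2, 130.1], df=53)
# stat is compared to an F(53, nu) reference distribution
```

Because D2 needs only the test statistics themselves, it applies to any chi-square-distributed quantity, including the rescaled Satorra–Bentler statistics described above.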
At present, relatively little is known about the behavior of these fit statistics with
multiply imputed data. Limited simulation results suggest that the Meng and Rubin
chi-­square may lack power relative to direct maximum likelihood estimation (Enders
& Mansolf, 2018). Additionally, the test statistic is not invariant to changes in model
parameterization, and using different identification constraints (e.g., setting a loading to
one rather than fixing the factor variance) will lead to different test statistics; the test is
similar to the Wald statistic in this respect (Gonzalez & Griffin, 2001). In practice, the
changes to the test statistic across different parameterizations tend to be very small and
should not materially impact decisions about model fit (Enders & Mansolf, 2018). This
is not an issue for the D2 procedure.
Returning to the fit indices in Equations 10.28 and 10.29, a natural way to compute
imputation-­based versions of these measures is to substitute pooled chi-­square statistics
into the expressions (Enders & Mansolf, 2018; Lee & Cai, 2012; Muthén & Muthén,
1998–2017). Robust indices based on pooled Satorra–­Bentler chi-­square statistics gave
TLI = .920, CFI = .935, and RMSEA = .086. Again, the literature offers little guidance on
the behavior of these fit measures, but limited simulation results suggest that Meng and
Rubin’s (1992) likelihood ratio statistic works well when data are multivariate normal
(Enders & Mansolf, 2018). Considered as a whole, the fit statistics suggest the two-­factor
model doesn’t adequately describe the correlations in the data (Hu & Bentler, 1999).
I previously used modification indices (MacCallum, 1986; Sörbom, 1989) to iden-
tify potential sources of model misfit, and these diagnostics were recently developed
for multiply imputed data (Mansolf et al., 2020). Consistent with the direct estimation
results, the analysis produced three large modification indices. The first pointed to an
omitted cross-­loading from the dieting behavior factor to the drive for thinness item
“Think about burning up calories when I exercise.” The modification index was χ2(1) =
10.82 (p < .01), and the predicted value of the omitted standardized loading was .43. The
other two large indices involved residual covariances between pairs of dieting behavior
items. The first indicated that adding a covariance between “Particularly avoid food
with a high carbohydrate . . . ” and “Avoid foods with sugar in them” would significantly
improve model fit, χ2(1) = 26.65 (p < .001), and the second predicted a similar improve-
ment for the covariance between “Eat diet foods” and “Engage in dieting behavior,” χ2(1)
= 23.04 (p < .001). The large projected values of the residual correlations (.43 and .55,
respectively) further point to these omitted paths as important sources of misfit.
I generated the previous analysis results by feeding imputed data sets into a capable
structural equation modeling program. Lee and Cai (2012) outlined an alternative two-
stage estimation strategy that uses the pooled mean vector and covariance matrix as
input data. In fact, their procedure is the multiple imputation analogue of the two-stage
maximum likelihood estimator described in Section 3.10 (Savalei & Bentler, 2009). The
first stage of the procedure uses multiple imputation to treat the missing data, and the
second stage uses the pooled mean vector and covariance matrix as input data to the
classic maximum likelihood discrepancy function shown below:

$$f\left(\theta \mid \hat{\mu}, \hat{S}\right) = -\ln\left|\hat{S}\Sigma^{-1}(\theta)\right| + \left(\hat{\mu} - \mu(\theta)\right)'\Sigma^{-1}(\theta)\left(\hat{\mu} - \mu(\theta)\right) + \text{tr}\left\{\hat{S}\Sigma^{-1}(\theta)\right\} - V \quad (10.33)$$

In this context, μ̂ and Ŝ are pooled estimates of the sample means and variance–­
covariance matrix, but the discrepancy function is otherwise the same as that found
in the structural equation modeling literature (Bollen, 1989; Jöreskog, 1969; Kaplan,
2009).
The maximum likelihood estimator identifies the factor model parameters that
minimize the difference between the first stage estimates in μ̂ and Ŝ and the model-­
implied moments in μ(θ) and Σ(θ). Because the discrepancy function makes no refer-
ence to observed data values, it incorrectly assumes there are no missing values. This
has no bearing on the estimates, but standard errors and model-fit statistics will be
too small, because they fail to account for imputation noise (between-­imputation varia-
tion). To counteract this problem, Lee and Cai (2012) use the between-­imputation varia-
tion of the sample moments and a key result from Browne’s (1984) famous paper on
distribution-­free estimation to derive an adjustment to the standard errors and model
fit statistic. Their correction is analogous to the one described by Savalei and Bentler
(2009) for maximum likelihood estimation, and a SAS macro for implementing the two-
stage approach is available on the Internet.
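Equation 10.33 is straightforward to evaluate numerically. The sketch below (with made-up two-variable moments, and V taken as the number of observed variables) confirms a defining property of the discrepancy function: it equals zero when the model-implied moments reproduce the input moments exactly, and it is positive otherwise.

```python
import numpy as np

def ml_discrepancy(mu_hat, s_hat, mu_model, sigma_model):
    """Evaluate the normal-theory ML discrepancy in Equation 10.33,
    with V equal to the number of observed variables."""
    v = len(mu_hat)
    sigma_inv = np.linalg.inv(sigma_model)
    diff = np.asarray(mu_hat) - np.asarray(mu_model)
    return (-np.log(np.linalg.det(s_hat @ sigma_inv))
            + diff @ sigma_inv @ diff
            + np.trace(s_hat @ sigma_inv) - v)

# Made-up pooled moments for two variables
mu_hat = np.array([0.0, 0.5])
s_hat = np.array([[1.0, 0.3], [0.3, 1.0]])

# A saturated model reproduces the input moments, so the fit is perfect
f_saturated = ml_discrepancy(mu_hat, s_hat, mu_hat, s_hat)   # -> 0
f_misfit = ml_discrepancy(mu_hat, s_hat, mu_hat, np.eye(2))  # -> positive
```

In the two-stage approach, an optimizer searches for the θ that minimizes this function given the pooled μ̂ and Ŝ as fixed inputs.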

Multiple Imputation with Weighted Least Squares Estimation


A second and perhaps more theoretically appropriate use for these imputations is to fit
a factor analysis model for ordinal indicators. Two popular options are weighted least
squares and diagonally weighted least squares (Finney & DiStefano, 2013; Muthén,
1984; Muthén et al., 1997). Like probit regression, these estimators model associations
among the underlying latent variables. The procedure happens in two stages. The first
stage estimates threshold parameters (i.e., the z-score cutoff points that divide the latent
distribution into discrete segments) and the polychoric correlation for each pair of latent
variables, and the second stage engages an iterative optimization routine that minimizes
the sum of squared standardized differences between the first stage estimates and the
thresholds and correlations predicted by the factor analysis model. The fit function that
gives these weighted discrepancies is

$$f\left(\theta \mid \hat{\sigma}\right) = \left(\hat{\sigma} - \sigma(\theta)\right)' W^{-1} \left(\hat{\sigma} - \sigma(\theta)\right) \quad (10.34)$$

where σ̂ is a vector containing the thresholds and latent variable correlations from the
first stage, σ(θ) is the corresponding vector of model-­predicted thresholds and correla-
tions, and W is a weight matrix that standardizes the squared deviation scores (e.g., the
variance–­covariance matrix of the estimates or the diagonal of that matrix in the case of
diagonally weighted least squares).
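The first-stage threshold estimates are simply z-score cutpoints placed at the cumulative category proportions. A sketch with invented response counts for one 6-point item:

```python
import numpy as np
from scipy.stats import norm

# Invented response counts for one 6-point item (categories 1-6)
counts = np.array([35, 60, 45, 30, 20, 10])
cum_props = np.cumsum(counts / counts.sum())

# Thresholds are the z-scores at the cumulative proportions; the final
# cumulative value (1.0) maps to +infinity and is dropped.
thresholds = norm.ppf(cum_props[:-1])
```

The polychoric correlations require an iterative search, but they follow the same logic: find the latent correlation that best reproduces each pair's observed cross-tabulation given the thresholds.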
Weighted least squares is referred to as a limited-­information estimator, because
the initial stage derives polychoric correlations on a pairwise basis from two-way
cross-­tabulation tables that ignore the multivariate distribution of the categorical data
(Maydeu-­Olivares & Joe, 2005; Olsson, 1979; Rhemtulla et al., 2012). Estimating these
latent variable correlations requires complete data, and pairwise deletion is the default
option in software packages. Filling in the data prior to estimating the polychoric cor-
relations is a better solution, because it maximizes the sample size and requires a less
stringent conditionally MAR process where missingness depends on the observed data.
To illustrate, I applied weighted least squares estimation to the filled-­in data sets from
the previous example. Chung and Cai (2019) extended the aforementioned two-stage
estimator (Lee & Cai, 2012) to categorical variables, so their approach is an alternative
to what I describe here.
The middle panel of Table 10.10 gives the standardized factor loadings and their
standard errors. A desirable feature of complete-­data weighted least squares estimation
is that it provides a model fit statistic and the usual selection of fit indices. Liu and Sri-
utaisuk (2019) proposed using the D2 procedure to pool weighted least squares test statis-
tics, and their computer simulations support this strategy, particularly when the analysis
model includes variables with little or no missing data that correlate with the incomplete
variables. Small sample size aside, this example is an optimal application, because each
latent factor has indicators with little or no missing data. The pooled chi-­square from the
factor analysis was significant, χ2(53) = 100.92, p < .001, and substituting the test statistic
into the earlier fit expressions gave TLI = .582, CFI = .665, and RMSEA = .067.

Multiple Imputation with Latent Response Variables


Fully conditional specification with latent response variables offers the interesting pos-
sibility of saving and analyzing the underlying latent scores in lieu of the categorical
indicators (Keller & Enders, 2021). Doing so has the advantage of converting a complex
categorical variable model into a simpler one for multivariate normal data. Moreover,
the latent response variables have a natural substantive interpretation as continuous
estimates of the various concerns, preoccupations, and behaviors captured by the ques-
tionnaire items. The procedure is conceptually equivalent to full information estimation
with a probit link but provides a mechanism for estimating a saturated model and evalu-
ating model fit. As mentioned elsewhere, generating latent replacements for categori-
cal indicators has a rich history in the psychometrics literature and is routinely used
in large-scale assessment settings where latent imputations are referred to as plausible
values (Asparouhov & Muthén, 2010d; Lee & Cai, 2012; Mislevy, 1991; Mislevy, Beaton,
Kaplan, & Sheehan, 1992; von Davier, Gonzalez, & Mislevy, 2009).
Because the latent item responses have 100% missing data, more data sets are needed
to maximize precision and minimize Monte Carlo simulation error. In my experience,
increasing the number of imputations from 100 to 500 can have a meaningful impact on
test statistics and probability values, with additional increases providing diminishing
returns. To this end, I created M = 500 filled-­in data sets by saving the imputations and
latent variable scores from the final iteration of 500 parallel MCMC chains with 10,000
iterations each. The latent data are normal by construction, so a conventional maximum
likelihood estimator is optimal, and no robust corrections are necessary. After fitting the
two-­factor model to each latent data set, I used Rubin’s (1987) pooling rules to combine
the results. The rightmost panel in Table 10.10 shows the pooled standardized factor
loadings and standard errors. The latent response variable analysis produced noticeable
discrepancies with normal-­theory and weighted least squares estimation, with several
loadings that differed by up to one standard error unit. However, the estimates were
effectively equivalent to those of the full information direct estimator with a probit link
(see the rightmost panel in Table 10.9). In fact, fully conditional specification with latent
variables can be viewed as a multiple imputation analogue of full information estimation.
Even with this relatively simple model, direct maximum likelihood estimation was
incapable of generating a model fit statistic, and numerical integration precludes the use
of modification indices. Both are available when analyzing the latent imputations. Meng
and Rubin’s (1992) test statistic was significant, χ2(53) = 111.280, p < .001, and the cor-
responding fit indices were as follows: TLI = .895, CFI = .915, and RMSEA = .074. The
TD2 (or D2) statistic (Li, Meng, et al., 1991; Rubin, 1987) was previously well calibrated
to the likelihood ratio test, but it is now noticeably larger in value, χ2(53) = 136.68, p <
.001. The literature suggests that TD2 loses its good statistical properties when the frac-
tions of missing information are very high (Grund et al., 2016c; Li, Meng, et al., 1991; Liu
& Sriutaisuk, 2019), as they are here, because the latent response variables have 100%
missing data. Unless future methodological research indicates otherwise, TD2 should be
considered inappropriate for fit assessments with latent imputations.
The modification indices revealed the same sources of misfit described earlier, so no
further discussion is warranted.

10.7 SCALE SCORES AND MISSING QUESTIONNAIRE ITEMS

Researchers collecting self-­report data routinely use questionnaires with multiple items
that tap into different features of the construct being measured. When analyzing such
data, the focus is usually a scale score that sums or averages items that measure a com-
mon theme. Returning to the chronic pain data, consider a linear regression analysis
where depression, gender, and a binary severe pain indicator (0 = no, little, or moder-
ate pain, 1 = severe pain) influence psychosocial disability (a construct capturing pain’s
impact on emotional behaviors such as psychological autonomy and communication,
emotional stability, etc.). The focal analysis model is as follows:

DISABILITYi = β0 + β1 ( DEPRESSi ) + β2 ( MALEi ) + β3 ( PAIN i ) + ε i (10.35)

I used the disability and depression scales in earlier examples without mentioning that
the former is the sum of six 6-point questionnaire items and the latter is a composite
of seven 4-point rating scales (see Appendix). The item-level missingness rates range
between 1.5 and 4.7%, and the disability and depression scale scores have 9.1 and 13.45%
of their values missing, respectively (a scale score is missing if at least one of its compo-
nents is missing).
Item-level missing data can occur for a variety of reasons. Among other things,
a participant may inadvertently skip items or refuse to answer certain questions, an
examinee may fail to complete a full set of cognitive items in the allotted time, or a
researcher may employ a planned missing data design that intentionally omits a sub-
set of items from each respondent’s questionnaire form (Graham et al., 2006). Perhaps
the most common way to deal with item-level missing data is to compute a prorated
scale score that averages the observed responses. For example, if a respondent answered
four out of seven depression items, the scale score would be the average of those four
responses. The missing data literature also describes this procedure as averaging the
available items (Schafer & Graham, 2002) and person mean imputation, because it is
equivalent to imputing missing item responses with the average of each participant’s
observed scores (Huisman, 2000; Peyre et al., 2011; Roth et al., 1999; Sijtsma & van der
Ark, 2003).
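For concreteness, the sketch below computes prorated scale scores for hypothetical seven-item depression responses. Rescaling the person mean to the sum-score metric makes proration equivalent to imputing each person's missing items with the average of that person's observed items.

```python
import numpy as np

# Hypothetical seven-item depression responses (np.nan = missing)
items = np.array([
    [1, 2, 2, 1, 3, 2, 1],
    [4, 3, np.nan, 4, np.nan, 3, 4],
    [2, np.nan, 2, 2, 1, 2, 2],
])

# Person mean imputation: average the observed items, then multiply by
# the number of items to return to the sum-score metric.
prorated = np.nanmean(items, axis=1) * items.shape[1]
```

The computation is easy, which explains its popularity, but it carries the strict assumptions described next.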
Despite their widespread use, prorated scale scores have important limitations that
should deter their use. For one, the method assumes an MCAR process where missing-
ness is unrelated to the data; this puts it on par with deletion procedures that discard
incomplete data records. A second, and perhaps more problematic, feature is that pro-
ration requires all item means to be the same and all pairs of variables to have equal
correlations (Graham, 2009; Mazza et al., 2015; Schafer & Graham, 2002). This popular
procedure is prone to bias when these strict assumptions are not satisfied (Mazza et al.,
2015).
This section describes two broad approaches for analyzing composites with item-
level missing data: factored regression models and agnostic multiple imputation. The
former uses a now-­familiar factoring strategy to augment the focal analysis model with
supporting regressions for the incomplete questionnaire items, whereas the latter fills in
the missing items’ responses with no regard to the scale score structure. All things being
equal, the two approaches generally give the same answer, although factored regression
is advantageous when the number of items is very large relative to the sample size (a
situation where item-level multiple imputation routines often fail to converge). The key
feature of both methods is that they leverage strong sources of item-level correlation in
the data, thereby maximizing power and precision (Gottschall, West, & Enders, 2012).
Factored Regression Specification for Item‑Level Missing Data


Alacam, Du, Enders, and Keller (2022) describe a factored regression model for scale
scores, and I summarize their approach here. Deviating from the analysis example for
a bit, consider a linear regression model with two scale scores as predictors, X and Z.

Yi = β0 + β1 X i + β2 Zi + ε i (10.36)

Furthermore, suppose that each scale is the sum of four questionnaire items: X1 to X4
and Z1 to Z4. For now, assume that Y is a numerical variable rather than a composite. A
factored regression specification expresses the multivariate distribution of the depen-
dent variable and the regressor items as a sequence of univariate distributions, as fol-
lows:

$$
\begin{aligned}
f\left(Y, X_1, X_2, X_3, X_4, Z_1, Z_2, Z_3, Z_4\right) = {} & f\left(Y \mid X_1, X_2, X_3, X_4, Z_1, Z_2, Z_3, Z_4\right) \times \\
& f\left(X_1 \mid X_2, X_3, X_4, Z_1, Z_2, Z_3, Z_4\right) \times f\left(X_2 \mid X_3, X_4, Z_1, Z_2, Z_3, Z_4\right) \times \\
& f\left(X_3 \mid X_4, Z_1, Z_2, Z_3, Z_4\right) \times f\left(X_4 \mid Z_1, Z_2, Z_3, Z_4\right) \times f\left(Z_1 \mid Z_2, Z_3, Z_4\right) \times \\
& f\left(Z_2 \mid Z_3, Z_4\right) \times f\left(Z_3 \mid Z_4\right) \times f\left(Z_4\right)
\end{aligned} \quad (10.37)
$$

Notice that the first term following the equals sign has the same structure as the focal model
(i.e., the outcome to the left of the pipe and predictors to its right), but it features items
rather than scales.
A regression model with scale scores can be viewed as placing restrictions or con-
straints on the associations in f(Y|X1, X2, X3, X4, Z1, Z2, Z3, Z4). To illustrate, the equation
below rewrites the regression as a function of the item responses:

$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i = \beta_0 + \beta_1\left(X_{1i} + X_{2i} + X_{3i} + X_{4i}\right) + \beta_2\left(Z_{1i} + Z_{2i} + Z_{3i} + Z_{4i}\right) + \varepsilon_i \\
&= \beta_0 + \left(\beta_1 X_{1i} + \beta_1 X_{2i} + \beta_1 X_{3i} + \beta_1 X_{4i}\right) + \left(\beta_2 Z_{1i} + \beta_2 Z_{2i} + \beta_2 Z_{3i} + \beta_2 Z_{4i}\right) + \varepsilon_i
\end{aligned} \quad (10.38)
$$


As you can see, the focal model is equivalent to an item-level regression where questions
from the same scale share a common slope coefficient. As such, a scale score model can
be cast as an item-level regression analysis that imposes equality constraints on the
slopes. These constraints are a key part of the factored regression specification, and I use
the following notation with alphanumeric superscripts to convey common parameter
values:

$$f\left(Y \mid X_1, X_2, X_3, X_4, Z_1, Z_2, Z_3, Z_4\right) = f\left(Y \mid X_1^{(a)}, X_2^{(a)}, X_3^{(a)}, X_4^{(a)}, Z_1^{(b)}, Z_2^{(b)}, Z_3^{(b)}, Z_4^{(b)}\right) \quad (10.39)$$

Looking beyond the focal model, the terms in the second and third rows of Equation
10.40 correspond to a sequence of probit regression models where each questionnaire
item is regressed on some subset of the others. Item-level missing data handling can be
computationally challenging with smaller samples, because the number of parameters
that accumulates across these supporting item-level models can be very large, especially
when scales have many items. The categorical nature of the questionnaire items adds
to this challenge. Additional constraints can simplify estimation while still exploiting
strong item-level associations. For example, the factorization below illustrates between-­
scale constraints that assume each X item has the same association with all Z items:

$$
\begin{aligned}
& f\left(Y \mid X_1^{(a)}, X_2^{(a)}, X_3^{(a)}, X_4^{(a)}, Z_1^{(b)}, Z_2^{(b)}, Z_3^{(b)}, Z_4^{(b)}\right) \times \\
& f\left(X_1 \mid X_2, X_3, X_4, Z_1^{(c)}, Z_2^{(c)}, Z_3^{(c)}, Z_4^{(c)}\right) \times f\left(X_2 \mid X_3, X_4, Z_1^{(d)}, Z_2^{(d)}, Z_3^{(d)}, Z_4^{(d)}\right) \times \\
& f\left(X_3 \mid X_4, Z_1^{(e)}, Z_2^{(e)}, Z_3^{(e)}, Z_4^{(e)}\right) \times f\left(X_4 \mid Z_1^{(f)}, Z_2^{(f)}, Z_3^{(f)}, Z_4^{(f)}\right) \times \\
& f\left(Z_1 \mid Z_2, Z_3, Z_4\right) \times f\left(Z_2 \mid Z_3, Z_4\right) \times f\left(Z_3 \mid Z_4\right) \times f\left(Z_4\right)
\end{aligned} \quad (10.40)
$$

The alphanumeric superscripts show that the constraints reduce 16 coefficients (one per
each Z item) into four slopes (one for each parcel of Z items).
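The common-slope constraint also has a convenient computational interpretation: summing the constrained columns of a design matrix estimates one coefficient for the parcel of items instead of one coefficient per item. A least squares sketch with simulated data (all values invented) illustrates the idea:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
Z = rng.standard_normal((n, 4))            # four Z items
x2, x3, x4 = rng.standard_normal((3, n))   # remaining X items

# Simulate X1 so that every Z item truly shares one slope (0.3)
x1 = 0.5 * x2 + 0.3 * Z.sum(axis=1) + rng.standard_normal(n)

# Summing the constrained columns imposes the common-slope restriction:
# one free coefficient for the Z parcel instead of four item slopes.
design = np.column_stack([np.ones(n), x2, x3, x4, Z.sum(axis=1)])
coefs, *_ = np.linalg.lstsq(design, x1, rcond=None)
shared_z_slope = coefs[-1]   # single coefficient shared by Z1-Z4
```

The Bayesian and probit machinery in the factored regression models is more elaborate, but the constraints reduce the parameter count in exactly this way.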
Further constraints can be imposed if the sample size is still too small to support
estimation (e.g., the MCMC algorithm fails to converge). The factorization below illus-
trates within-­scale constraints that assume each X item has the same association with a
subset of other X items (e.g., X1 is assumed to have a common association with X2, X3,
and X4):

$$
\begin{aligned}
& f\left(Y \mid X_1^{(a)}, X_2^{(a)}, X_3^{(a)}, X_4^{(a)}, Z_1^{(b)}, Z_2^{(b)}, Z_3^{(b)}, Z_4^{(b)}\right) \times \\
& f\left(X_1 \mid X_2^{(c)}, X_3^{(c)}, X_4^{(c)}, Z_1^{(d)}, Z_2^{(d)}, Z_3^{(d)}, Z_4^{(d)}\right) \times f\left(X_2 \mid X_3^{(e)}, X_4^{(e)}, Z_1^{(f)}, Z_2^{(f)}, Z_3^{(f)}, Z_4^{(f)}\right) \times \\
& f\left(X_3 \mid X_4, Z_1^{(g)}, Z_2^{(g)}, Z_3^{(g)}, Z_4^{(g)}\right) \times f\left(X_4 \mid Z_1^{(h)}, Z_2^{(h)}, Z_3^{(h)}, Z_4^{(h)}\right) \times \\
& f\left(Z_1 \mid Z_2, Z_3, Z_4\right) \times f\left(Z_2 \mid Z_3, Z_4\right) \times f\left(Z_3 \mid Z_4\right) \times f\left(Z_4\right)
\end{aligned} \quad (10.41)
$$

Similar within-scale constraints can be placed on the Z items. It is important to emphasize that within-scale constraints impose restrictions on the variance–covariance matrix
that may be at odds with the internal structure of the scale. However, limited simulation
evidence suggests this type of misspecification may have relatively little impact on the
focal model parameters (Alacam et al., 2022).
Turning to the dependent variable, suppose that Y is the sum of three question-
naire items, Y1 to Y3. For simplicity, assume that X and Z are numerical regressors rather
than composite variables. A key idea behind the dependent variable’s specification is
that there are four pieces of information, only three of which are free to vary; that is,
the scale score is fully determined from the three item responses, and an item response
is determined given the scale score and two other items. Working with the scale score
and all but one of the items sets up a factorization where the scale score functions as a
dependent variable that borrows information from its items, as follows:

$$f\left(Y_1, Y_2, Y, X, Z\right) = f\left(Y_1 \mid Y_2, Y\right) \times f\left(Y_2 \mid Y\right) \times f\left(Y \mid X, Z\right) \times f\left(X \mid Z\right) \times f\left(Z\right) \quad (10.42)$$

The first two terms following the equals sign are supporting models that link the com-
posite to its items (e.g., a pair of probit regressions), the third term corresponds to the
focal analysis model (e.g., Equation 10.36), and the last two terms are supporting regres-
sor models (which could also look like Equation 10.37 with composite predictors).
Three points are worth highlighting. First, neither f(Y1|Y2, Y) nor f(Y2|Y) include X
or Z, because I assume that items are conditionally independent of the predictors after
controlling for the scale score. This is tantamount to assuming that Y items do not cross-
load on a factor with X or Z items. Second, the supporting item-level regressions are
designed to leverage collinearity between the dependent variable and its components,
and the factorization essentially applies the idea of treating questionnaire items as aux-
iliary variables (Eekhout et al., 2015b; Mazza et al., 2015). Third, the factorization neces-
sarily leaves out one item to avoid perfect linear dependencies. In practice, virtually any
combination of items will convey roughly the same amount of information to the scale
score, so excluding the item with the highest missing data rate is a good strategy.
The factored regression specification readily accommodates auxiliary variables
with additional terms on the left side of the factorization prior to the analysis variables.
To illustrate, consider a scenario where Y is measured with three items (Y1 to Y3), and X
and Z are both measured by a pair of items (X1 and X2, and Z1 and Z2). The factorization
for a model with a single auxiliary variable and between-­scale constraints is as follows:

$$
\begin{aligned}
& f\left(A \mid Y, X_1^{(a)}, X_2^{(a)}, Z_1^{(b)}, Z_2^{(b)}\right) \times f\left(Y_1 \mid Y_2, Y\right) \times f\left(Y_2 \mid Y\right) \times \\
& f\left(Y \mid X_1^{(c)}, X_2^{(c)}, Z_1^{(d)}, Z_2^{(d)}\right) \times f\left(X_1 \mid X_2, Z_1^{(e)}, Z_2^{(e)}\right) \times \\
& f\left(X_2 \mid Z_1^{(f)}, Z_2^{(f)}\right) \times f\left(Z_1 \mid Z_2\right) \times f\left(Z_2\right)
\end{aligned} \quad (10.43)
$$

Figure 10.10 shows the factorization as a path diagram, with the dashed rectangles
enclosing the X and Z items denoting composite scores. The factorization reduces model
complexity by imposing two innocuous constraints on the auxiliary variable’s regres-
sion model. First, the model features the Y scale score as a predictor but not its items
(i.e., the partial regression slopes for Y1 and Y2 are fixed at 0). Second, equality con-
straints are placed on coefficients linking the X and Z items to the auxiliary variable.
Collectively, these constraints transmit the auxiliary variable’s information to the scale
scores rather than the individual items.
The factored regression specification places the emphasis on the scale scores, and
the item-level regressions are simply a device for accessing sources of strong correla-
tion in the data. The specification can be implemented with maximum likelihood or
Bayesian estimation, and the latter could also generate model-based multiple imputa-
tions. The imputed data would include the dependent variable scale score, all but one
of the dependent variable’s items, and all items from the regressor scales (but not the
scale scores themselves, which are obtained by summing the filled-­in item responses).
The absence of one Y item is not a problem, as you would simply analyze the imputed
scale scores without regard to the items. Agnostic multiple imputation approaches like
the joint model and fully conditional specification are ideally suited for filling in item
responses without imposing a scale score structure, and I illustrate that procedure later
in this section.
444 Applied Missing Data Analysis


FIGURE 10.10. Path diagram of a factored regression model with a single auxiliary variable
and between-­scale constraints. The dashed rectangles enclosing the X and Z items represent the
scale scores.

Bayesian Estimation
Returning to the chronic pain data and the analysis model in Equation 10.35, the depres-
sion measure is the sum of seven 4-point rating scales, and psychosocial disability is a
composite of six 6-point questionnaire items. I use Bayesian estimation to apply the
factored regression approach to the linear regression model, and I also include the per-
ceived control over pain and pain interference with daily life scales as auxiliary vari-
ables, as both are correlates of the analysis variables.
Although the number of questionnaire items is not very large relative to the sam-
ple size, I reduced model complexity by excluding the psychosocial disability items
from the auxiliary variable models and imposing equality constraints on the associa-
tions between the auxiliary variables and the depression items (e.g., a single coefficient
described the regression of pain interference on the seven depression items). The speci-
fications mimic those in Equation 10.43 and Figure 10.10. The factored regression model
for the analysis is shown below:

f(INTERFERE | CONTROL, DISABILITY, DEP1^(a), …, DEP7^(a), PAIN, MALE) ×
f(CONTROL | DISABILITY, DEP1^(b), …, DEP7^(b), PAIN, MALE) ×
f(DIS1* | DIS2, …, DIS5, DISABILITY) × … × f(DIS5* | DISABILITY) ×     (10.44)
f(DISABILITY | DEP1^(c), …, DEP7^(c), PAIN, MALE) ×
f(DEP1* | DEP2, …, DEP7, PAIN, MALE) × … × f(DEP7* | PAIN, MALE) ×
f(PAIN* | MALE) × f(MALE*)
Special Topics and Applications 445

The first two terms are the auxiliary variable models. For example, the term linking
perceived control over pain to the analysis variables translates into the linear regression
model below:

CONTROLi = γ02 + γ12(DISABILITYi) + γ22(DEP1i + DEP2i + … + DEP7i)     (10.45)
          + γ32(PAINi) + γ42(MALEi) + r2i

Notice that the equation features an equality constraint where the seven depression
items share a common slope coefficient (i.e., the auxiliary variable links to the depres-
sion scale score rather than individual items). The next group of terms link all but the
last psychosocial disability item to the corresponding scale score. For example, the
f(DIS1* | DIS2, …, DIS5, DISABILITY) term translates into a probit regression model for the
first disability item’s underlying latent response variable.

DIS1*i = γ03 + γ13(DIS2i) + … + γ43(DIS5i) + γ53(DISABILITYi) + r3i     (10.46)

As always, the residual variance is fixed at one to establish the metric, and the model
also requires a set of threshold parameters, one of which is fixed at zero.
The focal model, which appears in the third row from the bottom, is an item-level
linear regression with equality constraints on the regression slopes as follows:

DISABILITYi = β0 + β1 ( DEP1i + … + DEP7i ) + β2 ( MALEi ) + β3 ( PAIN i ) + ε i (10.47)

The practical impact of these constraints is that β1 reflects the association between the
depression scale score and the dependent variable. The supporting models for the pre-
dictors appear in the bottom two rows of Equation 10.44, all of which are probit regres-
sions with latent response scores as dependent variable (including the binary severe
pain indicator). Finally, the last term is the marginal distribution of the gender dummy
code. I ignore this term, because this variable is complete and does not require a distri-
bution. Keller (2022) describes a Metropolis–­Hastings step that streamlines the intro-
duction of these equality constraints, such that a researcher only needs to specify the
scale score structure in a format that mimics Equation 10.47.
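The equality constraint amounts to regressing the outcome on the item sum: collapsing the seven items into a single summed column forces a common slope. The following minimal simulated-data sketch in NumPy illustrates this (the data and parameter values are hypothetical, not the chronic pain data, and this is not the book's software implementation):

```python
import numpy as np

# Simulate seven 4-point items (scored 0-3) plus two binary covariates
rng = np.random.default_rng(1)
n = 200
items = rng.integers(0, 4, size=(n, 7)).astype(float)
male = rng.integers(0, 2, n).astype(float)
pain = rng.integers(0, 2, n).astype(float)
y = 17 + 0.3 * items.sum(axis=1) - 0.8 * male + 1.8 * pain + rng.normal(0, 4, n)

# Equality-constrained fit: summing the items into one predictor column
# forces all seven item slopes to share a single coefficient (beta_1),
# which is interpreted as the scale-score slope
X = np.column_stack([np.ones(n), items.sum(axis=1), male, pain])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
```

An unconstrained item-level regression would instead enter the seven item columns separately, estimating a distinct slope for each.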
After updating several sets of regression model parameters, MCMC samples new
imputations from posterior predictive distributions based on the model parameters. Fol-
lowing ideas established in Chapters 5 and 6, the missing values follow complex, mul-
tipart functions that depend on every model in which a variable appears. For example,
the conditional distribution of the psychosocial disability scale scores given everything
else is the product of eight univariate distributions: two induced by the auxiliary vari-
able models, five from the item-level regressions, and one from the focal analysis. Simi-
larly, the distributions of the missing depression items depend on a latent response vari-
able distribution and other models where the discrete response appears as a predictor.
Importantly, the depression scale score itself is never the target of imputation; rather,
item responses are chosen that make sense when summed together with other items.
I began with an exploratory MCMC chain and used trace plots and potential scale
reduction factor diagnostics (Gelman & Rubin, 1992) to evaluate convergence. This step

TABLE 10.11. Posterior Summary from the Scale Score Analysis


Variables Mdn SD LCL UCL
β0 17.61 0.67 16.27 18.88
β1 (DEPRESS) 0.27 0.04 0.19 0.35
β2 (MALE) –0.80 0.53 –1.83 0.24
β3 (PAIN) 1.77 0.60 0.63 2.96
σε2 16.93 1.51 14.32 20.24
R2   .20 .04   .12   .28

Note. LCL, lower credible limit; UCL, upper credible limit.

is especially important with categorical questionnaire items, because the probit model’s
threshold parameters tend to converge slowly and require long burn-in periods (Cowles,
1996; Nandram & Chen, 1996). Based on this diagnostic run, I specified an MCMC pro-
cess with 10,000 iterations following an initial burn-in period of 20,000 iterations. Table
10.11 summarizes the posterior distributions of the regression model parameters. In the
interest of space, I omit the auxiliary variable and item-level regressions from the table,
because they are not the substantive focus. Importantly, the substantive interpretations
make no reference to the supporting item-level regressions and match those of a scale
score analysis. For example, the β1 coefficient indicates that a one-unit change in the
depression scale score predicts a 0.27 increase in the psychosocial disability scale score,
controlling for gender and pain (Mdnβ1 = 0.27, SD = 0.04).
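The potential scale reduction factor mentioned above compares between- and within-chain variability, with values near 1.0 signaling convergence. A minimal sketch for a single parameter (simulated chains, not the actual MCMC output):

```python
import numpy as np

def psrf(chains):
    """Gelman-Rubin (1992) potential scale reduction factor for one
    parameter. chains: array of shape (n_chains, n_iterations)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    w = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    b = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_plus = (n - 1) / n * w + b / n          # pooled variance estimate
    return float(np.sqrt(var_plus / w))

rng = np.random.default_rng(0)
mixed = rng.normal(0.27, 0.04, size=(4, 1000))          # four chains, same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [0.5]])  # one chain off target
```

For the well-mixed chains the statistic hovers near 1.0; shifting one chain's mean inflates the between-chain variance and pushes the statistic well above the common 1.05-1.10 cutoffs.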

Maximum Likelihood
Maximum likelihood offers a frequentist alternative for deploying the factored regres-
sion specification. As explained in Chapter 3, a mixture of categorical and continuous
variables requires an iterative procedure known as numerical integration
that fills in the missing parts of the data in an imputation-­esque fashion. Applying the
factorization in Equation 10.44 produced the estimates in the top panel of Table 10.12.
Perhaps not surprisingly, the point estimates and standard errors were numerically
equivalent to the posterior medians and standard deviations. This has been a repetitive
theme throughout the book.

Multiple Imputation
Agnostic multiple imputation is well suited for item-level missing data, because it fills in
the missing values without imposing a structure or pattern on the means and associa-
tions (although it could, if desired). Several studies have investigated the use of multiple
imputation with item-level missing data, and they are unequivocally supportive (Eek-
hout et al., 2014; Finch, 2008; Gottschall et al., 2012; Peyre et al., 2011; van Buuren,
2010).
An important practical issue is whether to impute the scale scores themselves or
the individual questionnaire items (the imputation model cannot include both, because
Special Topics and Applications 447

TABLE 10.12. Frequentist Results from the Scale Score Analysis


Parameter Est. SE z or t df p FMI
Maximum likelihood estimation
β0 17.62 0.66 26.60 ∞ < .001 —
β1 (DEPRESS) 0.27 0.04 6.33 ∞ < .001 —
β2 (MALE) –0.79 0.53 –1.50 ∞   .13 —
β3 (PAIN) 1.74 0.59 2.96 ∞    .003 —
σε2 16.62 — — — — —
R2   .20 — — — — —

Multiple imputation
β0 17.59 0.67 26.33 260.16 < .001 .04
β1 (DEPRESS) 0.27 0.04 6.35 256.82 < .001 .05
β2 (MALE) –0.80 0.53 –1.51 259.66   .13 .04
β3 (PAIN) 1.75 0.59 2.97 248.28    .003 .07
σε2 16.86 — — — — —
R2   .20 — — — — —

Note. FMI, fraction of missing information.

of linear dependencies). I refer to these options as scale-level imputation and item-level
imputation, respectively. To implement scale-level imputation, you compute scale
scores prior to imputation, treating the composite as missing if any one of its constitu-
ent items is missing. The scale-level imputation model includes the composite scores
and other analysis variables. This approach is attractive, because it invokes a simple
imputation model without the categorical questionnaire items. The downside of this
parsimony is that imputation fails to leverage the strong correlations among the items
and instead relies on weaker between-­scale correlations to generate predictions. In
contrast, item-level imputation targets the individual questionnaire items, and com-
posites are computed from the filled-­in data prior to analysis. With few exceptions,
item-level imputation usually improves precision, because it taps into a larger reservoir
of shared variation. However, it achieves this advantage using a complex model with
many parameters (including a large set of potentially difficult-­to-­estimate threshold
parameters). With the exception of longitudinal studies where entire questionnaires are
missing because participants skip a data collection wave (Vera & Enders, 2021), the
precision advantage is usually so dramatic that there is no reason to consider
scale-level imputation (Eekhout et al., 2015a, 2015b; Gottschall et al., 2012; Savalei &
Rhemtulla, 2017).
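The two options can be sketched with a toy data frame (hypothetical item responses; a crude mean fill stands in for the MCMC-based imputations an actual analysis would use):

```python
import numpy as np
import pandas as pd

# Hypothetical item responses with missing values coded as NaN
items = pd.DataFrame({
    "dep1": [3, 1, np.nan, 2],
    "dep2": [2, np.nan, 1, 2],
    "dep3": [3, 1, 2, 0],
})

# Scale-level imputation: compute the composite first, treating it as
# missing whenever any constituent item is missing (skipna=False)
scale_score = items.sum(axis=1, skipna=False)

# Item-level imputation: fill in the items first (placeholder mean fill
# here), then sum the completed items to form the composite
item_level_score = items.fillna(items.mean()).sum(axis=1)
```

In the scale-level version, rows 2 and 3 lose their composites entirely, which is the information loss the item-level approach avoids.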
The joint modeling framework invokes a multivariate normal distribution for the 16
variables (13 questionnaire items, gender, two auxiliary variables), and the correspond-
ing imputation model parameters are a mean vector and variance–­covariance matrix.
The 16-dimensional normal distribution for the focal variables and auxiliary variables
is as follows:

(DEP1*i, …, DEP7*i, DIS1*i, …, DIS6*i, MALE*i, INTERFEREi, CONTROLi)′ ~ N16(μ, Σ)     (10.48)
By now you are familiar with the fact that latent response variables replace discrete vari-
ables, and the corresponding diagonal elements in the variance–­covariance matrix are
set to one to establish a metric. The model also incorporates three threshold parameters
per depression item, five thresholds for each disability item, and a single fixed threshold
for the male dummy code.
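Under the joint model, a case's missing values are drawn from the conditional normal distribution implied by partitioning the mean vector and covariance matrix into observed and missing parts. A minimal NumPy sketch of that standard partitioned-normal result (a hypothetical bivariate example, not the 16-variable model):

```python
import numpy as np

def conditional_normal(mu, sigma, obs_idx, obs_vals):
    """Partitioned-normal result: mean and covariance of the missing
    coordinates given the observed ones under a joint normal model."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    obs_idx = np.asarray(obs_idx)
    mis_idx = np.setdiff1d(np.arange(len(mu)), obs_idx)
    s_oo = sigma[np.ix_(obs_idx, obs_idx)]
    s_mo = sigma[np.ix_(mis_idx, obs_idx)]
    w = s_mo @ np.linalg.inv(s_oo)                       # regression weights
    cond_mu = mu[mis_idx] + w @ (np.asarray(obs_vals, float) - mu[obs_idx])
    cond_sigma = sigma[np.ix_(mis_idx, mis_idx)] - w @ s_mo.T
    return cond_mu, cond_sigma

# Impute variable 0 given variable 1 = 1.0, correlation .5
mu_c, sigma_c = conditional_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], [1], [1.0])
```

An imputation is then a draw from this conditional distribution (predicted value plus normally distributed noise), exactly the "imputation = predicted value + noise" logic used throughout the chapter.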
A practical limitation of item-level imputation is that the number of parameters
can be quite large relative to the sample size. A model-based imputation strategy that
imposes a factor structure on the questionnaire items can dramatically reduce the num-
ber of parameters while still leveraging item-level covariation. Importantly, the model-
based imputations assume the two-­factor model is correct, so any global fit assessments
will be overly favorable. Nevertheless, this approach warrants consideration when the
number of indicators is very large, because it should converge more reliably than an
unrestricted imputation model.
Fully conditional specification uses a sequence of regression models to impute vari-
ables in a round-robin fashion. I use a fully latent version of the procedure that models
associations among continuous and latent response variables. This approach invokes a
linear regression for the auxiliary variables and probit regressions for each categorical
variable. To illustrate, the probit imputation model for the first depression item is as
follows:

DEP1*i = γ0 + γ1(DEP2*i) + … + γ6(DEP7*i) + γ7(DIS1*i) + … + γ12(DIS6*i)     (10.49)
       + γ13(INTERFEREi) + γ14(CONTROLi) + γ15(MALEi) + ri

As always, the latent response variable’s residual variance is fixed at one to establish its
scale. The model also requires three threshold parameters (one of which is fixed) that
divide the underlying latent distribution into four discrete segments.
After updating all model parameters, MCMC samples new imputations from pos-
terior predictive distributions based on the model parameters. For example, numerical
variables are sampled from a normal distribution with center and spread equal to a
predicted score and residual variance, respectively (i.e., imputation = predicted value +
noise). The procedure creates latent response imputations for the entire sample (recall
that latent scores are restricted to a particular region of the distribution if the discrete
response is observed, and they are unrestricted otherwise), and the location of these

continuous imputes relative to the estimated threshold parameters induces discrete val-
ues for each filled-­in data set.
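The threshold mechanism can be sketched in a few lines (hypothetical thresholds and predicted values; simple rejection sampling stands in for the more efficient truncated-normal draws production software would use):

```python
import numpy as np

rng = np.random.default_rng(2)
thresholds = np.array([0.0, 0.8, 1.7])  # hypothetical probit thresholds (first fixed at 0)

def sample_latent(pred, observed_cat=None):
    """Draw a latent response score (unit residual variance). When the
    discrete response is observed, the draw is restricted to the region of
    the normal distribution implied by that category (rejection sampling)."""
    while True:
        z = rng.normal(pred, 1.0)
        cat = int(np.searchsorted(thresholds, z))
        if observed_cat is None or cat == observed_cat:
            return z

# Missing response: unrestricted draw, then the thresholds induce the
# discrete imputation (categories 0-3 for a 4-point item)
z_miss = sample_latent(pred=0.5)
imputed_cat = int(np.searchsorted(thresholds, z_miss))

# Observed response: the latent draw is confined to that category's region
z_obs = sample_latent(pred=0.5, observed_cat=2)
```

The location of the continuous latent score relative to the thresholds is what converts each filled-in latent value into a discrete item response.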
Following earlier examples, I created M = 100 filled-­in data sets. Prior to creating
imputations, I performed an exploratory analysis and used trace plots and potential
scale reduction factor diagnostics (Gelman & Rubin, 1992) to evaluate convergence. As
mentioned previously, this step is especially important when imputing questionnaire
items, because the many threshold parameters tend to converge slowly and require long
burn-in periods. Based on this diagnostic run, I specified 100 parallel imputation chains
with 20,000 iterations each, and I saved the data at the final iteration of each chain. After
creating the multiple imputations, I computed the scale scores from the filled-­in item
responses and fit the regression model from Equation 10.35 to the data. Finally, I used
Rubin’s (1987) pooling rules to combine the estimates and standard errors and applied
Barnard and Rubin’s (1999) degrees of freedom expression to the significance tests. The
bottom panel of Table 10.12 summarizes the multiple imputation results, which were
identical to those of maximum likelihood (and Bayesian estimation).
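Rubin's pooling rules and the Barnard-Rubin degrees of freedom can be sketched as follows (a minimal implementation with hypothetical per-imputation estimates; the between-to-total variance ratio serves as an approximation to the fraction of missing information):

```python
import numpy as np

def pool(estimates, variances, df_com):
    """Rubin's (1987) pooling rules with Barnard & Rubin's (1999) adjusted
    degrees of freedom. estimates/variances: per-imputation point estimates
    and squared standard errors; df_com: complete-data degrees of freedom."""
    q = np.asarray(estimates, float)
    u = np.asarray(variances, float)
    m = len(q)
    qbar = q.mean()                          # pooled point estimate
    ubar = u.mean()                          # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    t = ubar + (1 + 1 / m) * b               # total sampling variance
    lam = (1 + 1 / m) * b / t                # proportion of variance due to missingness
    df_old = (m - 1) / lam ** 2
    df_obs = (df_com + 1) / (df_com + 3) * df_com * (1 - lam)
    df = 1 / (1 / df_old + 1 / df_obs)       # Barnard-Rubin degrees of freedom
    return qbar, np.sqrt(t), df, lam

# Hypothetical estimates of a slope from three imputed data sets
est, se, df, lam = pool([0.26, 0.27, 0.28], [0.0016, 0.0016, 0.0016], 271)
```

The pooled standard error always exceeds the average within-imputation standard error, reflecting the extra uncertainty from the missing data, and the adjusted degrees of freedom can never exceed their complete-data value.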

10.8 INTERACTIONS WITH SCALES

The factored regression specification for scale scores readily extends to models with
interaction effects. Continuing with the chronic pain data, consider a moderated regres-
sion where the influence of depression on psychosocial disability differs by gender.

DISABILITYi = β0 + β1(DEPRESSi − μ1) + β2(MALEi)     (10.50)
            + β3(DEPRESSi − μ1)(MALEi) + β4(PAINi) + εi

I used this model in earlier chapters to illustrate missing data handling for interaction
effects, and this section shows how to accommodate the scale structure (depression is
measured with seven 4-point rating scales, and the dependent variable is measured with
six 6-point questionnaire items).

Factored Regression Specification


Keller (2022) described a factored regression model specification that accommodates
interaction effects with composite scores or latent factors. I begin with the former,
because it is a straightforward extension of the scale score specification from the previ-
ous section. To illustrate the procedure, I use the following moderated regression
model:

Yi = β0 + β1 X i + β2 Mi + β3 X i Mi + ε i (10.51)

where X is a composite (e.g., the depression scale score), and M is a numeric variable
(e.g., the male dummy code). Deviating from the analysis example, suppose that X is
the sum of three questionnaire items: X1 to X3. The dependent variable could also be a
composite, but for now I treat it as a numerical variable.

Extending ideas from the previous section, a moderated regression with a scale score
can be viewed as an item-level analysis that imposes equality constraints on regression
slopes. The equation below rewrites Equation 10.51 as a function of the item responses:

Yi = β0 + β1Xi + β2Mi + β3XiMi + εi
   = β0 + β1(X1i + X2i + X3i) + β2Mi + β3(X1i + X2i + X3i)Mi + εi     (10.52)
   = β0 + (β1X1i + β1X2i + β1X3i) + β2Mi + (β3X1iMi + β3X2iMi + β3X3iMi) + εi

As you can see, the focal model is equivalent to an item-level regression where questions
from the same scale share a common slope coefficient, as do the collection of product
terms involving the item responses and the moderator.
The supporting item-level regression models follow the procedure from the previ-
ous section and do not change with the introduction of an interaction. Using generic
notation, the factored regression specification for this example involves the product of
five univariate distributions, each of which corresponds to a regression model.

f(Y | X1^(a), X2^(a), X3^(a), M, X1M^(b), X2M^(b), X3M^(b)) ×     (10.53)
f(X1 | X2, X3, M) × f(X2 | X3, M) × f(X3 | M) × f(M)
As before, the alphanumerical superscripts denote constrained associations.
The procedure readily extends to applications where both X and M are scales. To
illustrate, suppose that M is the sum of two questionnaire items, M1 and M2. To illustrate
this model’s constraints, the equation below rewrites the focal analysis as a function of
the item responses:
Yi = β0 + β1Xi + β2Mi + β3XiMi + εi
   = β0 + β1(X1i + X2i + X3i) + β2(M1i + M2i) + β3(X1i + X2i + X3i)(M1i + M2i) + εi     (10.54)
   = β0 + (β1X1i + β1X2i + β1X3i) + (β2M1i + β2M2i)
     + (β3X1iM1i + β3X2iM1i + β3X3iM1i + β3X1iM2i + β3X2iM2i + β3X3iM2i) + εi
As you can see, the focal model is equivalent to an item-level regression where ques-
tions from the same scale share common slope coefficients, as do the collection of all
possible product terms involving the two sets of scale items. Although this model seems
extraordinarily cumbersome to specify, Keller (2022) uses a Metropolis–­Hastings step
to streamline estimation, such that a researcher only needs to specify the scale score
structure in the following format:

Yi = β0 + β1(X1i + X2i + X3i) + β2(M1i + M2i)     (10.55)
   + β3(X1i + X2i + X3i)(M1i + M2i) + εi

Finally, the factored regression specification for a composite dependent variable is
also unchanged with an interaction. To illustrate, reconsider the scenario where X is a
composite (e.g., the depression scale score) and M is a numeric variable (e.g., the male
dummy code), and assume that Y is the sum of three questionnaire items: Y1 to Y3. Fol-
lowing the scale score specification from Section 10.6, the factored regression includes

additional terms on the left that link the scale score to all but one of its component
items, as follows:

f(Y1 | Y2, Y) × f(Y2 | Y) × f(Y | X1^(a), X2^(a), X3^(a), M, X1M^(b), X2M^(b), X3M^(b)) ×     (10.56)
f(X1 | X2, X3, M) × f(X2 | X3, M) × f(X3 | M) × f(M)
Auxiliary variables enter the factorization in the same way as Equation 10.43.

Bayesian Estimation and Model‑Based Multiple Imputation


I used Bayesian estimation and model-based multiple imputation to illustrate a factored
regression specification for the analysis model from Equation 10.50. To reiterate, the
dependent variable is measured with six 6-point questionnaire items, and depression
is measured with seven 4-point rating scales. Consistent with earlier applications of
this model, I centered the depression scale at its grand mean to facilitate interpretation
of the lower-order coefficients (e.g., β0 is the expected psychosocial disability score for
females, β2 is the gender difference at depression average). As mentioned previously,
the method described in Keller (2022) streamlines specification, such that a researcher
simply needs to specify the scale score structure of the model in a format that mimics
the following equation, and the software constructs the ancillary item-level regressions
(Keller & Enders, 2021):

DISABILITYi = β0 + β1(DEP1i + … + DEP7i) + β2(MALEi)     (10.57)
            + β3(DEP1i + … + DEP7i)(MALEi) + β4(PAINi) + εi

At this time, it isn’t possible to estimate this model with maximum likelihood, and
Bayesian estimation and model-based multiple imputation require specialized software
that works with a series of univariate likelihoods instead of a single multivariate distri-
bution (Keller & Enders, 2021).
After examining trace plots and potential scale reduction factors (Gelman &
Rubin, 1992), I specified an MCMC process with 10,000 iterations following an initial
20,000-iteration burn-in period. The top panel of Table 10.13 summarizes the posterior
distributions of the parameters. Recall that lower-order terms are conditional effects
that depend on scaling; Mdnβ1 = 0.38 (SD = 0.06) is the effect of depression on psycho-
social disability for female participants, and Mdnβ2 = –0.80 (SD = 0.54) is the gender
difference at the depression mean. The interaction effect captures the slope difference
for males. The negative coefficient (Mdnβ3 = –0.25, SD = 0.08) indicates that the male
depression slope was approximately one-­fourth of a point lower than the female slope
(i.e., the male slope is Mdnβ1 + Mdnβ3 = 0.38 – 0.25 = 0.13). The simple slopes for males
and females resemble those in Figure 5.5.
The same analysis that generates Bayesian summaries of the model parameters
can also generate model-based multiple imputations for a frequentist analysis. To
illustrate the process, I also created M = 100 filled-­in data sets by saving the imputa-
tions from the final iteration of 100 parallel MCMC chains. After creating the multiple
imputations, I computed and centered the depression scale score from the filled-­in

TABLE 10.13. Posterior Summary from the Moderated Regression with Scales
Parameter Mdn SD LCL UCL
Scale score analysis
β0 21.61 0.39 20.85 22.37
β1 (DEPRESS) 0.38 0.06 0.27 0.49
β2 (MALE) –0.80 0.54 –1.88 0.27
β3 (DEPRESS)(MALE) –0.25 0.08 –0.41 –0.09
β4 (PAIN) 1.86 0.58 0.72 3.01
σε2 16.43 1.46 13.92 19.57
R2   .23 .04   .15   .31

Latent scale analysis
β0 0 — — —
β1 (DEPRESS*) 0.30 0.07 0.19 0.45
β2 (MALE) –0.04 0.09 –0.22 0.14
β3 (DEPRESS*)(MALE) –0.16 0.07 –0.31 –0.04
β4 (PAIN) — — — —
σε2 0.34 0.10 0.18 0.56
R2   .28 .06   .17   .39

Note. LCL, lower credible limit; UCL, upper credible limit.

item responses (imputation fills in the disability scale score) and fit the regression
model from Equation 10.50 to the data. Finally, I used Rubin’s (1987) pooling rules
to combine the estimates and standard errors and applied Barnard and Rubin’s (1999)
degrees of freedom expression to the significance tests. The top panel of Table 10.14
summarizes the multiple imputation estimates, which were numerically equivalent to
the Bayesian results.

Factored Regression Specification for Latent Variable Interactions
Up to this point I’ve treated multiple-­item scales as composite scores, but the factored
regression specification readily accommodates latent variables and even latent variable
interactions (Keller, 2022). In some respects, the latent variable approach is simpler,
because it replaces a potentially large number of item-level regression models with a
parsimonious measurement model. I illustrate the approach with a latent-­by-­manifest
variable interaction and point interested readers to Keller for further details.
To illustrate the procedure, I use a single latent factor ηX in lieu of the X scale score.
The analysis model is the following linear regression:

Yi = β0 + β1(ηXi) + β2(Mi) + β3(ηXi)(Mi) + εi     (10.58)

where ηXi is the latent factor score for person i. The interaction term now reflects the

TABLE 10.14. Multiple Imputation Estimates from the Moderated Regression with Scales
Parameter Est. SE t df p FMI
Scale score analysis
β0 21.60 0.35 61.00 243.49 < .001 .09
β1 (DEPRESS) 0.38 0.06 6.71 236.90 < .001 .10
β2 (MALE) –0.77 0.53 –1.47 257.09   .14 .05
β3 (DEPRESS)(MALE) –0.25 0.08 –3.06 252.31   .002 .06
β4 (PAIN) 1.85 0.59 3.17 241.87   .002 .09
σε2 16.40 — — — — .06
R2   .23 — — — — .06

Latent scale analysis
β0 0.04 0.09 0.41 124.90   .69 .51
β1 (DEPRESS*) 0.65 0.15 4.36 56.15 < .001 .78
β2 (MALE) –0.07 0.13 –0.53 166.66   .60 .36
β3 (DEPRESS*)(MALE) –0.35 0.15 –2.27 120.36   .03 .53
β4 (PAIN) — — — — — —
σε2 0.73 — — — — —
R2   .27 — — — — —

Note. FMI, fraction of missing information.

product of a latent and manifest variable. A measurement model linking the factor to the
items replaces the item-level regressions from the scale score model. The factor model
for this example consists of three probit regressions with latent response variables as
outcomes.
X1*i = γ01 + γ11(ηXi) + r1i
X2*i = γ02 + γ12(ηXi) + r2i     (10.59)
X3*i = γ03 + γ13(ηXi) + r3i
Applying ideas from Chapter 6, the residual variances are fixed at one to establish the
latent response variables’ metrics, and each regression additionally requires a set of
threshold parameters, one of which is fixed to zero. Like other latent variables we’ve
encountered, ηX doesn’t have a metric and requires similar identification constraints.
I scale the latent variable by fixing the first factor loading (the γ11 coefficient) equal to
1 (i.e., ηX’s variance is equated to “true score” variation in X1*), and I also set the factor
mean to 0. Finally, note that I use γ’s and r’s for consistency with earlier material, but
it is more common to see measurement intercepts and loadings written as υ and λ in
structural equation modeling applications.
The factored regression specification for the moderated regression analysis now
involves the dependent variable, questionnaire items, and latent factor. Applying the
probability chain rule gives the following factorization, which isn’t in its final form:

f(Y, ηX, M, X1, X2, X3) = f(Y | ηX, M, X1, X2, X3) × f(X1 | X2, X3, ηX, M) ×     (10.60)
f(X2 | X3, ηX, M) × f(X3 | ηX, M) × f(ηX | M) × f(M)
Returning to the measurement model, the latent factor is the sole determinant of the
item responses and, by extension, their means and associations. The factor model stipu-
lates that items correlate with each other, because they share a common predictor (i.e.,
any two items link indirectly via their pathways to the latent factor), and their correla-
tions with other variables like Y and M are also indirect via the latent variable. Said
differently, each item is conditionally independent of all other variables after controlling
for the factor.
The conditional independence assumption simplifies the factorization, as any item
on the right side of a vertical pipe with ηX vanishes. The final model specification is

f(Y | ηX, M, ηX × M) × f(X1 | ηX) × f(X2 | ηX) × f(X3 | ηX) × f(ηX | M) × f(M)     (10.61)

where the first term corresponds to the focal analysis, the next three terms are the
measurement model, the penultimate term links the latent factor to the moderator, and
the final term is the marginal (overall) distribution of the moderator. The final two terms
translate into a pair of linear regression models, one of which is empty.

ηXi = γ04 + γ14(Mi − μM) + r4i     (10.62)
Mi = μM + r5i

Centering the moderator in the top equation defines the intercept coefficient γ04 as the
latent factor mean, which I fix to 0 to identify the model and center ηX.
The MCMC algorithm for this analysis treats the factor scores as missing data to
be estimated, much like the item-level latent response variables. Applying ideas from
earlier chapters, the distribution of these missing values is a multipart function that
depends on every model in which ηX appears. In this case, the conditional distribution
of the factor scores is proportional to the product of five univariate distributions, each of
which corresponds to a normal curve induced by a linear regression model.

f(ηX | Y, M, X1, X2, X3) ∝ f(Y | ηX, M, ηX × M) ×     (10.63)
f(X1 | ηX) × f(X2 | ηX) × f(X3 | ηX) × f(ηX | M)

Deriving the conditional distribution involves multiplying five normal curve equations
and performing algebra that combines the component functions into a single distribu-
tion for ηX. In practice, the Metropolis–Hastings algorithm does the heavy lifting of
sampling latent imputations from this complicated distribution.
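A random-walk Metropolis-Hastings update for a single factor score can be sketched as follows, with the conditional log-density assembled as the sum of the five univariate normal log-kernels (all parameter and data values below are hypothetical, and this is an illustrative sketch rather than the book's software implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical parameter values for the sketch
params = {
    "focal": (0.0, 0.5, 0.1, -0.2, 1.0),    # b0, b1, b2, b3, residual sd of Y
    "loadings": [(0.0, 1.0), (0.1, 0.8), (-0.1, 1.2)],  # (intercept, loading) per item
    "structural": (0.0, 0.3, 1.0),          # g04, g14, residual sd of eta_X
}

def log_density(eta, y, m, x_items):
    """Sum of the univariate normal log-kernels in which eta_X appears:
    the focal model, three measurement equations (unit residual variance),
    and the structural model f(eta_X | M)."""
    b0, b1, b2, b3, sd_y = params["focal"]
    lp = -0.5 * ((y - (b0 + b1 * eta + b2 * m + b3 * eta * m)) / sd_y) ** 2
    for x, (g0, g1) in zip(x_items, params["loadings"]):
        lp += -0.5 * (x - (g0 + g1 * eta)) ** 2
    g04, g14, sd_eta = params["structural"]
    lp += -0.5 * ((eta - (g04 + g14 * m)) / sd_eta) ** 2
    return lp

def mh_step(eta, y, m, x_items, step=0.5):
    """One random-walk Metropolis-Hastings update for a factor score."""
    prop = eta + rng.normal(0.0, step)
    if np.log(rng.uniform()) < log_density(prop, y, m, x_items) - log_density(eta, y, m, x_items):
        return prop
    return eta

# Short chain of latent imputations for one person (hypothetical data values)
eta = 0.0
draws = []
for _ in range(2000):
    eta = mh_step(eta, y=1.2, m=1.0, x_items=[0.9, 0.7, 1.1])
    draws.append(eta)
```

Because the acceptance ratio only requires the target density up to a constant, the algorithm sidesteps the algebra of combining the component normal curves into a single closed-form distribution.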
Because they function as dependent variables, the item-level missing values are
determined solely by the factor model; that is, the probit regressions from Equation
10.59 are the imputation models, and MCMC samples latent response scores from a
univariate normal distribution. To illustrate, the distribution of missing values for X1 is

X1*i ~ N1(E(X1*i | ηXi), 1)     (10.64)

where the predicted value in the function’s first argument is computed by substitut-
ing a person’s current factor score into a probit regression equation. The normal curve
generates latent scores for cases with observed data as well, but the threshold param-
eters restrict those values to a particular region of the distribution. Finally, MCMC sam-
ples missing Y scores from a normal distribution that depends only on the focal model
parameters (imputation = predicted value + noise).
The factored regression specification readily accommodates a latent factor as a
dependent variable as well. To illustrate, suppose the outcome is measured with three
questionnaire items, Y1 to Y3. A latent factor ηY replaces the Y scale score in the analysis
model, as follows:

ηYi = β0 + β1(ηXi) + β2(Mi) + β3(ηXi)(Mi) + εi     (10.65)

Reusing previous notation, the measurement model linking the factor to the items again
consists of three probit regressions with latent response variables as outcomes.

Y1*i = γ01 + γ11(ηYi) + r1i
Y2*i = γ02 + γ12(ηYi) + r2i     (10.66)
Y3*i = γ03 + γ13(ηYi) + r3i

As before, the residual variances are fixed at one to establish the latent response vari-
ables’ metrics, and each item’s regression additionally requires a set of threshold param-
eters, one of which is fixed at zero. I scale the latent variable by fixing the first factor
loading (the γ11 coefficient) equal to 1, and I set the structural intercept (the β0 coeffi-
cient) equal to 0 to identify the mean structure.
Assuming that items are conditionally independent of other variables after control-
ling for their respective latent variables gives the following factored regression specifica-
tion:

f(Y1, Y2, Y3, ηY, ηX, M, X1, X2, X3) = f(Y1 | ηY) × f(Y2 | ηY) × f(Y3 | ηY) ×     (10.67)
f(ηY | ηX, M, ηX × M) × f(X1 | ηX) × f(X2 | ηX) × f(X3 | ηX) × f(ηX | M) × f(M)

The first three terms after the equals sign correspond to the measurement model in
Equation 10.66, the fourth term is the focal analysis model from Equation 10.65, and
the remaining terms are the same as before. Like ηX, the MCMC algorithm treats ηY as
a variable to be imputed, and the conditional distribution of the factor scores is propor-
tional to the product of the four univariate distributions in which ηY appears.

f(ηY | ηX, M, Y1, Y2, Y3, X1, X2, X3) ∝ f(Y1 | ηY) ×     (10.68)
f(Y2 | ηY) × f(Y3 | ηY) × f(ηY | ηX, M, ηX × M)

After updating all model parameters, MCMC uses a Metropolis–­Hastings step to sample
latent imputations from this complex distribution. Like the X items, the measurement
model solely determines the distributions of the missing Y items.
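The Metropolis–Hastings step can be sketched in a few lines. The sketch below is an illustration, not the software used in the book: it assumes binary items for simplicity, and the loadings, intercepts, and structural coefficients passed to it are hypothetical. It evaluates the log of the product in Equation 10.68 (probit item likelihoods times a normal structural density) and performs one random-walk update for a single participant's ηY.

```python
import math
import random

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_post_eta_y(eta_y, items, eta_x, m_dummy, betas, loadings, intercepts, resid_var):
    """Log of the product in Equation 10.68 (up to a constant), evaluated at eta_y:
    three probit item likelihoods times the structural regression density."""
    lp = 0.0
    for y, lam, tau in zip(items, loadings, intercepts):
        p = phi(tau + lam * eta_y)          # P(Y = 1 | eta_y) for a binary item
        p = min(max(p, 1e-10), 1.0 - 1e-10)
        lp += math.log(p) if y == 1 else math.log(1.0 - p)
    b0, b1, b2, b3 = betas
    mu = b0 + b1 * eta_x + b2 * m_dummy + b3 * eta_x * m_dummy
    lp += -0.5 * (eta_y - mu) ** 2 / resid_var
    return lp

def mh_update(eta_y, *args, step=0.5):
    """One random-walk Metropolis-Hastings draw for a participant's eta_y."""
    prop = eta_y + random.gauss(0.0, step)
    log_ratio = log_post_eta_y(prop, *args) - log_post_eta_y(eta_y, *args)
    return prop if math.log(random.uniform(0.0, 1.0)) < log_ratio else eta_y
```

In a full MCMC run, this update would be applied to every participant at every iteration, after the model parameters themselves have been updated.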
456 Applied Missing Data Analysis

Bayesian Estimation and Multiple Imputation


I used Bayesian estimation and model-based multiple imputation to illustrate a factored
regression specification for the following analysis model:

DISABILITY*i = β0 + β1(DEPRESS*i) + β2(MALEi) + β3(DEPRESS*i)(MALEi) + β4(PAINi) + εi     (10.69)

Following established conventions, I use an asterisk to denote latent variables, which in this case are two latent factors. The latent depression factor is centered by construction,
which facilitates the interpretation of the lower-order coefficients. At this time, it isn’t
possible to estimate this model with maximum likelihood, and Bayesian estimation and
model-based multiple imputation require specialized software that works with a series
of univariate likelihoods instead of a single multivariate distribution (Keller & Enders,
2021).
After examining trace plots and potential scale reduction factors (Gelman &
Rubin, 1992), I specified an MCMC process with 10,000 iterations following an initial
15,000-iteration burn-in period. The bottom panel of Table 10.13 summarizes the pos-
terior distributions of the parameters. To get a sense about the scaling, the standard
deviations of the latent disability and latent depression factors were approximately 0.70
and 1.46, respectively. Consistent with the scale score analysis, lower-order terms are
conditional effects that depend on scaling; Mdnβ1 = 0.30 (SD = 0.07) is the effect of latent
depression on latent psychosocial disability for female participants, and Mdnβ2 = –0.04
(SD = 0.09) is the gender difference at the latent depression factor’s mean. The negative
interaction coefficient (Mdnβ3 = –0.16, SD = 0.07) indicates that the male depression
slope was approximately 0.16 points lower than the female slope (i.e., the male slope is
Mdnβ1 + Mdnβ3 = 0.30 – 0.16 = 0.14). Although scaling differences preclude a direct com-
parison with the composite score analysis in the top panel of Table 10.13, it is notewor-
thy that the latent variable analysis produced a substantially larger R2 statistic, presumably because the latent variables are purged of measurement error.
The Bayesian analysis offers the interesting possibility of saving the latent variable
scores as multiple imputations for a frequentist analysis. Doing so has the advantage of
converting a complex measurement model into a conventional moderated regression
analysis. Because the latent factors have 100% missing data, more data sets are needed
to maximize precision and minimize Monte Carlo simulation error. In my experience,
increasing the number of imputations from 100 to 500 can have a meaningful impact
on test statistics and probability values with additional increases providing diminish-
ing returns. To this end, I created M = 500 filled-­in data sets by saving the imputa-
tions and latent variable scores from the final iteration of 100 parallel MCMC chains
with 15,000 iterations each. Importantly, the product term in Equation 10.69 is not an
imputed variable. Rather, MCMC selects latent factor scores that make sense when mul-
tiplied by gender. Prior to creating the product, I converted both sets of latent imputa-
tions to z-scores to enhance their interpretability. After multiplying the factor scores by
the gender dummy code, I fit the regression model from Equation 10.69 to the data (no measurement model is needed at this point). Finally, I used Rubin's (1987) pooling rules to combine the estimates and standard errors and applied Barnard and Rubin's (1999)
degrees of freedom expression to the significance tests.
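Rubin's pooling rules and the Barnard–Rubin degrees of freedom are simple enough to compute directly. The sketch below is a generic illustration (the function name and the toy inputs are mine, not values from the book): it pools M point estimates and their squared standard errors, then returns the pooled estimate, standard error, estimated fraction of missing information, and adjusted degrees of freedom.

```python
import math

def pool_rubin(estimates, variances, nu_com):
    """Rubin's (1987) pooling rules with Barnard & Rubin's (1999) degrees of freedom.
    estimates: M point estimates; variances: M squared standard errors;
    nu_com: complete-data degrees of freedom (e.g., N minus number of parameters)."""
    m = len(estimates)
    qbar = sum(estimates) / m                                # pooled point estimate
    ubar = sum(variances) / m                                # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    t = ubar + (1 + 1 / m) * b                               # total sampling variance
    gamma = (1 + 1 / m) * b / t                              # ~ fraction of missing information
    nu_m = (m - 1) / gamma ** 2                              # classic Rubin (1987) df
    nu_obs = (nu_com + 1) / (nu_com + 3) * nu_com * (1 - gamma)
    nu_tilde = 1 / (1 / nu_m + 1 / nu_obs)                   # Barnard-Rubin adjusted df
    return qbar, math.sqrt(t), gamma, nu_tilde

# Toy example: three imputed-data estimates of a slope, each with SE = 0.07
est, se, fmi, df = pool_rubin([0.30, 0.32, 0.28], [0.0049, 0.0049, 0.0049], nu_com=430)
```

With many more imputations (e.g., the M = 500 used here), the between-imputation variance and the fraction of missing information are estimated far more stably.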
The bottom panel of Table 10.14 summarizes the multiple imputation results.
Because the latent variables are standardized, β̂1 = 0.65 (SE = 0.15) gives the standard
deviation unit change for an increase of one standard deviation in latent depression
scores among female participants, and β̂2 = –0.07 (SE = 0.13) is the standardized gender
difference at the latent depression factor’s mean. The negative interaction coefficient
(β̂3 = –0.35, SE = 0.15) indicates that the male depression slope was approximately half
that of females. It is important to highlight that the fractions of missing information
(the proportion of a squared standard error due to the missing data) were very high (e.g.,
values ranging between .36 and .78), owing to the fact that the latent factor scores have
100% missing data. Although scaling differences preclude a direct comparison with the
composite score analysis in the top panel of Table 10.14, the relative magnitude of the
test statistics suggests that the latent variable analysis had less power. However, there
appears to be a trade-off, as the latent variable analysis produced a substantially larger R2 sta-
tistic, presumably due to the reduction in measurement error.

10.9 LONGITUDINAL DATA ANALYSES

This section describes longitudinal missing data in the context of a latent growth curve
model. In many applications, there is more than one way to treat longitudinal missing-
ness, and two approaches that seemingly invoke the same conditionally MAR assump-
tion could give different answers (Gottfredson, Sterba, & Jackson, 2017). I use the psy-
chiatric trial data on the companion website to illustrate a growth curve analysis with
three different missing data treatments: maximum likelihood estimation based on a
latent curve analysis, agnostic multiple imputation (fully conditional specification) fol-
lowed by a latent curve analysis, and multilevel multiple imputation. I used these data
extensively in Chapter 9 to illustrate analyses for MNAR processes.
The psychiatric trial data consist of repeated measurements of illness severity rat-
ings measured in half-point increments ranging from 1 (normal, not at all ill) to 7 (among
the most extremely ill). In the original study, the 437 participants with schizophrenia
were assigned to one of four experimental conditions (a placebo condition and three
drug regimens), but the data collapse these categories into a dichotomous treatment
indicator (DRUG = 0 for the placebo group, and DRUG = 1 for the combined medication
group). The researchers collected a baseline measure of illness severity prior to random-
izing participants to conditions, and they obtained follow-­up measurements 1 week, 3
weeks, and 6 weeks later. The overall missing data rates for the repeated measurements
were 1, 3, 14, and 23%, and these percentages differ by treatment condition; 19 and
35% of the placebo group scores were missing at the 3-week and 6-week assessments,
respectively, versus 13 and 19% for the medication group. Table 9.3 shows the missing
data patterns.
As a review, a longitudinal growth curve model (also called a linear mixed model
and a multilevel model) is a type of regression where repeated measurements are a
function of a temporal predictor that codes the passage of time, in this case weeks. To facilitate interpretation, researchers usually code one of the measurement occasions as zero and set the others relative to that fixed point. One common option expresses time
relative to the baseline assessment (e.g., WEEK = 0, 1, 3, 6), and another reflects the
“time scores” relative to the final measurement (e.g., WEEK = –6, –5, –3, 0). I use the
former definition for the ensuing examples. The observed means follow a nonlinear
trend with a pronounced reduction at the 1-week follow-­up and more gradual change
after that. Published illustrations with these data linearize the trend lines by modeling
illness severity as a function of the square root of weeks since the baseline assessment
(Demirtas & Schafer, 2003; Hedeker & Gibbons, 1997), and I do the same here. Taking
the square root of the time scores creates a variable SQRTWEEK that codes the measure-
ment occasions as 0 = 0, 1 = 1, 3 = 1.73, and 6 = 2.45. Figure 9.17 shows that the
transformation compresses elapsed time after the 1-week follow-­up.
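The square-root recoding is easy to verify; this snippet simply reproduces the time codes quoted above and shows how the gaps between successive codes shrink:

```python
import math

week = [0, 1, 3, 6]
sqrtweek = [round(math.sqrt(w), 2) for w in week]
print(sqrtweek)  # [0.0, 1.0, 1.73, 2.45]

# The gaps between successive codes shrink, compressing elapsed time
# after the 1-week follow-up
gaps = [round(b - a, 2) for a, b in zip(sqrtweek, sqrtweek[1:])]
print(gaps)  # [1.0, 0.73, 0.72]
```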
The growth curve model features an average linear trajectory for each condition
with individual variation around the mean intercept and slope. Starting at the individual
level, the within-­person linear model for participant i is

SEVERITYti = β0i + β1i ( SQRTWEEKti ) + εti (10.70)

where SEVERITYti is an individual's illness severity rating at occasion t, SQRTWEEK is the temporal predictor or "time variable," β0i is a participant's expected baseline illness
severity score (i.e., the predicted value when SQRTWEEK = 0), and β1i is his or her lin-
ear latent growth rate. Finally, εti is a time-­specific residual that captures the distances
between the observed data and the individual trajectories. By assumption, these residu-
als are normally distributed with a constant variance σε2.
The person-­specific intercepts and slopes function as between-­person latent vari-
ables or random effects. A pair of between-­person regressions feature treatment condi-
tion as a predictor of the individual intercepts and slopes.

β0i = β0 + β2 ( DRUGi ) + b0i (10.71)

β1i = β1 + β3 ( DRUGi ) + b1i

where β0 and β1 are the placebo group’s average intercept and slope, respectively, β2 is
the baseline mean difference for the medication condition, and β3 is the difference in
the mean change rate for this group. The b0i and b1i terms are deviations between the
group-­average trajectories and the individual intercepts and slopes. By assumption, the
latent residuals are bivariate normal with a covariance matrix Σb. Replacing β0i and β1i
in Equation 10.70 with the right sides of their expressions from Equation 10.71 reveals
that β3 is a group-by-time interaction effect.

SEVERITYti = (β0 + b0i) + (β1 + b1i)(SQRTWEEKti) + β2(DRUGi) + β3(SQRTWEEKti)(DRUGi) + εti     (10.72)
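To make the combined model concrete, the sketch below generates artificial data from Equation 10.72. The function name and the variance inputs are my own loose guesses (roughly matching the estimates reported later in Table 10.15), not values from the book, and the intercept–slope covariance is ignored for simplicity.

```python
import random

def simulate_severity(n=437, beta=(5.35, -0.35, 0.05, -0.63),
                      sd_b0=0.6, sd_b1=0.48, sd_eps=0.77, seed=1):
    """Generate long-format data (person, sqrtweek, drug, severity) from the
    combined growth model in Equation 10.72, with independent normal random
    intercepts, random slopes, and time-specific residuals."""
    random.seed(seed)
    b0f, b1f, b2f, b3f = beta
    sqrtweek = [0.0, 1.0, 1.73, 2.45]
    data = []
    for i in range(n):
        drug = 1 if i < n // 2 else 0                 # crude half-and-half assignment
        b0i = random.gauss(0.0, sd_b0)                # person-specific intercept deviation
        b1i = random.gauss(0.0, sd_b1)                # person-specific slope deviation
        for t in sqrtweek:
            y = (b0f + b0i) + (b1f + b1i) * t + b2f * drug \
                + b3f * t * drug + random.gauss(0.0, sd_eps)
            data.append((i, t, drug, y))
    return data
```

Note how the generating equation makes β3 a group-by-time interaction: the medication group's expected change rate is β1 + β3 per square-root-week unit.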

An important feature of this data set is that all participants share the same assess-
ment schedule (i.e., the time scores are constant across participants instead of variable).

Designs like this provide the greatest flexibility for missing data handling, because the
repeated measurements can be treated as separate variables in a multivariate analysis
(i.e., single-­level data in wide format) or as a single variable with multiple observations
nested within individuals (i.e., multilevel data in long or stacked format). Because miss-
ingness is relegated to the dependent variable, any multilevel software package that sim-
ply removes measurement occasions with missing data gives optimal maximum likeli-
hood estimates (von Hippel, 2007), and a single-­level latent curve structural equation
model gives identical results. The same is true for Bayesian estimation, where multilevel
and structural equation growth models are conceptually equivalent (albeit with differ-
ent parameterizations and different prior distributions).
Single-­level and multilevel multiple imputation are two routes that don’t neces-
sarily produce identical results. Single-­level agnostic approaches such as joint model
imputation and fully conditional specification do not impose a pattern on the means
or variance–­covariance matrix, making it unnecessary to specify a functional form for
growth. For example, the joint modeling approach draws imputations from a multi-
variate normal distribution, where each illness severity rating has a unique mean and
group mean difference, and fully conditional specification deploys a set of round-robin
regression models, where each illness severity variable is imputed conditional on the
treatment assignment indicator and all other repeated measurements. Both approaches
automatically preserve group-­specific changes, as well as any nonlinearities that may be
present in the data.
In contrast, model-based multilevel imputation adheres exactly to the focal model’s
time trend. For example, using Equation 10.72 as an imputation model presupposes
that the individual and average change trajectories are linear within each condition.
Tailoring the filled-­in data to a particular analysis is precisely the goal of model-based
imputation, but it is important to emphasize that the resulting imputations are inappro-
priate for exploring nonlinear change. If you are unsure about the functional form, it is
appropriate to adopt a more complex imputation model and fit simpler nested models to
the data. For example, the following quadratic growth model would likely do a good job
of capturing the nonlinearities in these data, and the resulting imputations would also
support a linear growth curve model

SEVERITYti = (β0 + b0i) + (β1 + b1i)(WEEKti) + β2(WEEK²ti) + β3(DRUGi) + β4(WEEKti)(DRUGi) + β5(WEEK²ti)(DRUGi) + εti     (10.73)

Despite their apparent flexibility, single-level agnostic imputation schemes are not necessarily preferable, as Gottfredson et al. (2017) show that multilevel imputation models can offer substantial protection against random coefficient-dependent
MNAR processes where one’s underlying growth trajectory is responsible for missing
data (e.g., participants experiencing the most rapid declines in illness severity might
quit the study, because they judge that treatment is no longer necessary, whereas indi-
viduals with elevated and flatter trajectories might drop out to seek treatment else-
where). Although multilevel imputation does not reduce bias to the same degree as a
shared parameter model for this process (see Section 9.11), it can outperform single-­level
imputation. Multilevel imputation also provides a way to fill in data sets from cohort-sequential or cross-sequential planned missing designs (see Section 1.9) where single-
level unrestricted imputation models are inestimable.

Maximum Likelihood Estimation


I begin by fitting the analysis as a latent curve structural equation model (Bollen & Cur-
ran, 2005; Grimm et al., 2016). The structural equation model views repeated measure-
ments as manifest indicators of an intercept and slope latent factor with fixed loadings
that encode the time scores. The path diagram for the analysis resembles Figure 9.10 but
features an additional measurement and different slope factor loadings (i.e., loadings
equal to the square-root time scores: 0, 1, 1.73, and 2.45). I fit the model with structural equation
modeling software, and I used robust or sandwich estimator standard errors to compen-
sate for the negative skewness and excess kurtosis of the illness severity ratings.
The leftmost panel in Table 10.15 shows parameter estimates and standard errors
from the analysis. As you would expect from a randomized experiment, the two condi-
tions were effectively identical at baseline (β̂2 was close to 0 and had a nonsignificant
test statistic). The placebo group’s illness severity ratings decreased by roughly one-
third of a point per time unit (β̂1 = –0.35), and the medication condition improved at
nearly triple that rate, on average (β̂1 + β̂3 = –0.35 – 0.63 = –0.98). Figure 9.15 shows the
group-­specific growth trajectories as solid lines.

Model‑Based Multilevel Multiple Imputation


Model-based multiple imputation tailors imputations to the multilevel growth model
from Equation 10.72. Following the procedure described in Section 8.3, MCMC samples
imputations from a normal distribution centered around an individual’s own linear tra-
jectory. To illustrate, I created 100 imputations by saving a single data set from 100 parallel MCMC chains, each with 2,000 iterations.

TABLE 10.15. Growth Curve Estimates for Three Missing Data Methods

                                  ML             MLM            FCS
Effect                        Est.    SE     Est.    SE     Est.    SE
Intercept (β0)                5.35   0.08    5.35   0.09    5.37   0.09
SQRTWEEK (β1)                –0.35   0.06   –0.35   0.07   –0.38   0.07
DRUG (β2)                     0.05   0.10    0.05   0.10    0.03   0.10
SQRTWEEK × DRUG (β3)         –0.63   0.07   –0.63   0.08   –0.60   0.08
Intercept variance (σ²b0)     0.36   0.06    0.36    —      0.38   0.06
Slope variance (σ²b1)         0.23   0.03    0.23    —      0.23   0.04
Covariance (σb0b1)            0.02   0.04    0.03    —      0.01   0.04
Residual variance (σ²ε)       0.59   0.04    0.59    —      0.61   0.04

Note. ML, maximum likelihood; FCS, single-level fully conditional specification; MLM, multilevel model-based multiple imputation.

I then fit the multilevel growth model to
each data set and used Rubin’s (1987) pooling rules to combine the estimates and stan-
dard errors. The middle panel of Table 10.15 shows the resulting estimates, which were
effectively equivalent to a latent curve analysis with maximum likelihood estimation.
As mentioned previously, creating imputations that condition on the individual trajecto-
ries in this way can offer substantial protection against a random coefficient-­dependent
missingness process (Gottfredson et al., 2017).

Single‑Level Multiple Imputation


Joint model imputation and fully conditional specification are agnostic in the sense that
they do not impose a particular structure or pattern on the means and associations. In
the parlance of the structural equation modeling literature, the imputation phase uses
a saturated or just-­identified model (Bollen, 1989; Kline, 2015) that spends all available
degrees of freedom. The joint modeling framework invokes a multivariate normal dis-
tribution for the drug indicator and the repeated measurements, and the corresponding
imputation model parameters are a mean vector and variance–­covariance matrix. The
five-­dimensional normal distribution is as follows:
 [ SEVERITY0i ]         ( [ μ1 ]   [ σ1²                           ] )
 [ SEVERITY1i ]         ( [ μ2 ]   [ σ2·1   σ2²                    ] )
 [ SEVERITY3i ]  ~  N5  ( [ μ3 ] , [ σ3·1   σ3·2   σ3²             ] )     (10.74)
 [ SEVERITY6i ]         ( [ μ4 ]   [ σ4·1   σ4·2   σ4·3   σ4²      ] )
 [ DRUG*i     ]         ( [ μ5 ]   [ σ5·1   σ5·2   σ5·3   σ5·4   1 ] )
Following established notation, the asterisk superscript denotes a latent response vari-
able, the variance of which is fixed at one to establish a scale. The model also incorpo-
rates a single fixed threshold that divides the latent response distribution into two seg-
ments. The imputation model highlights that the means freely vary and do not adhere
to a particular functional form.
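Under the joint model, imputations for a participant's missing entries come from the conditional normal distribution implied by Equation 10.74, given that person's observed values. The numpy helper below is my own generic sketch (it treats the latent DRUG* row like any other variable and assumes the mean vector and covariance matrix have already been drawn by MCMC):

```python
import numpy as np

def impute_conditional_normal(y, mu, sigma, rng):
    """Draw one joint-model imputation for the missing entries of y (np.nan = missing)
    from the conditional normal given the observed entries. mu is a numpy mean
    vector and sigma a numpy covariance matrix, as in Equation 10.74."""
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    if not miss.any():
        return y
    obs = ~miss
    # Partition the covariance matrix into missing/observed blocks
    s_mm = sigma[np.ix_(miss, miss)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_oo = sigma[np.ix_(obs, obs)]
    w = s_mo @ np.linalg.inv(s_oo)                 # regression weights for prediction
    cond_mu = mu[miss] + w @ (y[obs] - mu[obs])    # conditional mean
    cond_sigma = s_mm - w @ s_mo.T                 # conditional (residual) covariance
    out = y.copy()
    out[miss] = rng.multivariate_normal(cond_mu, cond_sigma)
    return out
```

For example, with an identity covariance matrix the missing entry is simply a standard normal draw, because the observed scores carry no information about it.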
A fully conditional specification scheme uses a sequence of regression models to
impute variables in a round-robin fashion. This analysis requires four regression mod-
els, one per repeated measurement. To illustrate, the linear regression imputation model
for the baseline severity scores is

SEVERITY0i = γ01 + γ11(SEVERITY1i) + γ21(SEVERITY3i) + γ31(SEVERITY6i) + γ41(DRUGi) + r1i     (10.75)

and the corresponding imputation model for the 6-week follow-­up scores is as follows:

SEVERITY6i = γ04 + γ14(SEVERITY0i) + γ24(SEVERITY1i) + γ34(SEVERITY3i) + γ44(DRUGi) + r4i     (10.76)

Again, the equations highlight that the means (regression intercepts) are free to vary at
each time point and do not follow a particular functional form.
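The round-robin scheme in Equations 10.75 and 10.76 can be sketched with ordinary least squares. This is a simplified illustration, not production imputation code: a proper fully conditional specification algorithm also draws the regression coefficients and residual variance from their posterior distributions at each cycle, which this sketch omits.

```python
import numpy as np

def fcs_round_robin(data, n_cycles=10, seed=0):
    """Impute each incomplete column from a linear regression on all other columns,
    cycling repeatedly (a stripped-down fully conditional specification)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float)
    miss = np.isnan(x)
    # Start from column means so every regression has complete predictors
    col_means = np.nanmean(x, axis=0)
    for j in range(x.shape[1]):
        x[miss[:, j], j] = col_means[j]
    for _ in range(n_cycles):
        for j in range(x.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(np.arange(x.shape[1]), j)
            pred = np.column_stack([np.ones(len(x)), x[:, others]])
            # Least-squares fit on the originally observed rows of column j
            coef, *_ = np.linalg.lstsq(pred[~miss[:, j]], x[~miss[:, j], j], rcond=None)
            resid = x[~miss[:, j], j] - pred[~miss[:, j]] @ coef
            fitted = pred[miss[:, j]] @ coef
            # Replace missing values with predictions plus residual noise
            x[miss[:, j], j] = fitted + rng.normal(0.0, resid.std(), miss[:, j].sum())
    return x
```

Applied to the psychiatric trial data, each severity column would be regressed on the other three severity columns and the treatment indicator, exactly as in Equations 10.75 and 10.76.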

I again created 100 imputations by saving a single data set from 100 parallel MCMC
chains, each with 2,000 iterations. I then fit a single-­level latent growth model to each
data set and used Rubin’s (1987) pooling rules to combine the estimates and standard
errors. The rightmost panel of Table 10.15 shows the estimates, which have the same
interpretation as the other analyses. Single-­level multiple imputation produced modest
but noticeable differences when compared to direct maximum likelihood and multi-
level model-based imputation; the placebo group’s growth rate was lower (steeper, more
negative) by nearly half a standard error unit; and the medication group’s slope differ-
ence was higher (flatter, less negative) by roughly the same amount. One explanation
for these differences is that the estimates are simply noisier, because the wide-­format
fully conditional specification requires more parameters, but the standard errors don’t
support this explanation. Although there is no way to know the exact cause, differences
of this magnitude could arise, because the methods that condition on random effects
offer some protection against an MNAR mechanism where the individual intercepts and
slopes determine missingness (Gottfredson et al., 2017). This explanation seems likely
given that the analysis examples in Section 9.13 found support for an MNAR process.

10.10 REGRESSION WITH A COUNT OUTCOME

Missing data-­handling methods for discrete data have evolved considerably since the
first edition of this book, and earlier chapters and analysis examples illustrated esti-
mation and imputation approaches for binary, ordinal, and multicategorical variables.
Bayesian methods (including model-based multiple imputation) for incomplete count
outcomes are a more recent innovation (Asparouhov & Muthén, 2021b; Neelon, 2019;
Polson et al., 2013), and these routines are beginning to appear in statistical software
packages (Keller & Enders, 2021; Muthén & Muthén, 1998–2017). This section summa-
rizes this approach and provides a data analysis example.
I use substance use data on the companion website to illustrate missing data han-
dling for a regression model with a count outcome. The data set includes a subset of N =
1,500 respondents from a national survey on substance use patterns and health behav-
iors. I previously used these data in Section 10.3 to illustrate a logistic regression with a
dichotomous measure of drinking frequency as the outcome, and this example repeats
that analysis with the number of drinking days per month as the dependent variable.
Drinking frequency features excessive zeros from a large proportion of respondents who
reported no lifetime alcohol use (i.e., so-­called “structural zeros”). I excluded these 483
individuals from consideration, thereby defining the population of interest as people who
would potentially consume alcohol. Figure 10.11 shows the observed-­data distribution.
Either Poisson or negative binomial regression is appropriate for count outcomes
such as number of drinking days. Both are linear models with the natural logarithm of
the counts as the outcome. The regression model for this example is

ln(ALCDAYSi) = β0 + β1(AGETRYALCi) + β2(COLLEGEi) + β3(AGEi) + β4(MALEi)     (10.77)

FIGURE 10.11. Bar graph of number of drinking days per month.


ALCDAYSi is the predicted number of drinking days per month for individual i,
AGETRYALC is the age at which the respondent first tried alcohol, COLLEGE is a dummy
code indicating some college or a college degree, and MALE is a gender dummy code
with females as the reference group. Approximately 8.4% of the dependent variable
scores are missing, and 15.9% of the educational attainment values are unknown. I
use negative binomial rather than Poisson regression, because the former incorporates
a dispersion parameter that accommodates heterogeneity among individuals with the
same predicted count (the model simplifies to a Poisson regression when the dispersion
parameter equals 0). Interested readers can consult Coxe, West, and Aiken (2009) for an
excellent tutorial on regression models for count data.
The familiar factored regression specification readily accommodates count regres-
sion. The factorization for this analysis features a different dependent variable, but
it otherwise has the same structure as the earlier one from Equation 10.8. Moreover,
Bayesian estimation for count regression models is very similar to the data augmenta-
tion procedure for logistic models described in Section 6.9. To refresh, the procedure
introduces latent response scores and person-specific weights as a rescaling trick that
allows regression coefficients to be estimated using the same machinery as linear regres-
sion models (see Equation 6.52). The MCMC algorithm cycles between four major steps:
Estimate person-specific weights that determine the latent response variable scores,
estimate the regression coefficients given the current latent data and weights, update
the dispersion parameter, and sample discrete imputes by drawing values from a negative binomial distribution function. This process requires an additional step for the dis-
persion parameter, but it otherwise mimics Bayesian estimation for logistic regression
models. I point interested readers to Polson et al. (2013) and Asparouhov and Muthén
(2021b) for additional details.
I used Bayesian estimation with model-based multiple imputation to estimate the
regression model in Equation 10.77. To reiterate, I used a negative binomial regression
that uses a dispersion parameter to account for unobserved heterogeneity in the counts,
as this approach imposes more flexible assumptions than Poisson regression (e.g., the
model allows for variation among individuals with the same predicted count). This
choice has no bearing on the interpretation of the coefficients, as negative binomial and
Poisson coefficients have the same meaning. After inspecting trace plots and potential
scale reduction factor diagnostics (Gelman & Rubin, 1992), I created 100 filled-­in data
sets by saving the imputations from the final iteration of 100 parallel chains, each with
1,000 iterations.
Table 10.16 shows the model-based multiple imputation estimates (not surprisingly,
the Bayesian results that generated the imputations were numerically equivalent). The
slope coefficients reflect the change in the logarithm of the counts for a one-unit change
in a predictor. Although the coefficients don’t reflect the natural metric of the dependent
variable, we can nevertheless conclude from their signs that the number of drinking
days increased for individuals who tried alcohol at an earlier age, attended at least some
college, are older, and are males. The exponentiated coefficients in the rightmost col-
umn reflect the results on the count metric. For example, the intercept is the predicted
number of drinking days for a person with zeros on all predictors (a nonsensical score
profile). Paralleling the logic of odds ratios in logistic regression, the exponentiated
slope coefficients give the multiplicative effect of a one-unit change in the predictors on
the counts. For example, the model predicts that the number of drinking days for males
is 1.79 times that for females, controlling for other variables. Similarly, considering
the effect of trying alcohol at age 19 versus age 18, we would expect the 19-year-old to
drink 93% as many days as the 18-year-old. Finally, the large and significant dispersion
parameter suggests there is residual heterogeneity among individuals who share the
same predicted count.
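The exponentiation in the rightmost column is straightforward to reproduce from the printed coefficients (small discrepancies, such as exp(0.26) = 1.30 versus the tabled 1.29, reflect rounding of the reported estimates). The score profile below is my own hypothetical illustration:

```python
import math

# Slope estimates from Table 10.16, on the log-count metric
coefs = {"AGETRYALC": -0.07, "COLLEGE": 0.26, "AGE": 0.01, "MALE": 0.58}
rate_ratios = {name: round(math.exp(b), 2) for name, b in coefs.items()}
print(rate_ratios)  # {'AGETRYALC': 0.93, 'COLLEGE': 1.3, 'AGE': 1.01, 'MALE': 1.79}

# Predicted drinking days for a hypothetical male respondent, age 40, with some
# college, who first tried alcohol at 18: exponentiate the linear predictor
eta = 1.66 + (-0.07) * 18 + 0.26 * 1 + 0.01 * 40 + 0.58 * 1
print(round(math.exp(eta), 2))  # about 5.16 days per month
```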

TABLE 10.16. Multiple Imputation Estimates from a Negative Binomial Regression Analysis with a Count Outcome

Effect             Est.     SE       z        p    FMI   exp(β)
β0                 1.66   0.33    5.07   < .001    .09     5.29
β1 (AGETRYALC)    –0.07   0.02   –3.45   < .001    .06     0.93
β2 (COLLEGE)       0.26   0.12    2.09      .04    .25     1.29
β3 (AGE)           0.01   0.00    2.98   < .001    .11     1.01
β4 (MALE)          0.58   0.11    5.28   < .001    .14     1.79
α (Dispersion)     2.40   0.15   16.11   < .001    .09        —

Note. FMI, fraction of missing information.



10.11 POWER ANALYSES FOR GROWTH MODELS WITH MISSING DATA

The final example illustrates power analyses for growth curve models with missing
data. I demonstrate the process for the wave planned missingness designs from Sec-
tion 1.9, as well as for unplanned conditionally MAR data. I use Monte Carlo computer
simulations for this purpose, because they are relatively easy to implement and ideally
suited for a wide range of analysis models beyond longitudinal growth curves (Muthén
& Muthén, 2002). The goal of a computer simulation is to generate many artificial data
sets with known population parameters and examine the distributions of the estimates
and test statistics across those many samples. Each of the artificial data sets has miss-
ing values that follow a desired process, and FIML estimation provides the parameter
estimates. The proportion of artificial data sets that produce a significant test statistic is
a simulation-­based power estimate.
Generating realistic population parameters is by far the hardest part of a computer
simulation. For the purposes of illustration, I consider a longitudinal study with five
weekly assessments, and I use the same growth model as the previous example. Using
generic notation, the analysis model and its population values are

Yti = (β0 + b0i) + (β1 + b1i)(WEEKti) + β2(Xi) + β3(WEEKti)(Xi) + εti
    = (50 + b0i) + (.71 + b1i)(WEEKti) + 0(Xi) + .92(WEEKti)(Xi) + εti     (10.78)

 [ b0i ]                       [ 49.15   3.33 ]
 [ b1i ]  ~ N2(0, Σb),    Σb = [  3.33   2.50 ]

 εti ~ N1(0, σε²),    σε² = 41

where WEEK is the temporal predictor that codes occasions relative to the baseline mea-
surement (i.e., WEEK = 0, 1, 2, 3, and 4), X is a binary between-­person or time-­invariant
covariate (e.g., intervention status, demographic characteristic), and all other terms are
the same as their counterparts from Equation 10.72.
I used the following process to generate parameter values: To begin, I arbitrarily
fixed the baseline mean and standard deviation equal to 50 and 10, respectively, and
I used .50 as an intraclass correlation (i.e., between-­person variation comprised 50%
of the variation, a typical value for longitudinal data). These constraints allowed me
to use effect size expressions from Rights and Sterba (2019) to solve for the growth
model parameters. To induce a small amount of normative growth in the X = 0 group,
the fixed effect of WEEK explained 2% of the within-­person variation, and the group-
by-time interaction explained an additional 6% of the variability. To mimic a scenario
where two groups are identical at baseline, I set the R2 value for the between-­cluster
effect of X equal to zero. Finally, the random slope variance accounted for 10% of the
total variation (in my experience with longitudinal data, variance explained by the
random slopes is often between 5 and 10%), and the residual correlation between the
random intercepts and slopes was .30. These inputs produced the population param-
eters in Equation 10.78. An R tool that performs these calculations is available on the
companion website.

Power for Wave Planned Missingness Designs


Armed with reasonable guesses about the population parameters, we can now walk
through the process of estimating power for a wave planned missing data design (see
Section 1.9). To keep the number of missing data patterns manageable, I consider a
design where every participant skips two of the five measurement occasions. These
features define a pool of 10 combinations of three measurement occasions across the
5-week study: (1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 3, 5), (1, 4, 5), (2, 3, 4), (2, 3, 5),
(2, 4, 5), and (3, 4, 5). Not all of these patterns are equally effective, so it is important to
choose a combination that maximizes power. Wu et al. (2016) use brute-force computer
simulations to search for effective combinations, and Brandmaier et al. (2020) provide
an analytic solution that estimates a precision-­like quantity for each candidate pattern.
I used the latter procedure to select three patterns that maximize efficiency, then used
computer simulation to estimate power for that design.
Brandmaier et al. (2020) define effective error as the measurement error associated
with an individual’s latent random slope. This quantity, which depends on the configu-
ration of measurement occasions and the model parameters, relates to the precision of
the random slope variance. The authors show that the same missing data patterns that
maximize precision of the random slope variance are also optimal for testing average
growth rates and predictors of growth. Using the authors’ R functions, I determined
that the patterns with the lowest effective error (and thus the greatest precision) are (1,
2, 5), (1, 4, 5), and (1, 3, 5), and the patterns with the highest effective error (and thus
the lowest precision) are (1, 2, 3), (2, 3, 4), and (3, 4, 5). Notice that the optimal pat-
terns maximize the variation of the measurement occasions in parentheses, whereas
the worst patterns minimize this variance. Because the top three patterns were clearly
superior to the other seven combinations, I investigated power for a wave missing data
design with these three groups.
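Although Brandmaier et al.'s (2020) effective error calculation folds in the growth model's variance components, the ranking of candidate patterns reported above can be reproduced with a much simpler proxy: the sum of squared deviations of each pattern's time scores from their mean, which is the denominator of the error variance of an individual's OLS slope. The sketch below (a rough stand-in for the authors' R functions, not the functions themselves) enumerates the 10 candidate patterns from the 5-week design and ranks them by this spread.

```python
from itertools import combinations

# All 10 ways to retain three of the five measurement occasions
patterns = list(combinations(range(1, 6), 3))

def time_spread(pattern):
    """Sum of squared deviations of the time scores from their mean.
    The error variance of an individual OLS slope is sigma^2_e divided
    by this quantity, so a larger spread means greater precision."""
    mean_t = sum(pattern) / len(pattern)
    return sum((t - mean_t) ** 2 for t in pattern)

ranked = sorted(patterns, key=time_spread, reverse=True)
best_three = ranked[:3]    # most spread in the time scores
worst_three = ranked[-3:]  # least spread in the time scores
```

Consistent with the text, the patterns with the greatest spread are (1, 2, 5), (1, 4, 5), and (1, 3, 5), and those with the least are (1, 2, 3), (2, 3, 4), and (3, 4, 5).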
Brandmaier et al.’s (2020) approach is ideally suited for identifying the optimal
missing data patterns from a candidate set, but it doesn’t produce power estimates. I use
computer simulations for this task. The computer simulation generated 5,000 random
samples of normally distributed repeated measurements from a population with the
parameters from Equation 10.78. After creating each artificial data set, the simulation
program deleted data to create three approximately equal groups with patterns (1, 2,
5), (1, 4, 5), and (1, 3, 5). Finally, I used maximum likelihood estimation to fit analysis
models to the artificial data sets, and I recorded the proportion of the 5,000 samples that
produced statistically significant estimates of the group-by-time interaction effect (i.e.,
the β3 coefficient).
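The simulation loop just described can be sketched as follows. To keep the example self-contained, I substitute hypothetical population values for the parameters in Equation 10.78 (and omit the intercept–slope correlation), and I replace the maximum likelihood growth model with a cruder two-stage analysis: a person-specific OLS slope from each participant's observed waves, followed by a z test on the group difference in mean slopes. The resulting power values are therefore only illustrative of the workflow.

```python
import numpy as np

rng = np.random.default_rng(1)
patterns = [(1, 2, 5), (1, 4, 5), (1, 3, 5)]  # the three optimal wave patterns

# Hypothetical stand-ins for the Equation 10.78 population parameters
b0, b2, b3 = 50.0, 1.0, 0.4            # intercept, time slope, group-by-time effect
sd_int, sd_slope, sd_e = 5.0, 0.5, 3.0  # random intercept, random slope, residual SDs

def one_replication(n):
    group = rng.integers(0, 2, n)
    u0 = rng.normal(0.0, sd_int, n)    # random intercepts
    u1 = rng.normal(0.0, sd_slope, n)  # random slopes
    slopes = np.empty(n)
    for i in range(n):
        # assign each participant one of the three wave patterns
        t = np.asarray(patterns[i % len(patterns)], dtype=float)
        y = (b0 + u0[i]) + (b2 + b3 * group[i] + u1[i]) * t \
            + rng.normal(0.0, sd_e, t.size)
        slopes[i] = np.polyfit(t, y, 1)[0]  # person-specific OLS slope
    m1, m0 = slopes[group == 1], slopes[group == 0]
    se = np.sqrt(m1.var(ddof=1) / m1.size + m0.var(ddof=1) / m0.size)
    return abs(m1.mean() - m0.mean()) / se > 1.96  # significant interaction?

def estimate_power(n, reps=500):
    return float(np.mean([one_replication(n) for _ in range(reps)]))
```

Calling `estimate_power(270)` then returns the proportion of replications with a significant group-by-time effect; sweeping over different values of `n` mirrors the search for the sample size that achieves .80 power.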
Repeating the simulation with different N’s revealed that a sample size of approxi-
mately 270 is necessary to achieve .80 power to detect the group-by-time interaction.
Three comparisons put this result in perspective. First, the simulation-­based power
estimate for a complete-­data design with the same sample size and no missing values
was only slightly higher at .83—strategically omitting 40% of the data points creates a
design with 96% as much power as the optimal complete-­data design! This finding is
consistent with studies showing that certain wave missing data designs have very high
precision relative to their complete-­data counterparts (Graham et al., 2001; Mistler &
Enders, 2011; Wu et al., 2016). Next, consider a scenario where the research budget fixes
the total number of assessments that can be collected, but those measurements can be
distributed across a complete-­data or wave missing data design. A wave missing data
design with 270 participants requires 810 measurements that could instead be distrib-
uted across 162 participants with complete data. The computer simulations revealed
that the complete-­data design was far less efficient and achieved power of only .63—a
20% reduction. This result is also consistent with the literature. Finally, it is instructive
to examine power for a wave missing design with different missingness patterns. Recall
that Brandmaier et al.’s (2020) approach identified (1, 2, 3), (2, 3, 4), and (3, 4, 5) as the
least efficient patterns among the 10 candidates with which I started. Deploying a design
with this configuration required a massive increase to 735 participants to achieve .80
power to detect the group-by-time interaction.
Collectively, the computer simulations suggest that, when done well, wave missing
data designs can achieve nearly optimal power while dramatically reducing data collec-
tion resources and respondent burden. However, the simulations also show that, when
done poorly, planned missingness can result in a catastrophic reduction in power. For-
tunately, it is relatively straightforward to create good designs, as analytic solutions and
Monte Carlo computer simulation make it easy to identify patterns that maximize preci-
sion and vet their power. The best patterns maximize the variability of the time scores.

Power for Unplanned Missing Data


For the second example, I used computer simulation to estimate power for unplanned
missing data where the missingness probabilities increased over time and as a func-
tion of the observed data. The first two measurements had complete data, and I set the
overall missing data rates at the last three occasions to 10, 20, and 40%. This pattern
is common in longitudinal studies where participants tend to quit at later waves. The
probability of missing data at wave t also increased or decreased as a function of the
baseline scores and the covariate. I used a probit regression model to link missingness
to the observed data.
As a quick recap, probit regression envisions binary scores as originating from a
latent response variable that now represents a normally distributed propensity for miss-
ing data. The model also includes a threshold parameter τ that divides the latent response
distribution into two segments, such that participants with missing and observed data
have latent scores above and below the threshold, respectively. The link between the
latent response variable and its manifest missing data indicator is as follows:

Mti = 0  if  M*ti ≤ τ
Mti = 1  if  M*ti > τ        (10.79)

where Mti is the missing data indicator for occasion t (0 = observed, 1 = missing),
M*ti is the underlying latent variable, and the threshold parameter τ is fixed to zero
to identify the latent response variable’s mean structure. The predicted probability of
missing data is computed as the area under the normal curve above the threshold (see
Equation 2.67).

The regression model that linked the latent response variable at occasion t to the
observed data is as follows:

M*ti = γ0 + γ1(Y1i) + γ2(Xi) + rti        (10.80)

To facilitate model specification, I created an R function that uses the desired R2 effect
size (the strength of the MAR mechanism) and the relative contribution of the predic-
tors to solve for the population regression coefficients. This tool is available on the
companion website. Specifying a strong selection mechanism where the two predictors
combined equally to explain 25% of the latent response variable’s variance gave the fol-
lowing regressions (as always, residual variances are fixed at 1).

M*3i = −4.18 + .003(Y1i) + .061(Xi) + r3i
M*4i = −3.74 + .003(Y1i) + .061(Xi) + r4i        (10.81)
M*5i = −3.15 + .003(Y1i) + .061(Xi) + r5i

The positive slope coefficients indicate that individual missingness probabilities
increased with higher baseline scores and membership in the X = 1 group. Although
it isn't obvious, the two predictors contributed equally to missingness; their slopes
differ only because Y1 and X have different scales. For each artificial sample, substituting
the data values and simulated residual terms into these equations produced a full set of
latent response scores, and I then deleted repeated measurements with M*ti values greater
than zero (see Equation 10.79). This procedure induced a strong MAR process with the
desired missing data rates.
Repeating the simulation with different N’s revealed that a sample size of approxi-
mately 480 is necessary to achieve .80 power to detect the group-by-time interaction.
Despite a substantial reduction in the overall percentage of missing observations in
each sample, this N is considerably larger than the 270 participants needed to optimize
power for the planned missing data design. The mechanism is partly responsible, as
power naturally varies across different missingness processes (e.g., an MAR mechanism
that depends on auxiliary variables instead of analysis variables would also have dif-
ferent power requirements). Unlike the planned missing data design, which leveraged
efficient patterns with complete data at the first and last measurement occasions, the
MAR process introduced many inefficient patterns with missing data at the final waves
(i.e., patterns with low variation in the time scores). The difference in the sample size
requirements illustrates the importance of investigating qualitatively different mecha-
nisms that also vary with respect to their strength, as both factors influence power. The
R tool on the companion website is useful in this regard, as researchers simply need to
specify the desired missing data rate, an R2 effect size that determines the strength of the
desired process (e.g., setting R2 = 0 gives a missing completely at random process where
γ1 = γ2 = 0), and the relative contribution of each analysis variable in the selection equa-
tion (e.g., I set the contribution weights of Y1 and X equal to one, and all other variables
had weights of zero). Gomer and Yuan (2021) provide additional details on simulating
different missing data processes.

10.12 SUMMARY AND RECOMMENDED READINGS

This chapter has used a series of data analysis examples to illustrate a collection of odds
and ends that include specialized topics, advanced applications, and practical issues.
These topics include missing data handling for descriptive summaries, transformation
methods for non-­normal variables, estimation and inferential procedures for path and
structural equation models, missing data handling for incomplete questionnaire items,
and longitudinal analyses. I recommend the following articles for readers who want
additional details on some of the topics from this chapter.

Alacam, E., Du, H., Enders, C. K., & Keller, B. T. (2022). A model-based approach to treating
composite scores with missing items. Manuscript submitted for publication.

Asparouhov, T., & Muthén, B. (2021b). Expanding the Bayesian structural equation, multilevel
and mixture models to logit, negative-­binomial and nominal variables. Structural Equation
Modeling: A Multidisciplinary Journal, 28, 622–637.

Brandmaier, A. M., Ghisletta, P., & von Oertzen, T. (2020). Optimal planned missing data
design for linear latent growth curve models. Behavior Research Methods, 52, 1445–1458.

Enders, C. K. (in press). Fitting structural equation models with missing data. In R. Hoyle (Ed.),
Handbook of structural equation modeling (2nd ed.). New York: Guilford Press.

Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in analysis of
change. In L. Collins & A. Sayer (Eds.), New methods for the analysis of change (pp. 335–
353). Washington, DC: American Psychological Association.

Lüdtke, O., Robitzsch, A., & West, S. G. (2020b). Regression models involving nonlinear effects
with missing data: A sequential modeling approach using Bayesian estimation. Psychologi-
cal Methods, 25, 157–181.

Wu, W., Jia, F., Rhemtulla, M., & Little, T. D. (2016). Search for efficient complete and planned
missing data designs for analysis of change. Behavior Research Methods, 48, 1047–1061.
11

Wrap‑Up

11.1 CHAPTER OVERVIEW

This final chapter addresses two very practical issues: choosing a missing data-­handling
method and reporting the results from a missing data analysis. All things being equal,
the three analytic pillars of this book—­maximum likelihood, Bayesian estimation, and
multiple imputation—­are likely to produce very similar numerical results, so the choice
of technique is often one of personal preference. However, special analysis features and
software availability may make one method preferable. The first section walks read-
ers through the main considerations and provides a recipe for selecting a method. Of
course, software availability and data analytic preferences play a major role in this deci-
sion, so the second section of the chapter reviews the current software landscape and
offers a broad-brush survey of the different software tools. The chapter concludes with
recommendations for reporting the results from a missing data analysis, and it provides
templates that illustrate the suggestions.

11.2 CHOOSING A MISSING DATA‑HANDLING PROCEDURE

With few exceptions, analyses that assume a conditionally MAR mechanism should
be the norm, as there is rarely a good justification for using atheoretical methods (e.g.,
mean imputation) or methods that assume purely unsystematic missingness (e.g., dele-
tion methods). Maximum likelihood, Bayesian estimation, and multiple imputation are
all natural choices that often produce very similar numerical results—­all things being
equal. A quick recap of these methods sets the stage for choosing a method.
The goal of maximum likelihood estimation is to identify the model parameter
values most likely responsible for producing the data. The missing data-­handling aspect
of maximum likelihood happens behind the scenes, and a researcher simply needs to
dial up a capable software package and specify a model. The estimator does not discard
incomplete data records, nor does it impute them. Rather, when confronted with miss-
ing values, maximum likelihood uses the normal curve to deduce the missing parts
of the data as it iterates to a solution (technically, the estimator marginalizes over the
missing values). The resulting parameter values are those with maximum support from
whatever data are available. Chapters 2 and 3 describe this approach.
Like maximum likelihood, the primary goal of a Bayesian analysis is to fit a model
to the data and use the resulting estimates to inform one’s research questions. How-
ever, Bayesian estimation has more of a multiple imputation flavor, because it fills in
the missing values en route to getting the parameter values. Like maximum likelihood,
missing data handling happens behind the scenes, with temporary imputations play-
ing a supporting role that simplifies estimation. While the numerical estimates tend to
match those of maximum likelihood, a Bayesian analysis requires a different inferential
framework that makes no reference to repeated sampling. Chapters 4 through 6 describe
this approach.
Unlike maximum likelihood and Bayesian estimation, multiple imputation puts
the filled-­in data front and center, and the goal is to create suitable imputations for
later analysis. A typical application consists of three major steps: Specify an imputation
model and deploy a Bayesian estimation algorithm that creates several copies of the
data, each containing different estimates of the missing values; perform one or more
analyses on the completed data sets and get point estimates and standard errors from
each; and use “Rubin’s rules” (Little & Rubin, 2020; Rubin, 1987) to combine estimates
and standard errors into a single package of results.
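The pooling step in "Rubin's rules" is simple enough to state concretely: the pooled point estimate is the average of the M imputation-specific estimates, and the pooled sampling variance combines the average squared standard error (the within-imputation variance W) with the variance of the estimates across imputations (the between-imputation variance B) as T = W + (1 + 1/M)B. A minimal sketch:

```python
def rubin_pool(estimates, std_errors):
    """Pool M point estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                              # pooled point estimate
    w = sum(se ** 2 for se in std_errors) / m              # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = w + (1.0 + 1.0 / m) * b                        # total sampling variance
    return qbar, total ** 0.5                              # estimate and pooled SE
```

For example, `rubin_pool([0.50, 0.58, 0.54], [0.10, 0.11, 0.10])` pools three imputation-specific slope estimates into a single estimate of 0.54 with a standard error slightly larger than any single-imputation value, reflecting the extra uncertainty due to missing data.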
Notwithstanding philosophical differences between the frequentist and Bayesian
paradigms, statistical theory and numerous data analysis examples from earlier chap-
ters tell us that maximum likelihood, Bayesian estimation, and multiple imputation are
usually numerically equivalent if they leverage the same assumptions and use the same
variables. As such, personal preference is often the only reason for selecting one method
over another. In truth, the most important consideration isn’t which method to use, but
rather how to accurately represent the distribution of the data. Two approaches have fea-
tured prominently throughout this book: multivariate normal distributions (possibly
with latent response variables) and factored regression specifications.
In practice, a factored regression specification is always appropriate whenever a
multivariate normal distribution is appropriate, but the converse is not true. As such,
what factors determine whether a normal distribution is a good approximation to the
data? In general, any model that features an interaction term, curvilinear effect, random
slope, or other type of nonlinearity is at odds with a multivariate normal distribution.
In contrast, additive models that lack these special terms and assume constant variation
are compatible with a normal curve. The flow chart in Figure 11.1 provides a recipe for
identifying the appropriate specification for a given analysis and variable set. It high-
lights that any analysis with an incomplete nonlinear term requires a factored regression
specification. A multivariate normal distribution accommodates the obvious scenario
where all variables are approximately continuous and normal, and adopting a latent
response variable framework further accommodates binary, ordinal, and multicategorical
nominal variables. With few exceptions, other types of response scales (e.g., count
outcomes) require a factored regression specification.

1. Analysis features a nonlinearity (interaction, polynomial, random slope)?
     YES → Factored regression specification
     NO → continue to 2
2. Analysis restricted to normal variables?
     YES → Multivariate normal distribution
     NO → continue to 3
3. Analysis restricted to normal, binary, ordinal, nominal variables?
     YES → Multivariate normal distribution with latent response variables

FIGURE 11.1. Flowchart for identifying an appropriate specification for a given analysis and
variable set.
All three contemporary missing data-­handling methods accommodate multivariate
and factored regression specifications. For maximum likelihood and Bayesian analyses,
the structural equation modeling framework is a convenient way to implement multi-
variate missing data handling, and agnostic multiple imputation schemes such as the
joint model and fully conditional specification are another option. At this point in his-
tory, Bayesian estimation and model-based multiple imputation are arguably superior
for factored regression specifications, because they support a broader range of analysis
models and offer greater flexibility with mixtures of discrete and numeric variables.
Multilevel models with random coefficients and/or mixed response types are an impor-
tant example, as unbiased maximum likelihood estimators are not widely available.
Given an appropriate distributional specification, software availability and data analytic
preferences ultimately determine which method to use. The next section reviews the
current software landscape and offers a broad-brush survey of the different software tools.

11.3 SOFTWARE LANDSCAPE

Given the pace at which computational options evolve, I’ve purposefully avoided
software-­centric presentations throughout the book. Nevertheless, software availability
and data analytic preferences play a major role in selecting a missing data-­handling
technique, so a cursory review of the current software landscape is important. Fortu-
nately, numerous mature platforms exist that will surely be around a decade from now,
and these tools will only improve over time. What follows is an incomplete snapshot of
the software landscape in 2022.
Considering general-­use commercial software, most researchers reading this book
probably use SAS, SPSS, or Stata. All three applications have structural equation mod-
eling modules that provide a general way to implement maximum likelihood missing
data handling, and all offer agnostic multiple imputation schemes. Commercial struc-
tural equation modeling software programs such as EQS (Bentler, 2000–2008), LISREL
(Jöreskog & Sörbom, 2021), and Mplus (Muthén & Muthén, 1998–2017) are also very
capable platforms for estimating a broad range of models with maximum likelihood,
and some also offer multiple imputation. Except for Mplus, the commercial programs
mentioned here generally do not support factored regression specifications and thus are
ill-­suited for models with incomplete nonlinear effects.
Mplus has no peer in the commercial software space when it comes to missing
data-­handling options, as it offers sophisticated and flexible options for maximum like-
lihood, Bayesian estimation, and multiple imputation. Given its structural equation
modeling roots, Mplus’s computational machinery is primarily multivariate in nature,
although it does use or allow for factored regression specifications in some situations.
Caution is warranted with incomplete nonlinear effects, because some model specifica-
tions can misrepresent the distributions of incomplete predictors in a way that mimics
bias-­inducing just-­another-­variable and reverse random coefficient approaches; multi-
level models with incomplete random slope predictors are an example (Enders, Hayes,
et al., 2018). The most recent version 8.6 of the software features new factored regression
specifications for latent variable interactions that mimic those described in Section 10.8
(Asparouhov & Muthén, 2021a), but no information is available on the performance of
these methods with missing data.
Turning to free software options, Blimp (Keller & Enders, 2021) is an all-­purpose
data analysis and latent variable modeling program that harnesses the flexibility of
Bayesian estimation in a user-­friendly application that requires minimal scripting and
no deep-level knowledge about Bayes. The application, which is available for macOS,
Windows, and Linux, was developed with funding from the Institute of Education Sciences
awards R305D150056 (myself and Roy Levy) and R305D190002 (myself, Brian
Keller, and Han Du). The software began as a platform for implementing multilevel mul-
tiple imputation via fully conditional specification (Enders, Keller, et al., 2018), and its
second release transitioned the software to a full-­featured multilevel analysis package
(Enders et al., 2020). Blimp 3 introduces wide ranging and powerful capabilities for
multivariate analyses with latent variables (e.g., path model and structural equation
models), and models can include up to three levels with mixtures of binary, ordinal,
multicategorical nominal, normal, skewed continuous, and count variables. Blimp’s esti-
mation architecture is built entirely on factored regressions, and I used this software for
all Bayesian and multiple imputation examples in this book.
Not surprisingly, the full gamut of missing data-­handling options is available on
the R platform (R Core Team, 2021). For multivariate specifications with maximum
likelihood estimation, the structural equation modeling package lavaan (Rosseel, 2012;
Rosseel, Jorgensen, & ­Rockwood, 2021) and the accompanying semTools (Jorgensen et
al., 2021) package are good options, and the mdmb package (Robitzsch & Lüdtke, 2021)
offers a factored regression maximum likelihood estimator for single-­level regression
models (Lüdtke et al., 2020a).
Multiple imputation options abound in R. Packages that implement agnostic imputa-
tion include MICE (van Buuren et al., 2021; van Buuren & Groothuis-­Oudshoorn, 2011),
pan (Grund, Lüdtke, & Robitzsch, 2016b; Schafer, 2018), jomo (Quartagno & Carpenter,
2016, 2020), and Amelia (Honaker, King, & Blackwell, 2021). Model-based (including
substantive model-­compatible) multiple imputation routines based on factored regres-
sion specifications are available in mdmb (Grund, Lüdtke, et al., 2021; Robitzsch &
Lüdtke, 2021), smcfcs (Bartlett, Keogh, & Bonneville, 2021), and jomo (Quartagno &
Carpenter, 2016, 2020). Regardless of where the imputations originate, the mitml pack-
age (Grund, Robitzsch, et al., 2021) is a comprehensive toolkit for pooling and signifi-
cance testing, as is the aforementioned semTools package.
Finally, several R packages can implement factored regression specifications with
missing data in the Bayesian framework. For example, the rstan (Guo, Gabry, Goodrich,
& Weber, 2020) and rjags (Plummer, 2019) packages provide an R interface to the spe-
cialized Bayesian analysis programs Stan and JAGS, respectively. These options, while
very flexible, generally require a high degree of familiarity with the finer details of
Bayesian estimation. Other packages such as brms (Bürkner, 2021) and blavaan (Merkle,
Fitzsimmons, Uanhoro, & Goodrich, 2020; Merkle & Rosseel, 2018; Merkle, Rosseel,
Goodrich, & Garnier-­Villarreal, 2021) provide user-­friendly interfaces to these compu-
tational engines. The aforementioned mdmb package (Robitzsch & Lüdtke, 2021) also
offers Bayesian estimation for single-­level and multilevel regression models with miss-
ing data.

11.4 REPORTING RESULTS FROM A MISSING DATA ANALYSIS

A consistent cross-­disciplinary finding in the literature is that missing data-­reporting
practices leave much to be desired (Bodner, 2006; Jeličić et al., 2009; Karahalios,
Baglietto, Carlin, English, & Simpson, 2012; Klebanoff & Cole, 2008; Peugh & Enders,
2004; Sterne et al., 2009; Wood et al., 2004). In fact, numerous published resources offer
specific recommendations about what to report in a research paper. For example, the
American Psychological Association's Task Force on Statistical Inference (Wilkinson
and Task Force on Statistical Inference, 1999) suggests the following:

Before presenting results, report complications, protocol violations, and other unanticipated
events in data collection. These include missing data, attrition, and nonresponse. Discuss
analytic techniques devised to ameliorate these problems. Describe nonrepresentativeness
statistically by reporting patterns and distributions of missing data and contaminations.
Document how the actual analysis differs from the analysis planned before complications
arose. The use of techniques to ensure that the reported results are not produced by anoma-
lies in the data (e.g., outliers, points of high influence, nonrandom missing data, selection
bias, attrition problems) should be a standard component of all analyses. (p. 597)

A similar set of recommendations appears in Sterne et al. (2009, p. 5, Box 2).

• Report the number of missing values for each variable of interest, or the number of cases
with complete data for each important component of the analysis. Give reasons for miss-
ing values if possible, and indicate how many individuals were excluded because of miss-
ing data when reporting the flow of participants through the study. If possible, describe
reasons for missing data in terms of other variables (rather than just reporting a universal
reason such as treatment failure).
• Clarify whether there are important differences between individuals with complete and
incomplete data—for example, by providing a table comparing the distributions of key
exposure and outcome variables in these different groups.
• Describe the type of analysis used to account for missing data (e.g., multiple imputation),
and the assumptions that were made (e.g., missing at random).

Reporting Guidelines
To facilitate better reporting practices, the checklist below compiles rec-
ommendations from a variety of sources in the literature (Burton & Altman, 2004;
Manly & Wells, 2015; Sterne et al., 2009; Sterner, 2011; van Buuren, 2012, pp. 254–255;
­Vandenbroucke et al., 2007). Of course, length requirements limit what can be reported
in the body of a journal article, but online supplemental documents have no such restric-
tions and are an ideal repository for specific information. The remainder of this section
discusses each recommendation and provides illustrative templates.

1. Report rates of missing data for primary analysis variables.
2. Discuss distributional assumptions and steps taken to evaluate their impact on
the analysis results.
• Sensitivity to normal distributional assumptions
• Comparisons of observed and imputed data
3. Discuss the assumed missing data process.
• List specific reasons for missing data, if identifiable.
• Speculate about whether a MNAR process is plausible.
4. Describe auxiliary variables and the process used to identify them.
• Correlates or predictors of nonresponse, comparisons of individuals with and
without missing data
• Partial or bivariate correlations with incomplete analysis variables
5. Report missing data-­handling method(s).
• Adequacy for conducting the intended complete-­data analysis
• Special design features and nonlinear effects (e.g., multilevel data structures,
random effects, interactive or polynomial terms)
6. Discuss the software tool(s) used to analyze data.
• Software and version
• Specific commands or procedures used
• Algorithmic settings
• Convergence diagnostics
• Auxiliary variable specification
7. Describe sensitivity analysis results.
• Impact of normal distribution or other assumptions
• Impact of prior distributions for covariance matrices
• Impact of an MNAR processes
8. Report details specific to Bayesian estimation and multiple imputation.
• Sensitivity to different prior distributions
• Variables used in the imputation phase
• Number of imputations
• Algorithmic details (number of burn-in iterations, parallel versus sequential
chains, total number of iterations)
• Convergence diagnostics (potential scale reduction factor)
• Special pooling considerations (transformations prior to pooling)

Recommendation 1: Missing Data Rates


Although the recommendation to report missing data rates seems obvious, methodolog-
ical literature reviews show that authors often fail to mention missing data at all (e.g.,
the presence of missing values must be inferred from discrepant degrees of freedom values
across analyses). In addition to describing the range of missing values across key study
variables in the body of a research report, providing a covariance coverage matrix that
gives the proportion of observed values for each variable (on the matrix diagonal) or
variable pair (below the diagonal) can be useful, as this information speaks to the preci-
sion of the bivariate associations. To illustrate, Table 11.1 shows the coverage matrix for
the descriptive analysis presented in Section 10.2. An online supplemental document is
an ideal location for this information.

TABLE 11.1. Observed Data Proportions for Each Variable or Variable Pair
(Covariance Coverage)
Variable 1 2 3 4 5 6 7 8 9
1. AGE 1.00
2. WORKHRS 1.00 1.00
3. EXERCISE .88 .88 .88
4. ANXIETY .98 .98 .87 .98
5. STRESS .93 .93 .82 .91 .93
6. CONTROL .95 .95 .83 .93 .88 .95
7. INTERFERE 1.00 1.00 .88 .98 .93 .95 1.00
8. DEPRESS 1.00 1.00 .88 .98 .93 .95 1.00 1.00
9. DISABILITY .90 .90 .78 .88 .82 .84 .90 .90 .90
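A coverage matrix like Table 11.1 is straightforward to compute directly from the raw data: code each value as observed (1) or missing (0), then take the proportion of cases with both members of each variable pair observed. A minimal numpy sketch (the data layout is assumed, not taken from the book's example files):

```python
import numpy as np

def coverage_matrix(data):
    """Proportion of jointly observed values for each variable (diagonal)
    and each variable pair (off-diagonals), given an n x p data matrix
    with missing values coded as np.nan."""
    observed = (~np.isnan(data)).astype(float)  # n x p observed-data indicators
    return (observed.T @ observed) / data.shape[0]
```

Reporting the lower triangle of this matrix, as in Table 11.1, conveys how much information supports each bivariate association.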

Recommendation 2: Distributional Assumptions


Missing data handling inevitably requires distributional assumptions for the incomplete
variables. It is rarely necessary to foist inappropriate functions on discrete variables, as
software tools now offer sophisticated estimation and imputation routines for binary,
ordinal, multicategorical nominal, and count variables. However, numerical variables
require attention, as maximum likelihood, Bayesian estimation, and multiple imputa-
tion all leverage the normal distribution, albeit in different ways. Mardia’s index of kur-
tosis is available for missing data (Yuan, Lambert, & Fouladi, 2004), and complete-­case
estimates of skewness and kurtosis are better than nothing at all.
Bayesian estimation and multiple imputation are particularly useful for evaluating
the impact of non-­normality, because they produce explicit estimates of the missing
values. Graphing imputations next to the observed data can provide a window into an
estimator’s inner machinery, as severe misspecifications can produce large numbers of
out-of-range or implausible values (e.g., negative imputes for a strictly positive variable).
This concern is particularly germane to skewed variables with very large missing data
rates. If such problems do arise, it is straightforward to perform a sensitivity analysis
that examines whether key estimates or conclusions change when sampling imputations
from a skewed distribution that better matches the observed data’s shape (e.g., the Yeo–­
Johnson transformation, predictive mean matching). A side-by-side comparison of two
or more sets of analysis results can be presented in an online supplemental document,
with any discrepant findings noted in the main body of the text. Sections 5.5, 10.3, and 10.4
provide information on this topic.
To illustrate this recommendation, reconsider the analysis example from Section
10.3. To refresh, the analysis model was a logistic regression with age, educational
attainment, gender, and age at first alcohol use predicting a binary measure of drinking
frequency. The following vignette illustrates how to report a simple sensitivity analysis
that considers an alternative distributional assumption for the age at first alcohol use
variable.

The age at first alcohol use variable exhibited substantial positive skewness and excess kurtosis;
the complete-­case estimates were 1.82 and 8.12, respectively. Inspecting histograms of the
observed and imputed data revealed that sampling imputations from a normal distribu-
tion produced values well below 10, the lowest reported age; the lowest imputation was
approximately 0.30, and about 2% of all imputes were less than 10. To investigate the prac-
tical impact of the normality assumption, we performed a sensitivity analysis that instead
sampled skewed imputations from a Yeo–­Johnson distribution (Lüdtke, Robitzsch, & West,
2020b; Yeo & Johnson, 2000). The resulting imputations followed a positively skewed dis-
tribution that better resembled the shape of the observed data, ranging from 5.89 to 42.83.
Altering the distribution of the missing values increased the variable’s slope coefficient by
nearly half a standard error unit, but its sign and significance test were unaffected. From
this, we can conclude that our main conclusions were stable across different assumptions
about the missing data distribution.

Maximum likelihood estimation is a bit more of a black box when it comes to distributional
assumptions, as it too can assume that missing values extend into an implausible
range without ever producing explicit evidence of its assumptions. Moreover,
appropriate transformations are more difficult to implement in this context, because
software packages that estimate the necessary shape parameters (e.g., for a Box–Cox
or Yeo–­Johnson transformation) generally require complete data. Finally, discrepancies
between normal-­theory and robust (sandwich estimator) standard errors can signal a
model misspecification, but what constitutes a discrepancy is somewhat subjective.

Recommendation 3: Missing Data Process


With few exceptions, a conditionally MAR mechanism should be the default assumption
about missingness, as there is rarely a good justification for using methods that assume
a purely unsystematic process (e.g., deletion methods). Because the phrase “missing at
random” is often misunderstood as haphazard missingness, adding a brief definition of
this process and adopting the phrase “conditionally missing at random” (Graham, 2009)
can add clarity. To illustrate this suggestion, consider a school-­based study of math
achievement (e.g., the analysis example presented in Section 7.3). The following passage
provides a template for reporting this assumption: “We used missing data-­handling pro-
cedures that assume a conditionally MAR process where a student’s unseen data values
are unrelated to missingness after controlling for his or her observed data.”
Because there is no way to verify or test the MAR mechanism’s key propositions,
any evidence in support of this assumption must come from expert judgment and
knowledge about data collection. To this end, listing specific reasons for missing data
(if known) provides important information about the presumed missingness process.
Additionally, speculating about the plausibility of an MNAR process is useful, as such a
discussion can frame a sensitivity analysis that deploys selection and/or pattern mixture
models. Continuing with the school-­based study, the following passage illustrates this
recommendation.

In many cases, the research team was able to ascertain that test scores were missing because
a student moved to another district. As student mobility often correlates with sociodemographic
factors like family income, we conditioned on this characteristic by introducing
free or reduced-price lunch eligibility as a proxy auxiliary variable. In cases where the
reasons for missingness were unknown, it is hypothetically plausible that the unobserved test
scores carry information about missingness (e.g., low-achieving students opt out of testing,
students skip exam questions because they do not possess adequate knowledge to formulate
a response). The online supplemental document presents the results from a sensitivity
analysis that considered this MNAR process.

Recommendation 4: Auxiliary Variables


Related to the previous recommendation, authors should report information about
additional auxiliary variables and the methods used to select those variables, as condi-
tioning on additional data can make the MAR assumption more plausible. This infor-
mation could include comparisons of individuals with and without missing data, as
distributional differences potentially signal the need to condition on variables that
aren’t already part of the main analysis plan. However, bivariate associations should
be the primary basis for selecting auxiliary variables, as any correlates of missingness
can only induce nonresponse bias if they also have salient semipartial (residual) cor-
relations with the incomplete variables. Sections 1.4 through 1.6 provide additional
information.
To illustrate this reporting recommendation, reconsider the chronic pain data and
the moderated regression analysis from Sections 3.8, 5.4, and 7.11. To refresh, the analy-
sis model was a moderated regression with pain severity, gender, depression, and the
gender-­by-­depression interaction predicting psychosocial disability (see Equation 7.26).
The following passage provides a template for discussing auxiliary variables:

For each incomplete variable, we performed comparisons of individuals with and with-
out missing data, and we flagged any variables that produced a standardized mean differ-
ence larger than Cohen’s (1988) small effect size benchmark of ± 0.20. These comparisons
revealed that people with missing disability scores were younger (d = –0.30); participants
without missing depression scores were more anxious (d = 0.33); and people with miss-
ing pain ratings exercised more frequently (d = 0.30), exhibited higher anxiety (d = 0.43),
and reported more stress (d = 0.24). Collectively, these comparisons rule out an MCAR
process, and they potentially signal the need to condition on one or more of these additional
variables if their semipartial (i.e., residual) correlations with the analysis variables exceed
approximately ± .30 (Collins et al., 2001). Based on their strong semipartial correlations
with one or more analysis variables, we designated anxiety, stress, and pain interference
with daily life as auxiliary variables; pain interference did not predict missingness but con-
ditioning on this variable could improve power.
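The group comparisons in this template are ordinary two-group standardized mean differences computed on a missingness indicator. As a minimal illustration (plain Python; missing values coded as None, and the function name is hypothetical):

```python
import math

def missingness_d(analysis_var, candidate):
    """Cohen's d on a candidate auxiliary variable, comparing cases
    with vs. without data on an analysis variable (missing = None)."""
    grp_mis = [c for a, c in zip(analysis_var, candidate) if a is None]
    grp_obs = [c for a, c in zip(analysis_var, candidate) if a is not None]

    def mean(x):
        return sum(x) / len(x)

    def ss(x):  # sum of squared deviations from the group mean
        m = mean(x)
        return sum((v - m) ** 2 for v in x)

    # pooled standard deviation across the two missingness groups
    pooled_sd = math.sqrt((ss(grp_mis) + ss(grp_obs))
                          / (len(grp_mis) + len(grp_obs) - 2))
    return (mean(grp_mis) - mean(grp_obs)) / pooled_sd
```

Variables flagged by this screen would then be vetted against the semipartial correlation benchmark described in the passage above.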

If journal space is a limiting factor, a lengthier description of auxiliary variables can be
relegated to an online supplement, and a more succinct description like the one below
can appear in the body of text:

On the basis of their salient semipartial (residual) correlations with the incomplete analysis
variables, we designated anxiety, stress, and pain interference with daily life as auxiliary
variables. The online supplemental document describes the variable selection process,
including comparisons of individuals with and without missing data.

Recommendation 5: Missing Data‑Handling Methods


It goes without saying that research reports should describe their analytic methods, and
such descriptions should also provide a brief justification for the missing data-­handling
procedure(s) that conveys an appropriate distribution specification (see Section 11.2). To
illustrate this recommendation, consider a paper in which the researchers used linear
regression models for their primary analyses. A justification for the analytic method
might look like this: “We used FIML estimation based on the multivariate normal distri-
bution, because the additive regression models featured incomplete variables that were
numeric and approximately symmetric.” If the regression models featured incomplete
multicategorical predictors, a simple justification could be as follows: “We used multiple
imputation to treat missing data prior to fitting the regression models, because this
procedure readily accommodates mixtures of numerical and categorical missing val-
ues.” Finally, the following statement would be appropriate for a regression model with
an interaction effect: "Because classic missing data-handling methods based on multivariate
normality are known to introduce bias when applied to regression models with
interactive effects, we used model-based multiple imputation with a factored regression
(sequential) specification.”

Recommendation 6: Software Tools and Implementation Details


When performing missing data analyses, it is especially important to report specific
details about the software tools used, as missing data-­handling capabilities vary wildly
across data analysis programs (and even within the same program). This information
should include the specific procedure, package, or subroutine used, as well as its version
number. Because software packages often deploy inappropriate default settings (e.g., the
SPSS multiple imputation procedure uses a woefully inadequate burn-in period of only
five iterations), authors should also report algorithmic details and convergence diagnos-
tics. This recommendation is particularly germane to Bayesian estimation and multiple
imputation, where convergence diagnostics such as the potential scale reduction factor
should determine these algorithmic settings (see Section 4.9).
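The diagnostic itself is straightforward to compute from saved chains. A plain-Python sketch of the basic (unsplit) Gelman–Rubin statistic, offered here only to make the quantity concrete:

```python
def psrf(chains):
    """Potential scale reduction factor (Gelman & Rubin, 1992) for a
    single parameter, given two or more equal-length chains of
    post-burn-in MCMC draws. Values near 1.0 suggest convergence."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    b = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)  # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m                   # within-chain
    var_plus = (n - 1) / n * w + b / n  # pooled posterior variance estimate
    return (var_plus / w) ** 0.5
```

Chains stuck in different regions of the parameter space inflate the between-chain component and push the factor well above 1, signaling that a longer burn-in period is needed.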
The passage below provides a template for describing the algorithmic details for
multiple imputation.

We used fully conditional specification multiple imputation in Blimp 3 (Keller & Enders,
2021) to treat missing values. Potential scale reduction factor convergence diagnostics
(­Gelman & Rubin, 1992) from a preliminary run indicated that a burn-in period of 2,000
iterations was sufficiently conservative. Based on this information, we created 100 imputed
data sets by saving the filled-­in data from the final iteration of 100 MCMC chains, each
with random starting values. The imputation model included the analysis variables as well
as three additional auxiliary variables. We then used the R packages lme4 (version
1.1-27.1; Bates et al., 2021) and mitml (version 0.4-1; Grund, Robitzsch, et al., 2021) to fit
the analysis models and pool the resulting parameter estimates and standard errors (Rubin,
1987).
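The pooling step that mitml automates follows Rubin's (1987) rules, which are simple enough to state in code. A plain-Python sketch for a single parameter (function name illustrative):

```python
import math

def rubin_pool(estimates, std_errors):
    """Pool one parameter across M imputed data sets with Rubin's (1987)
    rules: average the point estimates, and combine within- and
    between-imputation variability into a single standard error."""
    m = len(estimates)
    qbar = sum(estimates) / m                              # pooled estimate
    ubar = sum(se ** 2 for se in std_errors) / m           # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total_var = ubar + (1 + 1 / m) * b
    return qbar, math.sqrt(total_var)
```

The between-imputation component is what propagates missing data uncertainty into the pooled standard error; with complete data it collapses to zero.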

The following passage similarly describes the algorithmic details for a Bayesian analysis:

We used Bayesian estimation in Blimp 3 (Keller & Enders, 2021) to treat missing values,
and we used a factored regression (sequential) specification to incorporate three auxiliary
variables. Potential scale reduction factor convergence diagnostics (Gelman & Rubin, 1992)
from a preliminary run indicated that a burn-in period of 2,000 iterations was sufficiently
conservative. Based on this information, we used eight MCMC chains with random starting
values to generate posterior summaries consisting of 10,000 estimates following the initial
burn-in period. We verified this setting was sufficient by examining the effective number of
independent MCMC samples for each parameter, all of which were greater than the recom-
mended value of 100 (Gelman et al., 2014, p. 287).

Finally, maximum likelihood estimation routines are typically more of a black box,
offering fewer tweakable settings. The following passage illustrates how to describe this
approach:

We used the FIML estimator in Mplus 8.6 (Muthén & Muthén, 1998–2017) with robust
standard errors (i.e., the MLR estimator), and we used Graham’s (2003) saturated correlates
approach to incorporate three additional auxiliary variables. For each of the primary analy-
ses, we fit the model using 10 sets of random starting values, all of which achieved the same
final solution.

Recommendation 7: Sensitivity Analyses


A sensitivity analysis is “one in which several statistical models are considered simulta-
neously and/or one in which a statistical model is further scrutinized using specialized
tools (such as diagnostic measures)” (Beunckens et al., 2007, p. 477). As noted earlier,
one such application involves exploring whether distributional assumptions influence
one’s substantive conclusions (see Recommendation 2 and information in Sections
10.3 and 10.4). Another example applies to Bayesian analyses or multiple imputation,
where sensitivity analyses are useful for gauging the impact of the prior distribution
on covariance matrix estimates. This is particularly germane to the between-­cluster
covariance matrix from a multilevel model, and the analysis examples in Sections 8.2
through 8.4 illustrate this application. Perhaps the most important use of a sensitivity
analysis is to explore alternative missing data processes. Such an application involves
conjecturing about plausible MNAR mechanisms and specifying one or more selection
or pattern mixture models that align with those processes. An online supplemental
document can present side-by-side comparisons of two or more sets of analysis results
(see examples in Sections 9.6, 9.8, and 9.13), with any discrepant findings noted in
the main body of the text.
To illustrate how to report a sensitivity analysis, reconsider the examples from Sec-
tion 9.6, which applied a series of selection models to a multiple regression analysis with
an outcome that could be MNAR (illness severity ratings from a schizophrenia clinical
trial). The passage below provides a template for describing the results in a published
paper:

We performed a sensitivity analysis to examine whether assuming an MNAR process influenced
the main findings. Using the selection modeling framework, we considered a focused
MNAR mechanism (Gomer & Yuan, 2021) where the dependent variable alone predicted
missingness (e.g., participants with the highest illness severity ratings are more likely to
quit the study), and we also considered diffuse processes with additional determinants of
missingness beyond the dependent variable (e.g., treatment group membership, gender).
The sensitivity analysis produced noticeable differences in some key parameters; rela-
tive to a conditionally MAR analysis, the intercept (placebo group average) was lower by
nearly nine-­tenths of a standard error unit, and the treatment group difference was smaller
by about one-third of a standard error. However, the direction of all coefficients and the con-
clusions about statistical significance were unaffected by the missing data process. Impor-
tantly, there is no way to determine which analysis is more correct, as the results reflect
different, plausible assumptions about the missing data process. The online supplemental
document describes the sensitivity analysis results in more detail.

As a second illustration, reconsider the analysis examples from Section 9.8, which
applied pattern mixture models to a multiple regression from the same psychiatric trial
data. The passage below provides a template for summarizing those results in a paper:

We performed a sensitivity analysis to examine whether assuming an MNAR process influenced
the main findings. Using the pattern mixture modeling framework, we considered
a focused MNAR mechanism (Gomer & Yuan, 2021) where participants with missing out-
come scores had a different mean (regression intercept) than people with complete data,
and we also considered a diffuse process where the treatment condition slope coefficient
differed between missing data patterns. In both cases, we varied the strength and direction
of the MNAR process across a range of plausible standardized effect sizes.
The sensitivity analysis revealed that decisions about statistical significance tests
were consistent, even with group differences as large as half a standard deviation unit.
These analyses further revealed that participants with missing data would need to differ
by more than ±0.30 standard deviation units to change the intercept coefficient (placebo
group mean) by at least half a standard error unit (an amount we judge to be practically sig-
nificant). Even larger differences of ±0.50 standard deviation units were needed to change
the slope coefficient by a similar amount. Importantly, there is no way to determine which
analysis is more correct, as the results reflect different, plausible assumptions about the
missing data process. The online supplemental document describes the sensitivity analysis
results in more detail.
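One concrete way to operationalize the kind of pattern mixture sensitivity analysis described above is a delta adjustment, in which imputed values for nonrespondents are shifted by a hypothesized effect size before reanalysis. A minimal sketch (the function and argument names are illustrative only):

```python
def delta_adjusted_means(imputed, observed, deltas, sd):
    """Delta-adjustment sensitivity check for a pattern mixture
    analysis: shift the imputed values by delta standard deviation
    units and recompute the overall mean at each value of delta."""
    results = {}
    for delta in deltas:
        shifted = [v + delta * sd for v in imputed]  # MNAR shift for nonrespondents
        pooled = observed + shifted
        results[delta] = sum(pooled) / len(pooled)
    return results
```

Sweeping delta across a grid of plausible values (e.g., ±0.30 and ±0.50 standard deviation units, as in the vignette) shows how far the assumed MNAR difference must go before key estimates change materially.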

As a final illustration, reconsider the analysis examples from Section 9.13, which
applied selection and pattern mixture models to a longitudinal growth curve. The pas-
sage below provides a template for summarizing those results in a paper:

We performed a sensitivity analysis to examine whether assuming an MNAR process
influenced the main findings. We considered three alternatives: (1) a selection model for
outcome-­dependent missingness where a participant’s dependent variable score at a par-
ticular measurement occasion predicts concurrent nonresponse (Diggle & Kenward, 1994),
(2) a shared parameter model for random coefficient-­dependent missingness where one’s
underlying growth trajectory is responsible for missing data (Albert & Follmann, 2009; Wu
& Carroll, 1988), and (3) a random coefficient pattern mixture model where completers and
dropouts form qualitatively different subgroups with distinct growth trajectories (Hedeker
& Gibbons, 1997).
The Diggle–Kenward selection model and the Hedeker–Gibbons pattern mixture
model produced nontrivial differences in some key parameters. Both analyses suggested
a flatter (less negative) trajectory for the placebo group and a steeper decline for the medi-
cation condition, with changes to the parameters as large as one standard error unit in
some cases (an amount we judge to be practically significant). Considered as a whole, the
sensitivity analysis results suggest that an MNAR process is quite plausible for these data.
Importantly, there is no way to determine which model is more correct, as the results reflect
different, plausible assumptions about the missing data process. The online supplemental
document describes the sensitivity analysis results in more detail.

Recommendation 8: Bayesian Estimation and Multiple Imputation


Finally, Bayesian estimation and multiple imputation need specific recommendations,
because these procedures generally require intervention and tweaking on the part of
the researcher. For Bayesian analyses, reporting algorithmic details such as the number
of burn-in and total iterations is important, as is information about convergence; as
noted elsewhere, algorithmic settings are analysis-­specific and can only be determined
after examining diagnostics such as trace plots and potential scale reduction factors
(see Section 4.9). When applying multiple imputation, researchers should additionally
report the variables used in the imputation phase, the number of imputations, and any
specific pooling considerations (e.g., transformations prior to pooling). The templates
from Recommendation 6 show how to report this information, and interested readers
can find additional reporting guidelines for multiple imputation in the literature (Manly
& Wells, 2015; Sterne et al., 2009; van Ginkel, Linting, Rippe, & van der Voort, 2020).

11.5 FINAL THOUGHTS AND RECOMMENDED READINGS

Missing data methodology has evolved considerably since the first edition of this book
was published in 2010. Rewinding more than a decade, researchers primarily had to rely
on techniques that assume a multivariate normal distribution. Major innovations since
that time include missing data-­handling methods for mixtures of discrete and numeri-
cal variables, non-­normal data, multilevel data, models with interactive or nonlinear
effects, and factored regression specifications, to name a few. At this point in history,
elegant missing data solutions exist for most analyses that researchers use in their day-
to-day practice, and there is no shortage of capable software tools. Methodologies from
the first edition of this book now enjoy widespread use in published research articles,
and I hope this second edition contributes to the uptake of new and improved meth-
odologies. Finally, I recommend the following articles for readers who want additional
details on the reporting recommendations offered in this chapter:

Manly, C. A., & Wells, R. S. (2015). Reporting the use of multiple imputation for missing data in
higher education research. Research in Higher Education, 56, 397–409.

Nicholson, J. S., Deboeck, P. R., & Howard, W. (2017). Attrition in developmental psychology: A
review of modern missing data reporting and practices. International Journal of Behavioral
Development, 41, 143–153.

Sterne, J. A., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., . . . Carpenter,
J. R. (2009). Multiple imputation for missing data in epidemiological and clinical research:
Potential and pitfalls. British Medical Journal, 338, Article b2393.
Appendix

Data Set Descriptions

alcoholuse.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
AGE Age in years 0 Numerical (12 to 85)
ETHNIC Ethnicity 0 1 = Non-Hispanic/Black, 2 =
Hispanic, 3 = Black
COLLEGE College education dummy code 27.9 0 = High school or less, 1 =
Some college or more
AGETRYCIG Age first tried cigarettes 68.7 Numerical (10 to 48)
AGETRYALC Age first tried alcohol 32.9 Numerical (10 to 47)
ALCDAYS Drinking days per month 9.7 Count (0 to 30)
CIGDAYS Smoking days per month 13.6 Count (0 to 30)
DRINKER Alcohol use frequency classification 9.7 0 = Less than weekly, 1 = At least once per week

diary.dat
Name Definition Missing % Scale
PERSON Individual identifier 0 Integer index
DAY Day identifier 0 Integer index (0 to 20)
PAIN Pain rating composite 3.9 Numerical (1 to 10)
SLEEP Sleep rating composite 8.9 Numerical (0 to 10)
POSAFF Positive affect composite 13.4 Numerical (1 to 7)
NEGAFF Negative affect composite 13.3 Numerical (1 to 7)
LIFEGOAL Life goal pursuit composite 14.5 Numerical (1 to 7)
FEMALE Gender dummy code 0 0 = Male, 1 = Female
EDUC Education level 4.7 Ordinal (1 to 7)
DIAGNOSES Number of diagnosed ailments 7.1 Numerical (1 to 8)
ACTIVITY Activity level composite 12.4 Numerical (0 to 5)
PAINACCEPT Pain acceptance composite 2.4 Numerical (0 to 5)
CATASTROPIZE Catastrophizing composite 0 Numerical (0 to 5)
STRESS Stress composite 0 Numerical (0 to 5)

drugtrial.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
DRUG Medication condition dummy code 0 0 = Placebo, 1 = Medication
SEVERITY0 Illness severity at baseline 0.7 Numerical (1 to 7)
SEVERITY1 Illness severity at 1 week 2.5 Numerical (1 to 7)
SEVERITY3 Illness severity at 3 weeks 14.4 Numerical (1 to 7)
SEVERITY6 Illness severity at 6 weeks 23.3 Numerical (1 to 7)
DROPGRP Dropout group 0 1 = Completer, 2 = 3-week
dropout, 3 = 6-week
dropout
EARLYDROP 3-week dropout dummy code 0 1 = 3-week dropout, 0 =
Completer or 6-week
dropout
LATEDROP 6-week dropout dummy code 0 1 = 6-week dropout, 0 =
Completer or 3-week
dropout
DROPOUT Dropout indicator 0 0 = Completer, 1 = Dropout
SDROPOUT3 3-week survival dropout indicator 0 0 = Completer, 1 = Dropout
SDROPOUT6 6-week survival dropout indicator 11 0 = Completer, 1 = Dropout

drugtrial2level.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
DRUG Medication condition dummy code 0 0 = Placebo, 1 = Medication
SEVERITY Illness severity 10.2 Numerical (1 to 7)
WEEK Time scores 0 Numerical (0, 1, 3, 6)
DROPGRP Dropout group 0 0 = Completer, 1 = 3-week
dropout, 2 = 6-week
dropout
EARLYDROP 3-week dropout dummy code 0 1 = 3-week dropout, 0 =
Completer or 6-week
dropout
LATEDROP 6-week dropout dummy code 0 1 = 6-week dropout, 0 =
Completer or 3-week
dropout
DROPOUT Dropout indicator 0 0 = Completer, 1 = Dropout
SDROPOUT Survival dropout indicator 2.7 0 = Completer, 1 = Dropout

eatingrisk.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
BMI Body mass index 11.5 Numerical (17.39 to 28.27)

Drive for Thinness Questionnaire items

DRIVE1 Am terrified about being overweight. 5 Ordinal (1 to 6)
DRIVE2 Avoid eating when I am hungry. 11.5 Ordinal (1 to 6)
DRIVE3 Feel extremely guilty after eating. 0.5 Ordinal (1 to 6)
DRIVE4 Am preoccupied with a desire to be thinner. 8 Ordinal (1 to 6)
DRIVE5 Think about burning up calories when I exercise. 0.5 Ordinal (1 to 6)
DRIVE6 Am preoccupied with . . . fat on my body. 9 Ordinal (1 to 6)
DRIVE7 Like my stomach to be empty. 0 Ordinal (1 to 6)
Dieting Behavior Questionnaire items

DIETING1 Aware of the calorie content of foods that I eat. 3.5 Ordinal (1 to 6)
DIETING2 Particularly avoid food with a high carbohydrate. 0 Ordinal (1 to 6)
DIETING3 Avoid foods with sugar in them. 8 Ordinal (1 to 6)
DIETING4 Eat diet foods. 0.5 Ordinal (1 to 6)
DIETING5 Engage in dieting behavior. 12.5 Ordinal (1 to 6)

employee.dat
Name Definition Missing % Scale
EMPLOYEE Employee identifier 0 Integer index
TEAM Team identifier 0 Integer index
TURNOVER Intend to quit job in the next 6 months 5.1 0 = No, 1 = Yes
MALE Gender dummy code 0 0 = Female, 1 = Male
EMPOWER Employee empowerment composite 16.2 Numerical (14 to 42)
LMX Leader–member exchange (relationship quality with supervisor) composite 4.1 Numerical (0 to 17)
WORKSAT Work satisfaction rating 4.8 Ordinal (1 to 7)
CLIMATE Leadership climate composite (team-level) 9.5 Numerical (12 to 33)
COHESION Team cohesion composite (team-level) 5.7 Numerical (2 to 10)

math.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
FRLUNCH Lunch assistance dummy code 5.2 0 = None, 1 = Free or reduced-
price lunch
ACHIEVEGRP Achievement classification 2.4 1 = Typically achieving, 2 =
Low achieving, 3 = Learning
disabled
STANREAD Standardized reading 10.4 Numerical (27.2 to 69.2)
EFFICACY Math self-efficacy rating scale 10.0 Ordinal (1 to 6)
ANXIETY Math anxiety composite 8.8 Numerical (0 to 56)
MATHPRE Math achievement pretest 0 Numerical (26 to 76)
MATHPOST Math achievement posttest 16.8 Numerical (35 to 85)

pain.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
TXGRP Treatment group dummy code 0 0 = Waitlist control, 1 =
Treatment
MALE Gender dummy code 0 0 = Female, 1 = Male
AGE Age in years 0 Numerical (19 to 78)
EDUC Highest education 0 1 = Some college or less, 2 =
College, 3 = Post-BA
WORKHRS Work hours per week 12.0 Numerical (0 to 94)
EXERCISE Exercise frequency 1.8 Ordinal (1 to 8)
PAINGRPS Chronic pain intensity rating 7.3 1 = No or little, 2 = Moderate, 3 = Severe
PAIN Severe pain dummy code 7.3 0 = No, little, moderate pain, 1 = Severe pain
ANXIETY Anxiety composite 5.5 Numerical (7 to 26)
STRESS Stress rating 0 Ordinal (1 to 7)
CONTROL Perceived control over pain composite 0 Numerical (6 to 30)
DEPRESS Depression composite 13.5 Numerical (7 to 28)
INTERFERE Pain interference with life composite 10.5 Numerical (6 to 41)
DISABILITY Psychosocial disability composite 9.1 Numerical (10 to 34)
Depression Questionnaire items

DEPRESS1 Couldn’t experience any positive feelings at all. 4.7 Ordinal (1 to 4)
DEPRESS2 Difficult to work up the initiative to do things. 2.2 Ordinal (1 to 4)
DEPRESS3 I felt that I had nothing to look forward to. 1.8 Ordinal (1 to 4)
DEPRESS4 I felt down-hearted and blue. 1.5 Ordinal (1 to 4)
DEPRESS5 Unable to become enthusiastic about anything. 2.2 Ordinal (1 to 4)
DEPRESS6 I felt I wasn’t worth much as a person. 4.0 Ordinal (1 to 4)
DEPRESS7 I felt that life was meaningless. 2.9 Ordinal (1 to 4)

Pain Interference Questionnaire items

INTERFERE1 Gave up enjoyable activities. 1.8 Ordinal (1 to 7)
INTERFERE2 Not able to fulfill responsibilities at home. 1.5 Ordinal (1 to 7)
INTERFERE3 Not able to enjoy your relationships. 1.5 Ordinal (1 to 7)
INTERFERE4 Not able to pursue personal goals. 2.5 Ordinal (1 to 7)
INTERFERE5 Unable to provide basic care for myself. 3.3 Ordinal (1 to 7)
INTERFERE6 Unable to think clearly, concentrate, or remember. 0.7 Ordinal (1 to 7)

Psychosocial Disability Questionnaire items

PSYDISAB1 I isolate myself as much as I can from the family. 3.3 Ordinal (1 to 6)
PSYDISAB2 I am doing fewer social activities. 4.7 Ordinal (1 to 6)
PSYDISAB3 I sometimes behave as if I were confused. 3.6 Ordinal (1 to 6)
PSYDISAB4 I laugh or cry suddenly. 3.6 Ordinal (1 to 6)
PSYDISAB5 I act irritable and impatient with myself. 4.7 Ordinal (1 to 6)
PSYDISAB6 I do not speak clearly when I am under stress. 3.6 Ordinal (1 to 6)

problemsolving2level.dat
Name Definition Missing % Scale
SCHOOL School identifier 0 Integer index
STUDENT Student identifier 0 Integer index
CONDITION Experimental condition 0 0 = Control school, 1 =
Experimental school
TEACHEXP Teacher years of experience 10.8 Numerical (4.3 to 24.6)
ESLPCT % English as second language 0 Numerical (10 to 100)
ETHNIC Ethnicity/race 9.0 1 = White, 2 = Black, 3 =
Hispanic
MALE Gender dummy code 0 0 = female, 1 = male
FRLUNCH Lunch assistance dummy code 4.7 0 = None, 1 = Free or reduced-
price lunch
ACHIEVEGRP Achievement classification 2.1 1 = Typically achieving, 2 =
Low achieving, 3 = Learning
disabled
STANMATH Standardized math 7.4 Numerical (5.3 to 87.8)
EFFICACY1 Math self-efficacy pretest 0 Numerical (0 to 12)
EFFICACY2 Math self-efficacy posttest 20.5 Numerical (0 to 12)
PROBSOLVE1 Math problem-solving pretest 0 Numerical (37 to 66)
PROBSOLVE2 Math problem-solving posttest 20.5 Numerical (37 to 65)

problemsolving3level.dat
Name Definition Missing % Scale
SCHOOL School identifier 0 Integer index
STUDENT Student identifier 0 Integer index
WAVE Monthly wave identifier 0 Integer index (1 to 7)
CONDITION Experimental condition 0 0 = Control school, 1 =
Experimental school
TEACHEXP Teacher years of experience 10.8 Numerical (4.3 to 24.6)
ESLPCT % English as second language 0 Numerical (10 to 100)
ETHNIC Ethnicity/race 9.0 1 = White, 2 = Black, 3 =
Hispanic
MALE Gender dummy code 0 0 = Female, 1 = Male
FRLUNCH Lunch assistance dummy code 4.7 0 = None, 1 = Free or reduced-
price lunch

ACHIEVEGRP Achievement classification 2.1 1 = Typically achieving, 2 =
Low achieving, 3 = Learning
disabled
STANMATH Standardized math 7.4 Numerical (5.3 to 87.8)
MONTH0 Time scores (baseline = 0) 0 Numerical (0 to 6)
MONTH7 Time scores (endpoint = 0) 0 Numerical (–6 to 0)
PROBSOLVE Math problem solving 11.4 Numerical (37 to 68)
EFFICACY Math self-efficacy 11.4 Numerical (0 to 14)

smoking.dat
Name Definition Missing % Scale
ID Participant identifier 0 Integer index
INTENSITY Smoking intensity (cigarettes per day) 21.2 Numerical (2 to 29)
HVYSMOKER Heavy smoking indicator 21.2 0 = 10 or fewer cigarettes per
day, 1 = 11 or more per day
AGE Age at assessment 0 Numerical (18 to 25)
PARSMOKE Parental smoking dummy code 3.6 0 = Nonsmoker, 1 = Smoker
FEMALE Female dummy code 0 0 = Male, 1 = Female
RACE Race categories 6 1 = White, 2 = Black, 3 =
Hispanic, 4 = Other
INCOME Household income 11.4 Ordinal (1 to 20)
EDUC Highest education 5.4 1 = Less than HS, 2 = HS or
some college, 3 = BA or
higher
References

Abrams, K., Ashby, D., & Errington, D. (1994). Simple Bayesian analysis in clinical trials: A tuto-
rial. Controlled Clinical Trials, 15, 349–359.
Agresti, A. (2012). Categorical data analysis (3rd ed.). Hoboken, NJ: Wiley.
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. New-
bury Park, CA: Sage.
Aitchison, J., & Bennett, J. A. (1970). Polychotomous quantal response by maximum indicant.
Biometrika, 57, 253–262.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Auto-
matic Control, 19, 716–723.
Ake, C. F. (2005, April). Rounding after multiple imputation with non-binary categorical covariates.
Paper presented at the SAS Users Group International, Philadelphia, PA.
Alacam, E., Du, H., Enders, C. K., & Keller, B. T. (2022). A model-based approach to treating com-
posite scores with missing items. Manuscript submitted for publication.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data.
Journal of the American Statistical Association, 88, 669–679.
Albert, P. S., & Follmann, D. A. (2009). Shared-parameter models. In G. Fitzmaurice, M. David-
ian, G. Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 433–452). Boca
Raton, FL: Chapman & Hall.
Albert, P. S., Follmann, D. A., Wang, S. A., & Suh, E. B. (2002). A latent autoregressive model for
longitudinal binary data subject to informative missingness. Biometrics, 58, 631–642.
Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.
Allison, P. D. (2005, April). Imputation of categorical variables with PROC MI. Paper presented at
the SAS Users Group International, Philadelphia, PA.
Anderson, D., & Burnham, K. (2004). Model selection and multi-model inference (2nd ed.). New
York: Springer-Verlag.
Anderson, T. W. (1957). Maximum-likelihood estimates for a multivariate normal-distribution
when some observations are missing. Journal of the American Statistical Association, 52,
200–203.
Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychol-
ogy. British Journal of Mathematical and Statistical Psychology, 66, 1–7.
Andridge, R. R. (2011). Quantifying the impact of fixed effects modeling of clusters in multiple
imputation for cluster randomized trials. Biometrical Journal, 53, 57–74.

Arbuckle, J. L. (1996). Full information estimation in the presence of incomplete data. In G. A.
Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling (pp. 243–
278). Mahwah, NJ: Erlbaum.
Arminger, G., & Sobel, M. E. (1990). Pseudo-maximum likelihood estimation of mean and
covariance structures with missing data. Journal of the American Statistical Association, 85,
195–203.
Arnold, B. C., Castillo, E., & Sarabia, J. M. (1999). Conditional specification of statistical models.
New York: Springer.
Arnold, B. C., Castillo, E., & Sarabia, J. M. (2001). Conditionally specified distributions: An
introduction. Statistical Science, 16, 249–274.
Arnold, B. C., & Press, S. J. (1989). Compatible conditional distributions. Journal of the American
Statistical Association, 84, 152–156.
Asparouhov, T., & Muthén, B. (2007, July). Computationally efficient estimation of multilevel high-
dimensional latent variable models. Paper presented at the Proceedings of the Joint Statistical
Meetings, Section on Statistics in Epidemiology, Alexandria, VA.
Asparouhov, T., & Muthén, B. (2010a). Bayesian analysis using Mplus: Technical implementa-
tion. Retrieved from www.statmodel.com/download/bayes3.pdf.
Asparouhov, T., & Muthén, B. (2010b). Chi-square statistics with multiple imputation. Retrieved
from www.statmodel.com/download/mi7.pdf.
Asparouhov, T., & Muthén, B. (2010c). Multiple imputation with Mplus. Retrieved from www.
statmodel.com/download/imputations7.pdf.
Asparouhov, T., & Muthén, B. (2010d). Plausible values for latent variables using Mplus. Retrieved
from www.statmodel.com/download/plausible.pdf.
Asparouhov, T., & Muthén, B. (2021a). Bayesian estimation of single and multilevel models with
latent variable interactions. Structural Equation Modeling: A Multidisciplinary Journal, 28,
314–328.
Asparouhov, T., & Muthén, B. (2021b). Expanding the Bayesian structural equation, multilevel
and mixture models to logit, negative-binomial and nominal variables. Structural Equation
Modeling: A Multidisciplinary Journal, 28, 622–637.
Barnard, J., McCulloch, R., & Meng, X.-L. (2000). Modeling covariance matrices in terms of
standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10,
1281–1311.
Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation.
Biometrika, 86, 948–955.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social
psychological research: Conceptual, strategic, and statistical considerations. Journal of Per-
sonality and Social Psychology, 51, 1173–1182.
Bartlett, J., Keogh, R., & Bonneville, C. T. (2021). Package ‘smcfcs.’ Retrieved from https://cran.r-
project.org/web/packages/smcfcs/smcfcs.pdf.
Bartlett, J. W., Seaman, S. R., White, I. R., & Carpenter, J. R. (2015). Multiple imputation of
covariates by fully conditional specification: Accommodating the substantive model. Statis-
tical Methods in Medical Research, 24, 462–487.
Bates, D., Maechler, M., Bolker, B., Walker, S., Christensen, R. H. B., Singmann, H., . . . Krivitsky,
P. N. (2021). Package ‘lme4.’ Retrieved from https://cran.r-project.org/web/packages/lme4/lme4.
pdf.
Bauer, D. J., & Curran, P. J. (2005). Probing interactions in fixed and multilevel regression: Infer-
ential and graphical techniques. Multivariate Behavioral Research, 40, 373–400.
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in
the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical
Statistics, 41, 164–171.
Beale, E. M. L., & Little, R. J. A. (1975). Missing values in multivariate analysis. Journal of the
Royal Statistical Society B: Statistical Methodology, 37, 129–145.
Belin, T. R., Hu, M.-Y., Young, A. S., & Grusky, O. (1999). Performance of a general location model
with an ignorable missing-data assumption in a multivariate mental health services study.
Statistics in Medicine, 18, 3123–3135.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107,
238–246.
Bentler, P. M. (2000–2008). EQS 6 structural equations program manual. Los Angeles: Multivariate
Software.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588–606.
Bentler, P. M., & Liang, J. (2011). Two-level mean and covariance structures: Maximum likeli-
hood via an EM algorithm. In S. Reise & N. Duan (Eds.), Multilevel modeling: Methodological
advances, issues, and applications (pp. 53–70). Mahwah, NJ: Erlbaum.
Beran, R., & Srivastava, M. S. (1985). Bootstrap tests and confidence-regions for functions of a
covariance matrix. Annals of Statistics, 13, 95–115.
Bernaards, C. A., Belin, T. R., & Schafer, J. L. (2007). Robustness of a multivariate normal approx-
imation for imputation of incomplete binary data. Statistics in Medicine, 26, 1368–1382.
Beunckens, C., Molenberghs, G., Thijs, H., & Verbeke, G. (2007). Incomplete hierarchical data.
Statistical Methods in Medical Research, 16, 457–492.
Beunckens, C., Molenberghs, G., Verbeke, G., & Mallinckrodt, C. (2008). A latent-class mixture
model for incomplete longitudinal Gaussian data. Biometrics, 64, 96–105.
Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical
theories of mental test scores (pp. 395–479). Menlo Park, CA: Addison-Wesley.
Black, A. C., Harel, O., & McCoach, D. B. (2011). Missing data techniques for multilevel data:
Implications of model misspecification. Journal of Applied Statistics, 38, 1845–1865.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:
Application of an EM algorithm. Psychometrika, 46, 443–459.
Bodner, T. E. (2006). Missing data: Prevalence and reporting practices. Psychological Reports, 99,
675–680.
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equa-
tion Modeling: A Multidisciplinary Journal, 15, 651–675.
Bohrnstedt, G. W., & Goldberger, A. S. (1969). On the exact covariance of products of random
variables. Journal of the American Statistical Association, 64, 1439–1442.
Bojinov, I. I., Pillai, N. S., & Rubin, D. B. (2020). Diagnosing missing always at random in multi-
variate data. Biometrika, 107, 246–253.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K. A., & Curran, P. J. (2005). Latent curve models. Hoboken, NJ: Wiley.
Bollen, K. A., & Stine, R. A. (1992). Bootstrapping goodness-of-fit measures in structural equa-
tion models. Sociological Methods and Research, 21, 205–229.
Box, G. E., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical
Society B: Statistical Methodology, 26, 211–243.
Brandmaier, A. M., Ghisletta, P., & von Oertzen, T. (2020). Optimal planned missing data design
for linear latent growth curve models. Behavior Research Methods, 52, 1445–1458.
Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative
simulations. Journal of Computational and Graphical Statistics, 7, 434–455.
Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance
structures. British Journal of Mathematical and Statistical Psychology, 37, 62–83.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Meth-
ods and Research, 21, 230–258.
Browne, W. J. (1998). Applying MCMC methods to multi-level models. (PhD thesis), University of
Bath, United Kingdom.
Browne, W. J. (2006). MCMC algorithms for constrained variance matrices. Computational Sta-
tistics and Data Analysis, 50, 1655–1677.
Browne, W. J., & Draper, D. (2000). Implementation and performance issues in the Bayesian and
likelihood fitting of multilevel models. Computational Statistics, 15, 391–420.
Buck, S. F. (1960). A method of estimation of missing values in multivariate data suitable for use
with an electronic computer. Journal of the Royal Statistical Society B: Statistical Methodology,
22, 302–306.
Bürkner, P.-C. (2021). Package ‘brms.’ Retrieved from https://cran.r-project.org/web/packages/brms/
brms.pdf.
Burton, A., & Altman, D. G. (2004). Missing covariate data within cancer prognostic studies: A
review of current reporting and proposed guidelines. British Journal of Cancer, 91, 4–8.
Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note.
American Statistician, 36, 153–157.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covari-
ance and mean structures: The issue of partial measurement invariance. Psychological Bul-
letin, 105, 456–466.
Cai, L. (2008). SEM of another flavour: Two new applications of the supplemented EM algorithm.
British Journal of Mathematical and Statistical Psychology, 61, 309–329.
Cai, L., & Lee, T. (2009). Covariance structure model fit testing under missing data: An applica-
tion of the supplemented EM algorithm. Multivariate Behavioral Research, 44, 281–304.
Carpenter, J. R., Goldstein, H., & Kenward, M. G. (2011). REALCOM-IMPUTE software for mul-
tilevel multiple imputation with mixed response types. Journal of Statistical Software, 45,
1–14.
Carpenter, J. R., & Kenward, M. G. (2013). Multiple imputation and its application. West Sussex,
UK: Wiley.
Casella, G. (2001). Empirical Bayes Gibbs sampling. Biostatistics, 2, 485–500.
Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. American Statistician, 46,
167–174.
Cham, H., Reshetnyak, E., Rosenfeld, B., & Breitbart, W. (2017). Full information maximum
likelihood estimation for latent variable interactions with incomplete indicators. Multivari-
ate Behavioral Research, 52, 12–30.
Chen, D. (2018). A comparison of alternative bias-corrections in the bias-corrected bootstrap test of
mediation. (PhD thesis), University of Nebraska, Lincoln.
Chen, F. (2011, April). The RANDOM statement and more: Moving on with PROC MCMC. Paper
presented at the Proceedings of the SAS Global Forum 2011 Conference, Las Vegas, NV.
Chen, H. Y., & Little, R. (1999). A test of missing completely at random for generalised estimation
equations with missing data. Biometrika, 86, 1–13.
Chung, S., & Cai, L. (2019). Alternative multiple imputation inference for categorical structural
equation modeling. Multivariate Behavioral Research, 54, 323–337.
Chung, Y., Gelman, A., Rabe-Hesketh, S., Liu, J. C., & Dorie, V. (2015). Weakly informative prior
for point estimation of covariance matrices in hierarchical models. Journal of Educational
and Behavioral Statistics, 40, 136–157.
Cobb, L., Koppstein, P., & Chen, N. H. (1983). Estimation and moment recursion relations for
multimodal distributions of the exponential family. Journal of the American Statistical
Association, 78, 124–130.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2002). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological Methods, 6, 330–351.
Cook, R. J., Zeng, L., & Yi, G. Y. (2004). Marginal analysis of incomplete longitudinal binary
data: A cautionary note on LOCF imputation. Biometrics, 60, 820–828.
Cowles, M. K. (1996). Accelerating Monte Carlo Markov chain convergence for cumulative-link
generalized linear models. Statistics and Computing, 6, 101–111.
Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A
comparative review. Journal of the American Statistical Association, 91, 883–904.
Coxe, S., West, S. G., & Aiken, L. S. (2009). The analysis of count data: A gentle introduction to
poisson regression and its alternatives. Journal of Personality Assessment, 91, 121–136.
Craig, C. C. (1936). On the frequency function of xy. Annals of Mathematical Statistics, 7, 1–15.
Darnieder, W. F. (2011). Bayesian methods for data-dependent priors. (PhD thesis), The Ohio State
University, Columbus, OH.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assump-
tion when multiple imputing non-Gaussian continuous outcomes: A simulation assessment.
Journal of Statistical Computation and Simulation, 78, 69–84.
Demirtas, H., & Hedeker, D. (2008a). Imputing continuous data under some non-Gaussian dis-
tributions. Statistica Neerlandica, 62, 193–205.
Demirtas, H., & Hedeker, D. (2008b). Multiple imputation under power polynomials. Communi-
cations in Statistics—Simulation and Computation, 37, 1682–1695.
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture
models for non-ignorable drop-out. Statistics in Medicine, 22, 2553–2575.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society B: Statistical Methodology, 39,
1–38.
Diggle, P., & Kenward, M. G. (1994). Informative drop-out in longitudinal data analysis. Journal
of the Royal Statistical Society C: Applied Statistics, 43, 49–93.
Dixon, W. J. (1988). BMDP statistical software. Los Angeles: University of California Press.
Du, H., Enders, C. K., Keller, B. T., Bradbury, T., & Karney, B. (2021, February 2). A Bayes-
ian latent variable selection model for nonignorable missingness. Multivariate Behavioral
Research. [Epub ahead of print]
Duncan, S. C., Duncan, T. E., & Hops, H. (1996). Analysis of longitudinal data within accelerated
longitudinal designs. Psychological Methods, 1, 236–248.
Dyklevych, O. (2014). Bayesian inference in the multinomial probit model: A case study. (Master's
thesis), Örebro University, Örebro, Sweden.
Dziak, J. J., Coffman, D. L., Lanza, S. T., Li, R., & Jermiin, L. S. (2020). Sensitivity and specificity
of information criteria. Briefings in Bioinformatics, 21, 553–565.
Edgett, G. L. (1956). Multiple regression with missing observations among the independent vari-
ables. Journal of the American Statistical Association, 51, 122–131.
Edwards, M. C., Wirth, R. J., Houts, C. R., & Xi, N. (2012). Categorical data in the structural
equation modeling framework. In R. H. Hoyle (Ed.), Handbook of structural equation model-
ing (pp. 195–208). New York: Guilford Press.
Eekhout, I., de Vet, H. C., Twisk, J. W., Brand, J. P., de Boer, M. R., & Heymans, M. W. (2014).
Missing data in a multi-item instrument were best handled by multiple imputation at the
item score level. Journal of Clinical Epidemiology, 67, 335–342.
Eekhout, I., Enders, C. K., Twisk, J. W. R., de Boer, M. R., de Vet, H. C. W., & Heymans, M. W.
(2015a). Analyzing incomplete item scores in longitudinal data by including item score
information as auxiliary variables. Structural Equation Modeling: A Multidisciplinary Journal,
22, 588–602.
Eekhout, I., Enders, C. K., Twisk, J. W. R., de Boer, M. R., de Vet, H. C. W., & Heymans, M. W.
(2015b). Including auxiliary item information in longitudinal data analyses improved han-
dling missing questionnaire outcome data. Journal of Clinical Epidemiology, 68, 637–645.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Associa-
tion, 82, 171–185.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-valida-
tion. American Statistician, 37, 36–48.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Boca Raton, FL: Chapman
& Hall.
Enders, C. K. (2001). The impact of nonnormality on full information maximum-likelihood esti-
mation for structural equation models with missing data. Psychological Methods, 6, 352–370.
Enders, C. K. (2002). Applying the Bollen–Stine bootstrap for goodness-of-fit measures to struc-
tural equation models with missing data. Multivariate Behavioral Research, 37, 359–377.
Enders, C. K. (2008). A note on the use of missing auxiliary variables in FIML-based structural
equation models. Structural Equation Modeling: A Multidisciplinary Journal, 15, 434–448.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychologi-
cal Methods, 16, 1–16.
Enders, C. K., Baraldi, A. N., & Cham, H. (2014). Estimating interaction effects with incomplete
predictor variables. Psychological Methods, 19, 39–55.
Enders, C. K., Du, H., & Keller, B. T. (2020). A model-based imputation procedure for multilevel
regression models with random coefficients, interaction effects, and other nonlinear terms.
Psychological Methods, 25, 88–112.
Enders, C. K., & Gottschall, A. C. (2011). Multiple imputation strategies for multiple group struc-
tural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 18, 35–54.
Enders, C. K., Hayes, T., & Du, H. (2018). A comparison of multilevel imputation schemes for
random coefficient models: Fully conditional specification and joint model imputation with
random covariance matrices. Multivariate Behavioral Research, 53, 695–713.
Enders, C. K., & Keller, B. T. (2019). Blimp technical appendix: Centering covariates in a Bayesian
multilevel analysis. Available at www.appliedmissingdata.com/blimp-papers.
Enders, C. K., Keller, B. T., & Levy, R. (2018). A fully conditional specification approach to
multilevel imputation of categorical and continuous variables. Psychological Methods, 23,
298–317.
Enders, C. K., & Mansolf, M. (2018). Assessing the fit of structural equation models with multi-
ply imputed data. Psychological Methods, 23, 76–93.
Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and
evaluation of joint modeling and chained equations imputation. Psychological Methods, 21,
222–240.
Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel
models: A new look at an old issue. Psychological Methods, 12, 121–138.
Erler, N. S., Rizopoulos, D., Jaddoe, V. W., Franco, O. H., & Lesaffre, E. M. (2019). Bayesian
imputation of time-varying covariates in linear mixed models. Statistical Methods in Medical
Research, 28, 555–568.
Erler, N. S., Rizopoulos, D., Rosmalen, J., Jaddoe, V. W., Franco, O. H., & Lesaffre, E. M. (2016).
Dealing with missing covariates in epidemiologic studies: A comparison between multiple
imputation and a full Bayesian approach. Statistics in Medicine, 35, 2955–2974.
Fears, T. R., Benichou, J., & Gail, M. H. (1996). A reminder of the fallibility of the Wald statistic.
American Statistician, 50, 226–227.
Feldman, B. J., & Rabe-Hesketh, S. (2012). Modeling achievement trajectories when attrition is
informative. Journal of Educational and Behavioral Statistics, 37, 703–736.
Finch, H. (2008). Estimation of item response theory parameters in the presence of missing data.
Journal of Educational Measurement, 45, 225–245.
Finch, J. F., West, S. G., & MacKinnon, D. P. (1997). Effects of sample size and nonnormality on
the estimation of mediated effects in latent variable models. Structural Equation Modeling: A
Multidisciplinary Journal, 4, 87–107.
Finkbeiner, C. (1979). Estimation for the multiple factor model when data are missing. Psy-
chometrika, 44, 409–420.
Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation
models. In G. R. Hancock & R. O. Mueller (Eds.), A second course in structural equation mod-
eling (2nd ed., pp. 439–492). Charlotte, NC: Information Age.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples
from an indefinitely large population. Biometrika, 10, 507–521.
Freedman, D. A. (2006). On the so-called “Huber sandwich estimator” and “robust standard
errors.” American Statistician, 60, 299–302.
Fritz, M. S., Taylor, A. B., & MacKinnon, D. P. (2012). Explanation of two anomalous results in
statistical mediation analysis. Multivariate Behavioral Research, 47, 61–87.
Frühwirth-Schnatter, S., & Frühwirth, R. (2010). Data augmentation and MCMC for binary and
multinomial logit models. In T. Kneib & G. Tutz (Eds.), Statistical modelling and regression
structures (pp. 111–131). Heidelberg, Germany: Springer.
Garner, D. M., Olmsted, M. P., Bohr, Y., & Garfinkel, P. E. (1982). The Eating Attitudes Test: Psy-
chometric features and clinical correlates. Psychological Medicine, 12, 871–878.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal
densities. Journal of the American Statistical Association, 85, 398–409.
Gelman, A. (2004). Parameterization and Bayesian modeling. Journal of the American Statistical
Association, 99, 537–545.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian
Analysis, 1, 515–533.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian
data analysis (3rd ed.). Boca Raton, FL: CRC Press.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models.
New York: Cambridge University Press.
Gelman, A., Lee, D., & Guo, J. (2015). Stan: A probabilistic programming language for Bayesian
inference and optimization. Journal of Educational and Behavioral Statistics, 40, 530–543.
Gelman, A., & Raghunathan, T. (2001). [Conditionally specified distributions: An introduction]:
Comment. Statistical Science, 16, 268–269.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences.
Statistical Science, 7, 457–472.
Genz, A. (1993). Comparison of methods for the computation of multivariate normal probabili-
ties. Computing Science and Statistics, 25, 400–405.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., . . . Hothorn, T. (2019). Package
‘mvtnorm.’ Retrieved from https://cran.r-project.org/web/packages/mvtnorm/mvtnorm.pdf.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating poste-
rior moments. In J. M. Bernado, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian
Statistics 4 (pp. 169–193). Oxford, UK: Clarendon Press.
Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, 7, 473–483.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in
practice. London: Chapman & Hall.
Glynn, R. J., & Laird, N. M. (1986). Regression estimates and missing data: Complete-case analysis
(Technical Report). Cambridge, MA: Harvard School of Public Health, Department of Bio-
statistics.
Gold, M. S., & Bentler, P. M. (2000). Treatments of missing data: A Monte Carlo comparison of
RBHDI, iterative stochastic regression imputation, and expectation-maximization. Struc-
tural Equation Modeling: A Multidisciplinary Journal, 7, 319–355.
Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models
with missing data in responses and covariates that may include interactions and non-linear
terms. Journal of the Royal Statistical Society A: Statistics in Society, 177, 553–564.
Goldstein, H., Carpenter, J., Kenward, M. G., & Levin, K. A. (2009). Multilevel models with mul-
tivariate mixed response types. Statistical Modelling, 9, 173–197.
Gomer, B., & Yuan, K.-H. (2021, June 28). Subtypes of the missing not at random missing data
mechanism. Psychological Methods. [Epub ahead of print]
Gonzalez, R., & Griffin, D. (2001). Testing parameters in structural equation modeling: Every
“one” matters. Psychological Methods, 6, 258–269.
Gottfredson, N. C., Bauer, D. J., & Baldwin, S. A. (2014). Modeling change in the presence of non-
randomly missing data: Evaluating a shared parameter mixture model. Structural Equation
Modeling: A Multidisciplinary Journal, 21, 196–209.
Gottfredson, N. C., Bauer, D. J., Baldwin, S. A., & Okiishi, J. C. (2014). Using a shared parameter
mixture model to estimate change during treatment when termination is related to recovery
speed. Journal of Consulting and Clinical Psychology, 82, 813–827.
Gottfredson, N. C., Sterba, S. K., & Jackson, K. M. (2017). Explicating the conditions under
which multilevel multiple imputation mitigates bias resulting from random coefficient-
dependent missing longitudinal data. Prevention Science, 18, 12–19.
Gottschall, A. C., West, S. G., & Enders, C. K. (2012). A comparison of item-level and scale-level
multiple imputation for questionnaire batteries. Multivariate Behavioral Research, 47, 1–25.
Gourieroux, C., Monfort, A., & Trognon, A. (1984). Pseudo maximum likelihood methods: The-
ory. Econometrica, 52, 681–700.
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation
models. Structural Equation Modeling: A Multidisciplinary Journal, 10, 80–100.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of
Psychology, 60, 549–576.
Graham, J. W. (2012). Missing data: Analysis and design. New York: Springer.
Graham, J. W., Cumsille, P. E., & Shevock, A. E. (2013). Methods for handling missing data. In J.
A. Schinka & W. F. Velicer (Eds.), Research methods in psychology (Vol. 3). New York: Wiley.
Graham, J. W., Hofer, S. M., & MacKinnon, D. P. (1996). Maximizing the usefulness of data
obtained with planned missing value patterns: An application of maximum likelihood pro-
cedures. Multivariate Behavioral Research, 31, 197–218.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really
needed?: Some practical clarifications of multiple imputation theory. Prevention Science, 8,
206–213.
Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in analysis of
change. In L. Collins & A. Sayer (Eds.), New methods for the analysis of change (pp. 335–353).
Washington, DC: American Psychological Association.
Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data
designs in psychological research. Psychological Methods, 11, 323–343.
Greene, W. H. (2017). Econometric analysis (8th ed.). Boston: Prentice Hall.
Grimm, K. J., Ram, N., & Estabrook, R. (2016). Growth modeling: Structural equation and multi-
level modeling approaches. New York: Guilford Press.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016a). Multiple imputation of missing covariate values
in multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48,
640–649.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016b). Multiple imputation of multilevel missing data:
An introduction to the R package pan. Sage Open, 6, 1–17.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016c). Pooling ANOVA results from multiply imputed
datasets: A simulation study. Methodology, 12, 75–88.
Grund, S., Lüdtke, O., & Robitzsch, A. (2017). Multiple imputation of missing data at level 2: A
comparison of fully conditional and joint modeling in multilevel designs. Journal of Educa-
tional and Behavioral Statistics, 43, 316–353.
Grund, S., Lüdtke, O., & Robitzsch, A. (2018). Multiple imputation of missing data for multi-
level models: Simulations and recommendations. Organizational Research Methods, 21, 111–
149.
Grund, S., Lüdtke, O., & Robitzsch, A. (2021, May 23). Multiple imputation of missing data in
multilevel models with the R package mdmb: A flexible sequential modeling approach.
Behavior Research Methods. [Epub ahead of print]
Grund, S., Robitzsch, A., & Lüdtke, O. (2021). Package ‘mitml.’ Retrieved from https://cran.r-
project.org/web/packages/mitml/mitml.pdf.
Guo, J., Gabry, J., Goodrich, B., & Weber, S. (2020). Package ‘rstan.’ Retrieved from https://cran.r-
project.org/web/packages/rstan/rstan.pdf.
Hamaker, E. L., & Muthén, B. (2020). The fixed versus random effects debate and how it relates
to centering in multilevel modeling. Psychological Methods, 25, 365–379.
Hancock, G. R., & Liu, M. (2012). Bootstrapping standard errors and data-model fit statistics in
structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling
(pp. 296–306). New York: Guilford Press.
Hardt, J., Herke, M., & Leonhart, R. (2012). Auxiliary variables in multiple imputation in regres-
sion with missing X: A warning against including too many in small sample research. BMC
Medical Research Methodology, 12, Article 184.
Harel, O. (2007). Inferences on missing information under multiple imputation and two-stage
multiple imputation. Statistical Methodology, 4, 75–89.
Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14,
174–194.
Hartley, H. O., & Hocking, R. R. (1971). The analysis of incomplete data. Biometrics, 27, 783–823.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applica-
tions. Biometrika, 57, 97–109.
Hayes, A. F. (2013). Introduction to mediation, moderation, and conditional process analysis: A
regression-based approach. New York: Guilford Press.
Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in
OLS regression: An introduction and software implementation. Behavior Research Methods,
39, 709–722.
He, Y. L., & Raghunathan, T. E. (2009). On the performance of sequential regression multiple
imputation methods with non normal error distributions. Communications in Statistics—
Simulation and Computation, 38, 856–883.
Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection
and limited dependent variables and a simple estimator for such models. Annals of Economic
and Social Measurement, 5, 475–492.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–161.
Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for
missing data in longitudinal studies. Psychological Methods, 2, 64–78.
Hedeker, D., & Gibbons, R. D. (2006). Longitudinal data analysis. Hoboken, NJ: Wiley.
Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-random-
ized trials in education. Educational Evaluation and Policy Analysis, 29, 60–87.
Hoff, P. D. (2009). A first course in Bayesian statistical methods. New York: Springer.
Hogan, J. W., & Laird, N. M. (1997a). Mixture models for the joint distribution of repeated mea-
sures and event times. Statistics in Medicine, 16, 239–257.
Hogan, J. W., & Laird, N. M. (1997b). Model-based approaches to analysing incomplete longitu-
dinal and failure time data. Statistics in Medicine, 16, 259–272.
Holmes, C. C., & Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial
regression. Bayesian Analysis, 1, 145–168.
Honaker, J., King, G., & Blackwell, M. (2021). Package ‘Amelia.’ Retrieved from https://cran.r-
project.org/web/packages/amelia/amelia.pdf.
Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple
imputation. American Statistician, 57, 229–232.
Howard, W. J., Rhemtulla, M., & Little, T. D. (2015). Using principal components as auxiliary
variables in missing data estimation. Multivariate Behavioral Research, 50, 285–299.
Hox, J. J., Moerbeek, M., & Van de Schoot, R. (2017). Multilevel analysis: Techniques and applica-
tions. New York: Routledge.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisci-
plinary Journal, 6, 1–55.
Hubbard, A., & Enders, C. K. (2022). Applying multiple imputation pooling rules to estimands with
non-normal sampling distributions. Manuscript in progress.
Huisman, M. (2000). Imputation of missing item responses: Some simple techniques. Quality and
Quantity, 34, 331–351.
Ibrahim, J. G. (1990). Incomplete data in generalized linear models. Journal of the American Sta-
tistical Association, 85, 765–769.
Ibrahim, J. G., Chen, M. H., & Lipsitz, S. R. (2002). Bayesian methods for generalized linear mod-
els with covariates missing at random. Canadian Journal of Statistics, 30, 55–78.
Ibrahim, J. G., Chen, M. H., Lipsitz, S. R., & Herring, A. H. (2005). Missing-data methods for
generalized linear models: A comparative review. Journal of the American Statistical Associa-
tion, 100, 332–346.
Ibrahim, J. G., Lipsitz, S. R., & Chen, M. H. (1999). Missing covariates in generalized linear
models when the missing data mechanism is non-ignorable. Journal of the Royal Statistical
Society B: Statistical Methodology, 61, 173–190.
Imai, K., & van Dyk, D. A. (2005). A Bayesian analysis of the multinomial probit model using
marginal data augmentation. Journal of Econometrics, 124, 311–334.
Jackman, S. (2000). Estimation and inference via Bayesian simulation: An introduction to Mar-
kov chain Monte Carlo. American Journal of Political Science, 44, 375–404.
Jackman, S. (2009). Bayesian analysis for the social sciences. West Sussex, UK: Wiley.
Jamshidian, M., & Bentler, P. M. (1999). ML estimation of mean and covariance structures with
missing data using complete data routines. Journal of Educational and Behavioral Statistics,
24, 21–41.
Jamshidian, M., & Jalal, S. (2010). Tests of homoscedasticity, normality, and missing completely
at random for incomplete multivariate data. Psychometrika, 75, 649–674.
Jansen, I., Hens, N., Molenberghs, G., Aerts, M., Verbeke, G., & Kenward, M. G. (2006). The
nature of sensitivity in monotone missing not at random models. Computational Statistics
and Data Analysis, 50, 830–858.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceed-
ings of the Royal Society of London A: Mathematical and Physical Sciences, 186, 453–461.
Jeffreys, H. (1961). Theory of probability (3rd ed.). London: Oxford University Press.
Jeličić, H., Phelps, E., & Lerner, R. M. (2009). Use of missing data methods in longitudinal stud-
ies: The persistence of bad practices in developmental psychology. Developmental Psychol-
ogy, 45, 1195–1199.
Jinadasa, K., & Tracy, D. (1992). Maximum likelihood estimation for multivariate normal dis-
tribution with monotone sample. Communications in Statistics—Theory and Methods, 21,
41–50.
Johnson, E. G. (1992). The design of the National Assessment of Educational Progress. Journal of
Educational Measurement, 29, 95–110.
Johnson, V. E. (1996). Studying convergence of Markov chain Monte Carlo algorithms using
coupled sample paths. Journal of the American Statistical Association, 91, 154–166.
Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis.
Psychometrika, 34, 183–202.
Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three
approaches. Multivariate Behavioral Research, 36, 347–387.
Jöreskog, K. G., & Sörbom, D. (2021). LISREL 11 for Windows. Skokie, IL: Scientific Software
International.
Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2021). Package ‘sem-
Tools.’ Retrieved from https://cran.r-project.org/web/packages/semtools/semtools.pdf.
Jose, P. E. (2013). Doing statistical mediation and moderation. New York: Guilford Press.
Judd, C. M., & Kenny, D. A. (1981). Process analysis: Estimating mediation in treatment evalua-
tions. Evaluation Review, 5, 602–619.
Kaplan, D. (1990). Evaluating and modifying covariance structure models: A review and recom-
mendation. Multivariate Behavioral Research, 25, 137–155.
Kaplan, D. (2009). Structural equation modeling: Foundations and extensions (2nd ed.). Thousand
Oaks, CA: Sage.
Kaplan, D. (2014). Bayesian statistics for the social sciences. New York: Guilford Press.
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In R. Hoyle (Ed.),
Handbook of structural equation modeling (pp. 650–673). New York: Guilford Press.
Karahalios, A., Baglietto, L., Carlin, J. B., English, D. R., & Simpson, J. A. (2012). A review of the
reporting and handling of missing data in cohort studies with repeated assessment of expo-
sure measures. BMC Medical Research Methodology, 12, 1–10.
Kasim, R. M., & Raudenbush, S. W. (1998). Application of Gibbs sampling to nested variance
components models with heterogeneous within-group variance. Journal of Educational and
Behavioral Statistics, 23, 93–116.
Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal
of the American Statistical Association, 91, 1343–1370.
Keller, B. T. (2022). Model-based missing data handling for scale score and latent variable interac-
tions. Manuscript in progress.
Keller, B. T., & Enders, C. K. (2021). Blimp user’s guide (Version 3). Retrieved from www.applied-
missingdata.com/blimp.
Keller, B. T., & Enders, C. K. (2022). An investigation of factored regression missing data methods for
multilevel models with cross-level interactions. Manuscript submitted for publication.
Kenward, M. G. (1998). Selection models for repeated measurements with non-random dropout:
An illustration of sensitivity. Statistics in Medicine, 17, 2723–2732.
Kenward, M. G., & Molenberghs, G. (1998). Likelihood based frequentist inference when data
are missing at random. Statistical Science, 13, 236–247.
Kenward, M. G., & Molenberghs, G. (2014). A perspective and historical overview on selection,
pattern-mixture and shared parameter models. In G. Molenberghs, G. Fitzmaurice, M. G.
Kenward, A. Tsiatis, & G. Verbeke (Eds.), Handbook of missing data methodology (pp. 53–90).
Boca Raton, FL: CRC Press.
Kim, J. K., Brick, J. M., Fuller, W. A., & Kalton, G. (2006). On the bias of the multiple-imputation
variance estimator in survey sampling. Journal of the Royal Statistical Society B: Statistical
Methodology, 68, 509–521.
Kim, K. H., & Bentler, P. M. (2002). Tests of homogeneity of means and covariance matrices for
multivariate incomplete data. Psychometrika, 67, 609–623.
Kim, S., Belin, T. R., & Sugar, C. A. (2018). Multiple imputation with non-additively related
variables: Joint-modeling and approximations. Statistical Methods in Medical Research, 27,
1683–1694.
Kim, S., Sugar, C. A., & Belin, T. R. (2015). Evaluating model-based imputation methods for miss-
ing covariates in regression models with interactions. Statistics in Medicine, 34, 1876–1888.
King, G., & Roberts, M. E. (2015). How robust standard errors expose methodological problems
they do not fix, and what to do about it. Political Analysis, 23, 159–179.
Klebanoff, M. A., & Cole, S. R. (2008). Use of multiple imputation in the epidemiologic literature.
American Journal of Epidemiology, 168, 355–357.
Kleinke, K. (2017). Multiple imputation under violated distributional assumptions: A systematic
evaluation of the assumed robustness of predictive mean matching. Journal of Educational
and Behavioral Statistics, 42, 371–404.
Kline, R. B. (2015). Principles and practice of structural equation modeling (4th ed.). New York:
Guilford Press.
Kopylov, I. (2008). Subjective probability. In T. Rudas (Ed.), Handbook of probability: Theory and
applications (pp. 35–48). Thousand Oaks, CA: Sage.
Kreft, I. G., de Leeuw, J., & Aiken, L. S. (1995). The effect of different forms of centering in hier-
archical linear models. Multivariate Behavioral Research, 30, 1–21.
Kruschke, J. K., & Liddell, T. M. (2018). Bayesian data analysis for newcomers. Psychonomic Bul-
letin and Review, 25, 155–177.
Kunkel, D., & Kaizar, E. E. (2017). A comparison of existing methods for multiple imputation in
individual participant data meta-analysis. Statistics in Medicine, 36, 3507–3532.
Lee, K. J., & Carlin, J. B. (2017). Multiple imputation in the presence of non-normal data. Statis-
tics in Medicine, 36, 606–617.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment
on Trafimow (2003). Psychological Review, 112, 662–668.
Lee, T., & Cai, L. (2012). Alternative multiple imputation inference for mean and covariance
structure modeling. Journal of Educational and Behavioral Statistics, 37, 675–702.
Levy, R., & Enders, C. (2021, May 6). Full conditional distributions for Bayesian multilevel mod-
els with additive or interactive effects and missing data on covariates. Communications in
Statistics—Simulation and Computation. [Epub ahead of print]
Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. Boca Raton, FL: CRC Press.
Li, K.-H., Meng, X.-L., Raghunathan, T. E., & Rubin, D. B. (1991). Significance levels from
repeated p-values with multiply-imputed data. Statistica Sinica, 1, 65–92.
Li, K. H., Raghunathan, T. E., & Rubin, D. B. (1991). Large-sample significance levels from mul-
tiply imputed data using moment-based statistics and an F reference distribution. Journal of
the American Statistical Association, 86, 1065–1073.
Liang, J., & Bentler, P. M. (2004). An EM algorithm for fitting two-level structural equation mod-
els. Psychometrika, 69, 101–122.
Lipsitz, S. R., & Ibrahim, J. G. (1996). A conditional model for incomplete covariates in paramet-
ric regression models. Biometrika, 83, 916–922.
Little, R. (2009). Selection and pattern-mixture models. In G. Fitzmaurice, M. Davidian, G.
Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 409–431). Boca Raton, FL:
Chapman & Hall.
Little, R. J. (1988a). Missing-data adjustments in large surveys. Journal of Business and Economic
Statistics, 6, 287–296.
Little, R. J. A. (1988b). A test of missing completely at random for multivariate data with missing
values. Journal of the American Statistical Association, 83, 1198–1202.
Little, R. J. A. (1992). Regression with missing X’s: A review. Journal of the American Statistical
Association, 87, 1227–1237.
Little, R. J. A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the
American Statistical Association, 88, 125–134.
Little, R. J. A. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika,
81, 471–483.
Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of
the American Statistical Association, 90, 1112–1121.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. Hoboken, NJ: Wiley.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken,
NJ: Wiley.
Little, R. J. A., & Rubin, D. B. (2020). Statistical analysis with missing data (3rd ed.). Hoboken, NJ:
Wiley.
Little, T. D. (2013). Longitudinal structural equation modeling. New York: Guilford Press.
Little, T. D., & Rhemtulla, M. (2013). Planned missing data designs for developmental research-
ers. Child Development Perspectives, 7, 199–204.
Liu, C. (1995). Missing data imputation using the multivariate t distribution. Journal of Multivari-
ate Analysis, 53, 139–158.
Liu, G., & Gould, A. L. (2002). Comparison of alternative strategies for analysis of longitudinal
trials with dropouts. Journal of Biopharmaceutical Statistics, 12, 207–226.
Liu, H. Y., Zhang, Z. Y., & Grimm, K. J. (2016). Comparison of inverse Wishart and separation-
strategy priors for Bayesian estimation of covariance parameter matrix in growth curve
analysis. Structural Equation Modeling: A Multidisciplinary Journal, 23, 354–367.
Liu, J. C., Gelman, A., Hill, J., Su, Y. S., & Kropko, J. (2014). On the stationary distribution of
iterative imputations. Biometrika, 101, 155–173.
Liu, Y., & Enders, C. K. (2017). Evaluation of multi-parameter test statistics for multiple imputa-
tion. Multivariate Behavioral Research, 52, 371–390.
Liu, Y., & Sriutaisuk, S. (2019). Evaluation of model fit in structural equation models with ordi-
nal missing data: An examination of the D2 method. Structural Equation Modeling: A Multi-
disciplinary Journal, 27, 561–583.
Lomnicki, Z. A. (1967). On the distribution of products of random variables. Journal of the Royal
Statistical Society, 29, 513–524.
Longford, N. (1989). Contextual effects and group means. Multilevel Modelling Newsletter, 1, 5–11.
Lord, F. M. (1955). Estimation of parameters from incomplete data. Journal of the American Sta-
tistical Association, 50, 870–876.
Lord, F. M. (1962). Estimating norms by item-sampling. Educational and Psychological Measure-
ment, 22, 259–267.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm.
Journal of the Royal Statistical Society B: Statistical Methodology, 44, 226–233.
Lüdtke, O., Marsh, H. W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008).
The multilevel latent covariate model: A new, more reliable approach to group-level effects
in contextual studies. Psychological Methods, 13, 201–229.
Lüdtke, O., Robitzsch, A., & Grund, S. (2017). Multiple imputation of missing data in multilevel
designs: A comparison of different strategies. Psychological Methods, 22, 141–165.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020a). Analysis of interactions and nonlinear effects
with missing data: A factored regression modeling approach using maximum likelihood
estimation. Multivariate Behavioral Research, 55, 361–381.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020b). Regression models involving nonlinear effects
with missing data: A sequential modeling approach using Bayesian estimation. Psychologi-
cal Methods, 25, 157–181.
Lunn, D., Jackson, C., Thomas, A., & Spiegelhalter, D. (2013). The BUGS book. Boca Raton, FL:
CRC Press.
Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists.
Berlin: Springer.
MacCallum, R. (1986). Specification searches in covariance structure modeling. Psychological
Bulletin, 100, 107–120.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covari-
ance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111,
490–504.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York: Erlbaum.
MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A com-
parison of methods to test mediation and other intervening variable effects. Psychological
Methods, 7, 83–104.
MacKinnon, D. P., Lockwood, C. M., & Williams, J. (2004). Confidence limits for the indi-
rect effect: Distribution of the product and resampling methods. Multivariate Behavioral
Research, 39, 99–128.
Madley-Dowd, P., Hughes, R., Tilling, K., & Heron, J. (2019). The proportion of missing data
should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiol-
ogy, 110, 63–73.
Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and
econometrics (3rd ed.). West Sussex, UK: Wiley.
Mallinckrodt, C. H., Clark, W. S., & David, S. R. (2001). Accounting for dropout bias using
mixed-effects models. Journal of Biopharmaceutical Statistics, 11, 9–21.
Manly, C. A., & Wells, R. S. (2015). Reporting the use of multiple imputation for missing data in
higher education research. Research in Higher Education, 56, 397–409.
Mansolf, M., Jorgensen, T. D., & Enders, C. K. (2020). A multiple imputation score test for model
modification in structural equation models. Psychological Methods, 25, 393–411.
Marshall, A., Altman, D. G., Holder, R. L., & Royston, P. (2009). Combining estimates of interest
in prognostic modelling studies after multiple imputation: Current practice and guidelines.
BMC Medical Research Methodology, 9, 1–8.
Matz, A. W. (1978). Maximum likelihood parameter estimation for the quartic exponential dis-
tribution. Technometrics, 20, 475–484.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-
of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statisti-
cal Association, 100, 1009–1020.
Mazza, G. L., Enders, C. K., & Ruehlman, L. S. (2015). Addressing item-level missing data: A
comparison of proration and full information maximum likelihood estimation. Multivariate
Behavioral Research, 50, 504–519.
McCulloch, R., & Rossi, P. E. (1994). An exact likelihood analysis of the multinomial probit
model. Journal of Econometrics, 64, 207–240.
McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial
probit model with fully identified parameters. Journal of Econometrics, 99, 173–193.
McDonald, R. P., & Ho, M. H. (2002). Principles and practice in reporting structural equation
analyses. Psychological Methods, 7, 64–82.
McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level depen-
dent variables. Journal of Mathematical Sociology, 4, 103–120.
McLachlan, G. J., & Krishnan, T. (2007). The EM algorithm and extensions. Hoboken, NJ: Wiley.
McNeish, D. (2016a). On using Bayesian methods to address small sample problems. Structural
Equation Modeling: A Multidisciplinary Journal, 23, 750–773.
McNeish, D. M. (2016b). Using data-dependent priors to mitigate small sample bias in latent
growth models: A discussion and illustration using Mplus. Journal of Educational and Behav-
ioral Statistics, 41, 27–56.
McNeish, D., & Kelley, K. (2019). Fixed effects models versus mixed effects models for clustered
data: Reviewing the approaches, disentangling the differences, and making recommenda-
tions. Psychological Methods, 24, 20–35.
McNeish, D., Stapleton, L. M., & Silverman, R. D. (2017). On the unnecessary ubiquity of hierar-
chical linear modeling. Psychological Methods, 22, 114–140.
Mealli, F., & Rubin, D. B. (2016). Clarifying missing at random and related definitions, and
implications when coupled with exchangeability. Biometrika, 103, 491–491.
Mehta, P. D., & Neale, M. C. (2005). People are variables too: Multilevel structural equations
modeling. Psychological Methods, 10, 259–284.
Mehta, P. D., & West, S. G. (2000). Putting the individual back into individual growth curves.
Psychological Methods, 5, 23–43.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical
Science, 9, 538–558.
Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance–covariance matri-
ces: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.
Meng, X.-L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data
sets. Biometrika, 79, 103–111.
Merkle, E. C., Fitzsimmons, E., Uanhoro, J., & Goodrich, B. (2020). Efficient Bayesian structural
equation modeling in Stan. Retrieved from https://arxiv.org/pdf/2008.07733.pdf.
Merkle, E. C., & Rosseel, Y. (2018). blavaan: Bayesian structural equation models via parameter
expansion. Journal of Statistical Software, 85, 1–30.
Merkle, E. C., Rosseel, Y., Goodrich, B., & Garnier-Villarreal, M. (2021). Package ‘blavaan.’
Retrieved from https://cran.r-project.org/web/packages/blavaan/blavaan.pdf.
Mi, X., Miwa, T., & Hothorn, T. (2009). mvtnorm: New numerical algorithm for multivariate
normal probabilities. R Journal, 1, 37–39.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological
Bulletin, 105, 156–166.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex sam-
ples. Psychometrika, 56, 177–196.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population charac-
teristics from sparse matrix samples of item responses. Journal of Educational Measurement,
29, 133–161.
Mistler, S. A., & Enders, C. K. (2011). An introduction to planned missing data designs for devel-
opmental research. In B. Laursen, T. Little, & N. Card (Eds.), Handbook of developmental
research methods (pp. 742–754). New York: Guilford Press.
Mistler, S. A., & Enders, C. K. (2017). A comparison of joint model and fully conditional speci-
fication imputation for multilevel missing data. Journal of Educational and Behavioral Statis-
tics, 42, 432–466.
Mohan, K., Pearl, J., & Tian, J. (2013). Graphical models for inference with missing data. In
Advances in Neural Information Processing Systems 26. Red Hook, NY: Curran Associates.
Molenberghs, G., Beunckens, C., Sotto, C., & Kenward, M. G. (2008). Every missingness not at
random model has a missingness at random counterpart with equal fit. Journal of the Royal
Statistical Society B: Statistical Methodology, 70, 371–388.
Molenberghs, G., & Kenward, M. (2007). Missing data in clinical studies. West Sussex, UK: Wiley.
Molenberghs, G., Michiels, B., Kenward, M. G., & Diggle, P. J. (1998). Monotone missing data and
pattern-mixture models. Statistica Neerlandica, 52, 153–161.
Molenberghs, G., Thijs, H., Jansen, I., Beunckens, C., Kenward, M. G., Mallinckrodt, C., & Car-
roll, R. J. (2004). Analyzing incomplete longitudinal clinical trial data. Biostatistics, 5, 445–
464.
Molenberghs, G., & Verbeke, G. (2001). A review on linear mixed models for longitudinal data,
possibly subject to dropout. Statistical Modelling, 1, 235–269.
Molenberghs, G., Verbeke, G., Thijs, H., Lesaffre, E., & Kenward, M. G. (2001). Influence analy-
sis to assess sensitivity of the dropout process. Computational Statistics and Data Analysis,
37, 93–113.
Montgomery, D. C. (2020). Design and analysis of experiments (10th ed.). Hoboken, NJ: Wiley.
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statisti-
cal methods. Statistics in Medicine, 38, 2074–2102.
Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical,
and continuous latent variable indicators. Psychometrika, 49, 115–132.
Muthén, B., & Asparouhov, T. (2008). Growth mixture modeling: Analysis with non-Gaussian
random effects. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Lon-
gitudinal data analysis (pp. 143–165). Boca Raton, FL: Chapman & Hall.
Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible
representation of substantive theory. Psychological Methods, 17, 313–335.
Muthén, B., Asparouhov, T., Hunter, A. M., & Leuchter, A. F. (2011). Growth modeling with non-
ignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological
Methods, 16, 17–33.
Muthén, B., du Toit, S. H. C., & Spisic, D. (1997). Robust inference using weighted least squares
and quadratic estimating equations in latent variable modeling with categorical and con-
tinuous outcomes. Unpublished technical report. Retrieved from www.statmodel.com/down-
load/article_075.pdf.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are
not missing completely at random. Psychometrika, 52, 431–462.
Muthén, B., & Masyn, K. (2005). Discrete-time survival mixture analysis. Journal of Educational
and Behavioral Statistics, 30, 27–58.
Muthén, B., Muthén, L., & Asparouhov, T. (2016). Regression and mediation analysis using Mplus.
Los Angeles: Muthén & Muthén.
Muthén, B., & Shedden, K. (1999). Finite mixture modeling with mixture outcomes using the
EM algorithm. Biometrics, 55, 463–469.
Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide. (8th ed.). Los Angeles: Muthén
& Muthén.
Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size
and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9, 599–620.
Mykland, P., Tierney, L., & Yu, B. (1995). Regeneration in Markov-chain samplers. Journal of the
American Statistical Association, 90, 233–241.
Nandram, B., & Chen, M.-H. (1996). Reparameterizing the generalized linear model to accelerate
Gibbs sampler convergence. Journal of Statistical Computation and Simulation, 54, 129–144.
Neelon, B. (2019). Bayesian zero-inflated negative binomial regression based on Pólya-gamma
mixtures. Bayesian Analysis, 14, 829–855.
Nesselroade, J. R., & Baltes, P. B. (1979). Longitudinal research in the study of behavior and develop-
ment. New York: Academic Press.
Nielsen, S. F. (2003). Proper and improper multiple imputation. International Statistical Review,
71, 593–607.
O’Brien, S. M., & Dunson, D. B. (2004). Bayesian multivariate logistic regression. Biometrics, 60,
739–746.
O’Hagan, A. (2008). The Bayesian approach to statistics. In T. Rudas (Ed.), Handbook of probabil-
ity: Theory and applications (pp. 85–100). Thousand Oaks, CA: Sage.
Olkin, I., & Tate, R. F. (1961). Multivariate correlation models with mixed discrete and continu-
ous variables. Annals of Mathematical Statistics, 32, 448–465.
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psy-
chometrika, 44, 443–460.
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applica-
tions. In Proceedings from the Sixth Berkeley Symposium on Mathematical Statistics and Prob-
ability: Vol. 1. Theory of statistics (pp. 697–715). Berkeley: University of California Press.
Palomo, J., Dunson, D. B., & Bollen, K. (2007). Bayesian structural equation modeling. In S.-Y.
Lee (Ed.), Handbook of latent variable and related models (pp. 163–188). Amsterdam: Elsevier.
Pan, Q., & Wei, R. (2016). Fraction of missing information (γ) at different missing data fractions
in the 2012 NAMCS Physician Workflow Mail Survey. Applied Mathematics, 7, 1057–1067.
Park, T., & Lee, S. Y. (1997). A test of missing completely at random for longitudinal data with
missing observations. Statistics in Medicine, 16, 1859–1871.
Pawitan, Y. (2000). A reminder of the fallibility of the Wald statistic: Likelihood explanation.
American Statistician, 54, 54–56.
Paxton, P., Curran, P. J., Bollen, K. A., Kirby, J., & Chen, F. N. (2001). Monte Carlo experiments:
Design and implementation. Structural Equation Modeling: A Multidisciplinary Journal, 8,
287–312.
Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of reporting
practices and suggestions for improvement. Review of Educational Research, 74, 525–556.
Peyre, H., Leplege, A., & Coste, J. (2011). Missing data methods for dealing with missing items
in quality of life questionnaires: A comparison by simulation of personal mean score, full
information maximum likelihood, multiple imputation, and hot deck techniques applied to
the SF-36 in the French 2003 decennial health survey. Quality of Life Research, 20, 287–300.
Plummer, M. (2019). Package ‘rjags.’ Retrieved from https://cran.r-project.org/web/packages/rjags/
rjags.pdf.
Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–
gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
Poon, W.-Y., & Lee, S.-Y. (1998). Analysis of two-level structural equation models via EM-type
algorithm. Statistica Sinica, 8, 749–766.
Potthoff, R. F., Tudor, G. E., Pieper, K. S., & Hasselblad, V. (2006). Can one assess whether miss-
ing data are missing at random in medical studies? Statistical Methods in Medical Research,
15, 213–234.
Pritikin, J. N., Brick, T. R., & Neale, M. C. (2018). Multivariate normal maximum likelihood
with both ordinal and continuous variables, and data missing at random. Behavior Research
Methods, 50, 490–500.
Puhani, P. A. (2000). The Heckman correction for sample selection and its critique. Journal of
Economic Surveys, 14, 53–68.
Quartagno, M., & Carpenter, J. R. (2016). Multiple imputation for IPD meta-analysis: Allowing
for heterogeneity and studies with missing covariates. Statistics in Medicine, 35, 2938–2954.
Quartagno, M., & Carpenter, J. R. (2019). Multiple imputation for discrete data: Evaluation of the
joint latent normal model. Biometrical Journal, 61, 1003–1019.
Quartagno, M., & Carpenter, J. (2020). Package ‘jomo.’ Retrieved from https://cran.r-project.org/
web/packages/jomo/jomo.pdf.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). Generalized multilevel structural equation
modeling. Psychometrika, 69, 167–190.
Rabe-Hesketh, S., Skrondal, A., & Zheng, X. (2012). Multilevel structural equation modeling.
In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 512–531). New York:
Guilford Press.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25,
111–163.
Raftery, A. E., & Lewis, S. M. (1992). [Practical Markov chain Monte Carlo]: Comment: One long
run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical
Science, 7, 493–497.
Raghunathan, T. E., & Grizzle, J. E. (1995). A split questionnaire survey design. Journal of the
American Statistical Association, 90, 54–63.
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate
technique for multiply imputing missing values using a sequence of regression models.
Survey Methodology, 27, 85–95.
Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters
with applications to problems of estimation. Mathematical Proceedings of the Cambridge Phil-
osophical Society, 44, 50–57.
Raudenbush, S. W. (1995). Maximum likelihood estimation for unbalanced multilevel covari-
ance structure models via the EM algorithm. British Journal of Mathematical and Statistical
Psychology, 48, 359–370.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis
methods (2nd ed.). Thousand Oaks, CA: Sage.
Raudenbush, S. W., Bryk, A. S., Cheong, Y., & Congdon, R. (2019). HLM for Windows [Computer
software]. Skokie, IL: Scientific Software International.
Raykov, T. (2011). On testability of missing data mechanisms in incomplete data sets. Structural
Equation Modeling: A Multidisciplinary Journal, 18, 419–429.
Raykov, T., Lichtenberg, P. A., & Paulson, D. (2012). Examining the missing completely at ran-
dom mechanism in incomplete data sets: A multiple testing approach. Structural Equation
Modeling: A Multidisciplinary Journal, 19, 399–408.
References 511

Raykov, T., & Marcoulides, G. A. (2004). Using the delta method for approximate interval estimation of parameter functions in SEM. Structural Equation Modeling: A Multidisciplinary Journal, 11, 621–637.
Raykov, T., & Marcoulides, G. A. (2014). Identifying useful auxiliary variables for incomplete
data analyses: A note on a group difference examination approach. Educational and Psycho-
logical Measurement, 74, 537–550.
Raykov, T., & West, B. T. (2015). On enhancing plausibility of the missing at random assump-
tion in incomplete data analyses via evaluation of response-auxiliary variable correlations.
Structural Equation Modeling: A Multidisciplinary Journal, 23, 45–53.
Reiter, J. P. (2007). Small-sample degrees of freedom for multi-component significance tests with
multiple imputation for missing data. Biometrika, 94, 502–508.
Reiter, J. P., & Raghunathan, T. E. (2007). The multiple adaptations of multiple imputation. Jour-
nal of the American Statistical Association, 102, 1462–1471.
Reiter, J. P., Raghunathan, T. E., & Kinney, S. K. (2006). The importance of modeling the survey
design in multiple imputation for missing data. Survey Methodology, 32, 143–150.
Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be
treated as continuous?: A comparison of robust continuous and categorical SEM estimation
methods under suboptimal conditions. Psychological Methods, 17, 354–373.
Rhemtulla, M., & Hancock, G. R. (2016). Planned missing data designs in educational psychol-
ogy research. Educational Psychologist, 51, 305–316.
Rhemtulla, M., & Little, T. D. (2012). Planned missing data designs for research in cognitive
development. Journal of Cognition and Development, 13, 425–438.
Rhemtulla, M., Savalei, V., & Little, T. D. (2016). On the asymptotic relative efficiency of planned
missingness designs. Psychometrika, 81, 60–89.
Rights, J. D., & Sterba, S. K. (2019). Quantifying explained variance in multilevel models: An
integrative framework for defining R-squared measures. Psychological Methods, 24, 309–338.
Ritter, C., & Tanner, M. A. (1992). Facilitating the Gibbs sampler: The Gibbs stopper and the
Griddy–Gibbs sampler. Journal of the American Statistical Association, 87, 861–868.
Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and Computing, 5, 121–
125.
Robert, C. P., & Casella, G. (2004). Monte Carlo statistical methods (2nd ed.). New York: Springer.
Robins, J. M., & Wang, N. (2000). Inference for imputation estimators. Biometrika, 87, 113–124.
Robitzsch, A., & Lüdtke, O. (2021). Package ‘mdmb.’ Retrieved from https://cran.r-project.org/web/
packages/mdmb/mdmb.pdf.
Rosseel, Y. (2012). lavaan: An R Package for structural equation modeling. Journal of Statistical
Software, 48, 1–36.
Rosseel, Y., Jorgensen, T. D., & Rockwood, N. J. (2021). Package ‘lavaan.’ Retrieved from https://
cran.r-project.org/web/packages/lavaan/lavaan.pdf.
Roth, P. L., Switzer, F. S., & Switzer, D. M. (1999). Missing data in multiple item scales: A Monte
Carlo analysis of missing data techniques. Organizational Research Methods, 2, 211–232.
Roy, J. (2003). Modeling longitudinal data with nonignorable dropouts using a latent dropout
class model. Biometrics, 59, 829–836.
Royston, P. (2005). Multiple imputation of missing values: Update. Stata Journal, 5, 188–201.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Rubin, D. B. (1991). EM and beyond. Psychometrika, 56, 241–254.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Asso-
ciation, 91, 473–489.
Rubin, D. B. (2003). Discussion on multiple imputation. International Statistical Review, 71, 619–
625.
Rubin, D. B. (2004). The design of a general and flexible system for handling nonresponse in
sample surveys. American Statistician, 58, 298–302.
Rubin, D. B., Stern, H. S., & Vehovar, V. (1995). Handling “don’t know” survey responses: The
case of the Slovenian plebiscite. Journal of the American Statistical Association, 90, 822–828.
Saris, W. E., Satorra, A., & Sörbom, D. (1987). The detection and correction of specification
errors in structural equation models. Sociological Methodology, 17, 105–129.
Sartori, A. E. (2003). An estimator for some binary-outcome selection models without exclusion
restrictions. Political Analysis, 11, 111–138.
Satorra, A., & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance
structure analysis. In ASA 1988 Proceedings of the Business and Economic Statistics Section
(pp. 308–313). Alexandria, VA: American Statistical Association.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399–419). Thousand Oaks, CA: Sage.
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment struc-
ture analysis. Psychometrika, 66, 507–514.
Savalei, V. (2010). Expected versus observed information in SEM with incomplete normal and
nonnormal data. Psychological Methods, 15, 352–367.
Savalei, V. (2014). Understanding robust corrections in structural equation modeling. Structural
Equation Modeling: A Multidisciplinary Journal, 21, 149–160.
Savalei, V., & Bentler, P. M. (2005). A statistically justified pairwise ML method for incomplete
nonnormal data: A comparison with direct ML and pairwise ADF. Structural Equation Mod-
eling: A Multidisciplinary Journal, 12, 183–214.
Savalei, V., & Bentler, P. M. (2009). A two-stage approach to missing data: Theory and application
to auxiliary variables. Structural Equation Modeling: A Multidisciplinary Journal, 16, 477–497.
Savalei, V., & Falk, C. F. (2014). Robust two-stage approach outperforms robust full information
maximum likelihood with incomplete nonnormal data. Structural Equation Modeling: A Mul-
tidisciplinary Journal, 21, 280–302.
Savalei, V., & Rhemtulla, M. (2012). On obtaining estimates of the fraction of missing informa-
tion from full information maximum likelihood. Structural Equation Modeling: A Multidisci-
plinary Journal, 19, 477–494.
Savalei, V., & Rhemtulla, M. (2017). Normal theory two-stage estimator for models with compos-
ites when data are missing at the item level. Journal of Educational and Behavioral Statistics,
42, 405–431.
Savalei, V., & Rosseel, Y. (2021, April 14). Computational options for standard errors and test
statistics with incomplete normal and nonnormal data. Structural Equation Modeling: A Mul-
tidisciplinary Journal. [Epub ahead of print]
Savalei, V., & Yuan, K. H. (2009). On the model-based bootstrap with missing data: Obtaining a
p-value for a test of exact fit. Multivariate Behavioral Research, 44, 741–763.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. New York: Chapman & Hall.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8,
3–15.
Schafer, J. L. (2001). Multiple imputation with PAN. In A. G. Sayer & L. M. Collins (Eds.), New
methods for the analysis of change (pp. 355–377). Washington, DC: American Psychological
Association.
Schafer, J. L. (2003). Multiple imputation in multivariate problems when the imputation and
analysis models differ. Statistica Neerlandica, 57, 19–35.
Schafer, J. L. (2018). Package ‘pan.’ Retrieved from https://cran.r-project.org/web/packages/pan/pan.pdf.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological
Methods, 7, 147–177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data prob-
lems: A data analyst’s perspective. Multivariate Behavioral Research, 33, 545–571.
Schafer, J. L., & Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-
effects models with missing values. Journal of Computational and Graphical Statistics, 11,
437–457.
Scheuren, F. (2005). Multiple imputation: How it began and continues. American Statistician, 59,
315–319.
Schluchter, M. D. (1992). Methods for the analysis of informatively censored longitudinal data.
Statistics in Medicine, 11, 1861–1870.
Schomaker, M., & Heumann, C. (2018). Bootstrap inference when using multiple imputation.
Statistics in Medicine, 37, 2252–2266.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Seaman, S., Galati, J., Jackson, D., & Carlin, J. (2013). What is meant by “missing at random”?
Statistical Science, 28, 257–268.
Seaman, S. R., Bartlett, J. W., & White, I. R. (2012). Multiple imputation of missing covariates
with non-linear effects and interactions: An evaluation of statistical methods. BMC Medical
Research Methodology, 12, 1–13.
Shin, Y. (2013). Efficient handling of predictors and outcomes having missing values. In L. Rut-
kowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assess-
ment data analysis: Background, technical issues, and methods of data analysis (pp. 451–479).
Boca Raton, FL: Chapman & Hall.
Shin, Y., & Raudenbush, S. W. (2007). Just-identified versus overidentified two-level hierarchical
linear models with missing data. Biometrics, 63, 1262–1268.
Shin, Y., & Raudenbush, S. W. (2013). Efficient analysis of Q-level nested hierarchical gen-
eral linear models given ignorable missing data. International Journal of Biostatistics, 9,
109–133.
Shrout, P. E., & Bolger, N. (2002). Mediation in experimental and nonexperimental studies: New
procedures and recommendations. Psychological Methods, 7, 422–445.
Sijtsma, K., & van der Ark, L. A. (2003). Investigation and treatment of missing item scores in
test and questionnaire data. Multivariate Behavioral Research, 38, 505–528.
Silvia, P. J., Kwapil, T. R., Walsh, M. A., & Myin-Germeys, I. (2014). Planned missing-data designs
in experience-sampling research: Monte Carlo simulations of efficient designs for assessing
within-person constructs. Behavior Research Methods, 46, 41–54.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event
occurrence. New York: Oxford University Press.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitu-
dinal, and structural equation models. Boca Raton, FL: Chapman & Hall.
Smith, A. F. M., & Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related
Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statisti-
cal Methodology), 55, 3–23.
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced
multilevel modeling (2nd ed.). Thousand Oaks, CA: Sage.
Sörbom, D. (1989). Model modification. Psychometrika, 54, 371–384.
Sorensen, T., & Vasishth, S. (2015). Bayesian linear mixed models using Stan: A tutorial for psy-
chologists, linguists, and cognitive scientists. Retrieved from https://arxiv.org/abs/1506.06201.
Springer, M. D., & Thompson, W. E. (1966). The distribution of products of independent random
variables. SIAM Journal on Applied Mathematics, 14, 511–526.
Spybrook, J., Bloom, H., Congdon, R., Hill, C., Martinez, A., & Raudenbush, S. W. (2011). Opti-
mal design plus empirical evidence: Documentation for the “Optimal Design” software ver-
sion 3.0. Retrieved from http://hlmsoft.net/od/od-manual-20111016-v300.pdf.
Stapleton, L. M. (2013). Multilevel structural equation modeling with complex sample data. In G.
R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 521–
562). Charlotte, NC: Information Age.
Steiger, J. H. (1989). EZPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL:
SYSTAT.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation
approach. Multivariate Behavioral Research, 25, 173–180.
Steiger, J. H., & Lind, J. C. (1980, May). Statistically-based tests for the number of common factors.
Paper presented at the Annual Meeting of the Psychometric Society, Iowa City, IA.
Sterba, S. K., & Gottfredson, N. C. (2014). Diagnosing global case influence on MAR versus
MNAR model comparisons. Structural Equation Modeling: A Multidisciplinary Journal, 22,
294–307.
Stern, H. (1998). A primer on the Bayesian approach to statistical inference. Stats, 23, 3–9.
Sterne, J. A., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., . . . Carpenter, J.
R. (2009). Multiple imputation for missing data in epidemiological and clinical research:
Potential and pitfalls. British Medical Journal, 338, Article b2393.
Sterner, W. R. (2011). What is missing in counseling research?: Reporting missing data. Journal
of Counseling and Development, 89, 56–62.
Su, Y.-S., Gelman, A. E., Hill, J., & Yajima, M. (2011). Multiple imputation with diagnostics (mi)
in R: Opening windows into the black box. Journal of Statistical Software, 45, 1–31.
Taljaard, M., Donner, A., & Klar, N. (2008). Imputation strategies for missing continuous out-
comes in cluster randomized trials. Biometrical Journal, 50, 329–345.
Thijs, H., Molenberghs, G., Michiels, B., Verbeke, G., & Curran, D. (2002). Strategies to fit
pattern-mixture models. Biostatistics, 3, 245–265.
Thijs, H., Molenberghs, G., & Verbeke, G. (2000). The milk protein trial: Influence analysis of
the dropout process. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 42,
617–646.
Thoemmes, F., & Mohan, K. (2015). Graphical representation of missing data problems. Struc-
tural Equation Modeling: A Multidisciplinary Journal, 22, 631–642.
Thoemmes, F., & Rose, N. (2014). A cautious note on auxiliary variables that can increase bias in
missing data problems. Multivariate Behavioral Research, 49, 443–459.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analy-
sis. Psychometrika, 38, 1–10.
U.S. Census Bureau. (2019). 2018 National Survey of Children’s Health: Analysis with multiply
imputed data. Retrieved from www2.census.gov/programs-surveys/nsch/technical-documentation/methodology/nsch-analysis-with-imputed-data-guide.pdf.
Vach, W. (1994). Logistic regression with missing values in the covariates. Berlin: Springer-Verlag.
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional
specification. Statistical Methods in Medical Research, 16, 219–242.
van Buuren, S. (2010). Item imputation without specifying scale structure. Methodology, 6, 31–36.
van Buuren, S. (2011). Multiple imputation of multilevel data. In J. J. Hox & J. K. Roberts (Eds.),
Handbook of advanced multilevel analysis (pp. 173–196). New York: Routledge.
van Buuren, S. (2012). Flexible imputation of missing data. New York: Chapman & Hall.
van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76, 1049–1064.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45, 1–67.
van Buuren, S., Groothuis-Oudshoorn, K., Vink, G., Schouten, R., Robitzsch, A., Rockenschaub,
P., . . . Arel-Bundock, V. (2021). Package ‘mice.’ Retrieved from https://cran.r-project.org/web/
packages/mice/mice.pdf.
Vandenbroucke, J. P., Von Elm, E., Altman, D. G., Gøtzsche, P. C., Mulrow, C. D., Pocock, S. J., . . .
STROBE Initiative. (2007). Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): Explanation and elaboration. PLoS Medicine, 4, 1628–1654.
van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A
systematic review of Bayesian articles in psychology: The last 25 years. Psychological Meth-
ods, 22, 217–239.
van Ginkel, J. R., Linting, M., Rippe, R. C., & van der Voort, A. (2020). Rebutting existing mis-
conceptions about multiple imputation as a method for handling missing data. Journal of
Personality Assessment, 102, 297–308.
Vera, J. D., & Enders, C. K. (2021). Is item imputation always better?: An investigation of missing
questionnaires in longitudinal growth models. Structural Equation Modeling: A Multidisci-
plinary Journal, 28, 506–517.
Verbeke, G., Lesaffre, E., & Spiessens, B. (2001). The practical use of different strategies to han-
dle dropout in longitudinal studies. Drug Information Journal, 35, 419–434.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York:
Springer-Verlag.
Verbeke, G., Molenberghs, G., Thijs, H., Lesaffre, E., & Kenward, M. G. (2001). Sensitivity analy-
sis for nonrandom dropout: A local influence approach. Biometrics, 57, 7–14.
Vink, G., Frank, L. E., Pannekoek, J., & van Buuren, S. (2014). Predictive mean matching imputa-
tion of semicontinuous variables. Statistica Neerlandica, 68, 61–90.
von Davier, M., Gonzalez, E., & Mislevy, R. (2009). What are plausible values and why are
they useful? IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments,
2, 9–36.
von Hippel, P. T. (2004). Biases in SPSS 12.0 Missing Value Analysis. American Statistician, 58,
160–164.
von Hippel, P. T. (2007). Regression with missing Ys: An improved strategy for analyzing multi-
ply imputed data. Sociological Methodology, 37, 83–117.
von Hippel, P. T. (2009). How to impute interactions, squares, and other transformed variables.
Sociological Methodology, 39, 265–291.
von Hippel, P. T. (2013). Should a normal imputation model be modified to impute skewed vari-
ables? Sociological Methods and Research, 42, 105–138.
von Hippel, P. T. (2020). How many imputations do you need?: A two-stage calculation using a
quadratic rule. Sociological Methods and Research, 49, 699–718.
Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences
between the Akaike information criterion (AIC) and the Bayesian information criterion
(BIC). Psychological Methods, 17, 228–243.
Wagenmakers, E.-J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., . . . Boutin, B. (2018).
Bayesian inference for psychology: Part II. Example applications with JASP. Psychonomic
Bulletin and Review, 25, 58–76.
Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., . . . Epskamp, S. (2018).
Bayesian inference for psychology: Part I. Theoretical advantages and practical ramifica-
tions. Psychonomic Bulletin and Review, 25, 35–57.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number
of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wang, N., & Robins, J. M. (1998). Large-sample theory for parametric multiple imputation pro-
cedures. Biometrika, 85, 935–948.
West, S. G., & Thoemmes, F. (2010). Campbell’s and Rubin’s perspectives on causal inference.
Psychological Methods, 15, 18–37.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test
for heteroskedasticity. Econometrica, 48, 817–838.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50,
1–26.
White, H. (1996). Estimation, inference and specification analysis. New York: Cambridge Univer-
sity Press.
White, I. R., & Carlin, J. B. (2010). Bias and efficiency of multiple imputation compared with
complete-case analysis for missing covariate values. Statistics in Medicine, 29, 2920–2931.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations:
Issues and guidance for practice. Statistics in Medicine, 30, 377–399.
Whittaker, T. A. (2012). Using the modification index and standardized expected parameter
change for model modification. Journal of Experimental Education, 80, 26–44.
Widaman, K. F. (2006). Missing data: What to do with or without them. Monographs of the Society
for Research in Child Development, 71, 42–64.
Widaman, K. F., & Thompson, J. S. (2003). On specifying the null model for incremental fit indi-
ces in structural equation modeling. Psychological Methods, 8, 16–37.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology
journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite
hypotheses. Annals of Mathematical Statistics, 9, 60–62.
Winship, C., & Mare, R. D. (1992). Models for sample selection bias. Annual Review of Sociology,
18, 327–350.
Wirth, R., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future direc-
tions. Psychological Methods, 12, 58–79.
Wood, A. M., White, I. R., & Thompson, S. G. (2004). Are missing outcome data adequately
handled? A review of published randomized controlled trials in major medical journals.
Clinical Trials, 1, 368–376.
Wothke, W. (2000). Longitudinal and multi-group modeling with missing data. In T. D. Little,
K. U. Schnabel, & J. Baumert (Eds.), Modeling longitudinal and multilevel data: Practical
issues, applied approaches, and specific examples (pp. 1–24). Mahwah, NJ: Erlbaum.
Wu, M. C., & Carroll, R. J. (1988). Estimation and comparison of changes in the presence of
informative right censoring by modeling the censoring process. Biometrics, 44, 175–188.
Wu, W., & Jia, F. (2013). A new procedure to test mediation with missing data through non-
parametric bootstrapping and multiple imputation. Multivariate Behavioral Research, 48,
663–691.
Wu, W., Jia, F., & Enders, C. (2015). A comparison of imputation strategies for ordinal missing
data on Likert scale variables. Multivariate Behavioral Research, 50, 484–503.
Wu, W., Jia, F., Rhemtulla, M., & Little, T. D. (2016). Search for efficient complete and planned
missing data designs for analysis of change. Behavior Research Methods, 48, 1047–1061.
Xu, S., & Blozis, S. A. (2011). Sensitivity analysis of mixed models for incomplete longitudinal
data. Journal of Educational and Behavioral Statistics, 36, 237–256.
Yeo, I. K., & Johnson, R. A. (2000). A new family of power transformations to improve normality
or symmetry. Biometrika, 87, 954–959.
Yuan, K.-H. (2009a). Identifying variables responsible for data not missing at random. Psy-
chometrika, 74, 233–256.
Yuan, K.-H. (2009b). Normal distribution based pseudo ML for missing data: With applications
to mean and covariance structure analysis. Journal of Multivariate Analysis, 100, 1900–1918.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance
structure analysis with nonnormal missing data. Sociological Methodology, 30, 165–200.
Yuan, K.-H., & Bentler, P. M. (2010). Consistency of normal distribution based pseudo maximum
likelihood estimates when data are missing at random. American Statistician, 64, 263–267.
Yuan, K.-H., Bentler, P. M., & Zhang, W. (2005). The effect of skewness and kurtosis on mean and
covariance structure analysis: The univariate case and its multivariate implication. Socio-
logical Methods and Research, 34, 240–258.
Yuan, K.-H., & Hayashi, K. (2006). Standard errors in covariance structure models: Asymptot-
ics versus bootstrap. British Journal of Mathematical and Statistical Psychology, 59, 397–417.
Yuan, K.-H., Lambert, P. L., & Fouladi, R. T. (2004). Mardia’s multivariate kurtosis with missing
data. Multivariate Behavioral Research, 39, 413–437.
Yuan, K.-H., Tong, X., & Zhang, Z. (2014). Bias and efficiency for SEM with missing data and aux-
iliary variables: Two-stage robust method versus two-stage ML. Structural Equation Model-
ing: A Multidisciplinary Journal, 22, 178–192.
Yuan, K.-H., Yang-Wallentin, F., & Bentler, P. M. (2012). ML versus MI for missing data with vio-
lation of distribution conditions. Sociological Methods and Research, 41, 598–629.
Yuan, K.-H., & Zhang, Z. Y. (2012). Robust structural equation modeling with missing data and
auxiliary variables. Psychometrika, 77, 803–826.
Yuan, Y., & MacKinnon, D. P. (2009). Bayesian mediation analysis. Psychological Methods, 14,
301–322.
Yucel, R. M. (2008). Multiple imputation inference for multivariate multilevel continuous data
with ignorable non-response. Philosophical Transactions of the Royal Society A: Mathematical
and Physical Sciences, 366, 2389–2403.
Yucel, R. M. (2011). Random-covariances and mixed-effects models for imputing multivariate
multilevel continuous data. Statistical Modelling, 11, 351–370.
Yucel, R. M., He, Y., & Zaslavsky, A. M. (2008). Using calibration to improve rounding in imputa-
tion. American Statistician, 62, 1–5.
Yucel, R. M., He, Y., & Zaslavsky, A. M. (2011). Gaussian-based routines to impute categorical
variables in health surveys. Statistics in Medicine, 30, 3447–3460.
Zellner, A., & Min, C.-K. (1995). Gibbs sampler convergence criteria. Journal of the American
Statistical Association, 90, 921–927.
Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psycho-
logical Methods, 22, 649–666.
Zhang, X., Boscardin, W. J., & Belin, T. R. (2008). Bayesian analysis of multivariate nominal
measures using multivariate multinomial probit models. Computational Statistics and Data
Analysis, 52, 3697–3708.
Zhang, Z., & Wang, L. (2012). A note on the robustness of a full Bayesian method for nonignor-
able missing data analysis. Brazilian Journal of Probability and Statistics, 26, 244–264.
Zhang, Z., & Wang, L. (2013). Methods for mediation analysis with missing data. Psychometrika,
78, 154–184.
Zhang, Z., Wang, L., & Tong, X. (2015). Mediation analysis with missing data through multiple
imputation and bootstrap. In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, &
S.-M. Chow (Eds.), Quantitative psychology research: The 79th meeting of the Psychometric
Society, Madison, Wisconsin, 2014 (pp. 341–355). New York: Springer.
Author Index

A

Abrams, K., 148
Agresti, A., 90, 95, 120, 121, 224, 256
Aiken, L. S., 125, 127, 128, 199, 203, 211, 289, 314, 463
Aitchison, J., 244
Aitkin, M., 112
Akaike, H., 358
Ake, C. F., 266
Alacam, E., 441, 442, 469
Albert, J. H., 2, 50, 90, 120, 121, 155, 211, 222, 224, 235, 236, 237, 243, 245, 255, 256, 260
Albert, P. S., 379, 381, 382, 384, 483
Allison, P. D., 222, 266
Altman, D. G., 406, 475
Anderson, D., 359
Anderson, T. W., 98
Andrews, M., 147, 185, 186
Andridge, R. R., 331, 332
Arbuckle, J. L., 98, 115, 116, 118, 120, 146
Arminger, G., 67, 124
Arnold, B. C., 116, 195, 242, 275, 276
Ashby, D., 148
Asparouhov, T., 99, 116, 182, 183, 217, 220, 222, 242, 253, 256, 257, 260, 263, 286, 295, 297, 312, 323, 331, 332, 344, 359, 400, 433, 439, 462, 464, 469, 473

B

Baglietto, L., 474
Baguley, T., 147, 185, 186
Baldwin, S. A., 359, 385
Baltes, P. B., 41
Baraldi, A. N., 99
Barnard, J., 183, 286, 292, 295, 311, 319, 323, 329, 335, 337, 407, 449, 452, 457
Baron, R. M., 422
Bartlett, J. W., 99, 188, 197, 199, 202, 253, 262, 275, 276, 288, 290, 299, 474
Bates, D., 480
Bauer, D. J., 203, 359, 385
Baum, L. E., 112
Beale, E. M. L., 98, 112
Beaton, A. E., 439
Belin, T. R., 127, 221, 222, 245, 263
Benichou, J., 79
Bennett, J. A., 244
Bentler, P. M., 14, 15, 67, 81, 82, 89, 109, 110, 112, 124, 125, 133, 135, 136, 137, 139, 344, 345, 430, 431, 432, 435, 436, 437, 473
Beran, R., 69, 81, 83
Bernaards, C. A., 222
Beunckens, C., 348, 358, 360, 385, 481
Birnbaum, A., 95, 96, 143
Black, A. C., 331
Blackwell, M., 474
Blozis, S. A., 379
Bock, R. D., 112
Bodner, T. E., 24, 268, 279, 285, 404, 474
Bohr, Y., 428
Bohrnstedt, G. W., 127, 289
Bojinov, I. I., 13, 14
Bolger, N., 423
Bollen, K. A., 69, 81, 83, 137, 160, 192, 433, 437, 460, 461
Bonett, D. G., 432
Bonneville, C. T., 474
Boscardin, W. J., 245
Bosker, R. J., 301, 312
Box, G. E., 412
Bradbury, T., 349
Brand, J. P. L., 262, 300
Brandmaier, A. M., 40, 466, 467, 469
Breitbart, W., 99
Brick, T. R., 99
Brooks, S. P., 172
Brosseau-Liard, P. É., 67
Browne, M. W., 242, 432, 437
Browne, W. J., 188, 253, 309
Bryk, A. S., 112, 301, 311, 312, 319
Buck, S. F., 27
Bürkner, P.-C., 474
Burnham, K., 359
Burton, A., 475
Buse, A., 79, 97, 293

C

Cai, L., 67, 112, 115, 136, 296, 433, 436, 437, 438, 439
Carlin, B. P., 172
Carlin, J. B., 4, 25, 35, 36, 205, 269, 278, 411, 412, 474, 484
Carpenter, J. R., 116, 188, 222, 223, 242, 245, 253, 262, 263, 274, 299, 301, 331, 332, 334, 336, 337, 340, 347, 474, 484
Carroll, R. J., 384, 483
Casella, G., 147, 148, 158, 159, 172, 177, 183, 186
Castillo, E., 116, 275
Cham, H., 99
Chen, D., 425
Chen, F. N., 160, 183
Chen, H. Y., 14
Chen, M. H., 99, 221, 236, 352, 435, 446
Chen, N. H., 213
Cheong, Y., 311
Chib, S., 2, 50, 155, 222, 235, 236, 237, 245, 256, 260
Chung, S., 296, 433, 438
Chung, Y., 156
Clark, W. S., 31
Cobb, L., 213
Coffman, D. L., 359
Cohen, J., 15, 20, 21, 22, 125, 128, 199, 211, 371, 479
Cohen, P., 125
Cole, S. R., 474
Collins, L. M., 16, 18, 19, 20, 22, 31, 46, 133, 214, 264, 287, 288, 292, 436, 469, 479
Congdon, R., 311
Cook, R. J., 31
Coste, J., 27
Cowles, M. K., 172, 236, 237, 260, 435, 446
Cox, D. R., 412
Coxe, S., 463
Craig, C. C., 126, 289
Crowther, M. J., 160
Cudeck, R., 432
Cumsille, P. E., 2, 37, 46, 135, 469
Curran, D., 388
Curran, P. J., 160, 203, 460

D

Darnieder, W. F., 183
David, S. R., 31
de Leeuw, J., 314
Demirtas, H., 205, 266, 269, 361, 373, 381, 384, 387, 388, 389, 394, 411, 458
Dempster, A. P., 70, 98, 112, 114, 115, 344
Depaoli, S., 147, 187, 192, 217
Diggle, P. J., 382, 388, 390, 400, 483
DiStefano, C., 429, 437
Dixon, W. J., 15
Donner, A., 331
Dorie, V., 156
Draper, D., 309
Du, H., 188, 316, 346, 349, 352, 358, 441, 469, 473
du Toit, S. H. C., 429
Duncan, S. C., 41
Duncan, T. E., 41
Dunson, D. B., 192, 222, 256
Dyklevych, O., 245
Dziak, J. J., 359

E

Edgett, G. L., 98
Edwards, M. C., 121, 122, 240, 265, 428
Eekhout, I., 443, 446, 447
Efron, B., 29, 30, 67, 69, 423, 425
Enders, C. K., 21, 24, 27, 39, 40, 67, 69, 81, 99, 121, 124, 125, 126, 127, 147, 188, 194, 197, 199, 201, 202, 204, 207, 223, 245, 253, 256, 257, 262, 264, 273, 274, 276, 279, 282, 288, 289, 290, 293, 295, 297, 298, 301, 305, 306, 309, 314, 316, 318, 321, 323, 324, 331, 332, 336, 337, 338, 340, 342, 343, 344, 346, 349, 373, 379, 382, 400, 402, 406, 413, 432, 433, 436, 438, 440, 441, 447, 451, 456, 462, 467, 469, 473, 474, 480, 481
English, D. R., 474
Epskamp, S., 187
Erler, N. S., 116, 188, 193, 199, 202, 288, 290, 301, 304, 316, 343
Errington, D., 148
Estabrook, R., 183

F

Falk, C. F., 67, 124, 136, 432
Fears, T. R., 79
Feldman, B. J., 379, 381
Finch, H., 446
Finch, J. F., 67
Finkeiner, C., 98
Finney, S. J., 429, 437
Fisher, R. A., 406
Fitzsimmons, E., 474
Follmann, D. A., 379, 381, 382, 384, 483
Fouladi, R. T., 477
Franco, O. H., 116
Frank, L. E., 278
Freedman, D. A., 67
Freels, S. A., 205, 269
Fritz, M. S., 425
Frühwirth-Schnatter, S., 222, 256
Frühwirth, R., 222, 256

G

Gabry, J., 474
Gail, M. H., 79
Galati, J., 4
Garfinkel, P. E., 428
Garner, D. M., 428
Garnier-Villarreal, M., 474
Gelfand, A. E., 159
Gelman, A., 117, 147, 156, 159, 172, 177, 178, 179, 183, 198, 203, 209, 211, 213, 216, 226, 232, 236, 238, 243, 244, 250, 255, 259, 267, 268, 269, 276, 279, 292, 301, 311, 312, 318, 322, 328, 334, 362, 389, 404, 410, 415, 417, 419, 425, 435, 445, 449, 451, 456, 464, 480, 481
Genz, A., 248, 252
George, E. I., 148, 158, 159, 186
Geweke, J., 172
Geyer, C. J., 172
Ghisletta, P., 40, 469
Gibbons, R. D., 361, 368, 373, 374, 379, 385, 389, 393, 400, 458, 483
Gilks, W. R., 207, 236
Gilreath, T. D., 268
Glynn, R. J., 25, 35
Gold, M. S., 67
Goldberger, A. S., 127, 289
Goldstein, H., 188, 194, 197, 201, 222, 223, 245, 253, 262, 301, 305, 309, 331, 332, 334, 337, 412
Gomer, K., 11, 349, 358, 468, 482
Gong, G., 30, 67, 69, 423
Gonzalez, E., 439
Gonzalez, R., 436
Goodrich, B., 474
Gottfredson, N. C., 359, 360, 362, 366, 385, 389, 391, 396, 400, 457, 459, 461, 462
Gottschall, A. C., 290, 402, 440, 446, 447
Gould, A. L., 31
Gourieroux, C., 67
Graham, J. W., 1, 2, 8, 19, 20, 23, 27, 37, 38, 39, 40, 41, 42, 43, 46, 133, 134, 135, 140, 146, 214, 268, 279, 290, 332, 402, 404, 409, 425, 431, 440, 466, 469, 478, 481
Greene, W. H., 67, 79, 95, 97
Griffin, D., 436
Grimm, K. J., 156, 183, 460
Grizzle, J. E., 37, 38, 39
Groothuis-Oudshoorn, C. G. M., 262, 300
Groothuis-Oudshoorn, K., 272, 273, 278, 412, 474
Grund, S., 202, 276, 295, 297, 298, 316, 318, 324, 338, 342, 343, 347, 439, 474, 480
Grusky, O., 263
Guo, J., 147, 474

H

Hamaker, E. L., 301, 314, 321
Hancock, G. R., 37, 39, 67, 69
Hardt, J., 23, 288
Harel, O., 268, 279, 285, 331, 404
Hartley, H. O., 98
Hasselblad, V., 14
Hastings, W. K., 207, 236
Hayashi, K., 89
Hayes, A. F., 67, 422
Hayes, T., 316, 318, 336, 337, 343, 344, 473
He, Y. L., 222, 411
Heckman, J. J., 352, 356
Hedberg, E. C., 302
Hedeker, D., 266, 361, 368, 373, 374, 379, 385, 389, 393, 400, 411, 458, 483
Hedges, L. V., 302
Held, L., 222, 256
Heron, J., 35, 46
Herke, M., 23
Herring, A. H., 99
Heumann, C., 427
Hill, J., 117, 269, 301
Ho, M. H., 432
Hops, H., 41
Horton, N. J., 222, 266
Hothorn, T., 248
Houts, C. R., 121
Howard, W. J., 19, 135, 288, 362, 484
Hox, J. J., 301
Hoyle, R., 469
Hu, L. T., 432, 436
Hu, M. Y., 263
Hubbard, A., 282, 406
Hughes, R., 35, 46
Huisman, M., 27, 440
Hunter, A. M., 359, 400

I

Ibrahim, J. G., 99, 115, 116, 120, 122, 123, 127, 192, 193, 221, 352, 358, 367
Imai, K., 245

J

Jackman, S., 148, 158, 159, 166, 187
Jackson, C., 183
Jackson, D., 4
Jackson, K. M., 457
Jaddoe, V. W., 116
Jalal, S., 14, 98
Jamil, T., 187
Jamshidian, M., 14, 98, 112
Jansen, I., 358, 382
Jeffreys, H., 157, 167
Jelicic, H., 24, 474
Jermiin, L. S., 359
Jia, F., 40, 274, 427, 428, 469
Jinadasa, K., 2
Hocking, R. R., 98 Joe, H., 438
Hofer, S. M., 2 Johnson, E. G., 37
Hoff, P. D., 147, 159, 161, 168, 172, 177, 181, 184, Johnson, R. A., 170, 205, 272, 408, 413, 414,
228 478
Hoffman, J. M., 423 Johnson, V. E., 2, 50, 90, 120, 121, 155, 172, 211,
Hogan, J. W., 368 224, 236, 243, 255, 256, 260
Holder, R. L., 406 Jöreskog, K. G., 428, 437, 473
Hollis, M., 14 Jorgensen, T. D., 433, 436, 474
Holmes, C. C., 222, 256 Jose, P. E., 422
Honaker, J., 474 Judd, C. M., 422
K Lanza, S. T., 359


Lee, D., 147
Kaizer, E. E., 316 Lee, K. J., 205, 269, 278, 411, 412
Kam, C. M., 16, 46 Lee, M. D., 148
Kaplan, B., 439 Lee, S. Y., 14, 345
Kaplan, D., 14, 137, 147, 159, 172, 177, 192, 217, Lee, T., 136, 296, 433, 436, 437, 438, 439
433, 437 Leonhart, R., 23
Karahalios, A., 474 Lepkowski, J. M., 262
Karney, B., 349 Leplege, A., 27
Kasim, R. M., 309 Lerner, R. M., 24
Kass, R. E., 156, 157 Lesaffre, E., 116, 360, 365
Keller, B. T., 147, 188, 204, 223, 245, 256, 257, Leuchter, A. F., 359, 400
273, 276, 279, 301, 316, 318, 321, 323, 324, Levin, K. A., 223
331, 332, 338, 340, 342, 343, 346, 349, 382, Levy, R., 147, 149, 159, 177, 207, 223, 473
413, 438, 441, 445, 449, 450, 451, 452, 456, Lewis, C., 432
462, 469, 473, 480, 481 Lewis, S. M., 172
Kelley, K., 301, 332 Li, K. H., 294, 295, 297, 298, 405, 436, 439
Kenny, D. A., 422 Li, R., 359
Kenward, M. G., 4, 13, 65, 66, 111, 112, 116, 222, Liang, J., 112, 344, 345
223, 242, 245, 253, 262, 263, 301, 331, 332, Lichtenberg, P. A., 15
334, 337, 340, 358, 360, 368, 382, 388, 390, Liddell, T. M., 148
400, 483, 484 Lind, J. C., 432
Keogh, R., 474 Linting, M., 483
Kim, J. K., 284 Lipsitz, S. R., 99, 115, 116, 120, 122, 123, 127,
Kim, K. H., 14, 15 221, 222, 352
Kim, S., 127, 188, 199, 202, 221, 276, 288, 289, Little, R. J. A., 1, 2, 3, 4, 10, 14, 15, 16, 25, 35, 45,
290 98, 100, 112, 114, 115, 189, 261, 278, 279, 299,
King, G., 69, 474 343, 344, 355, 368, 371, 372, 374, 379, 381,
Kinney, S. K., 331 382, 384, 400, 471
Kirby, J., 160 Little, T. D., 19, 37, 38, 39, 40, 41, 43, 469
Klar, N., 331 Liu, C., 275, 411
Klebanoff, M. A., 474 Liu, G., 31
Kleinke, K., 412 Liu, H. Y., 156
Kline, R. B., 433, 461 Liu, J. C., 116, 156, 195, 202, 242, 276
Koppstein, P., 213 Liu, M., 67, 69
Kopylov, I., 149 Liu, Y., 293, 295, 297, 298, 438, 439
Kreft, I. G., 314, 321 Lockwood, C. M., 423
Krishnan, T., 112, 115 Lomnicki, Z. A., 126, 289
Kropko, J., 117 Longford, N., 319
Kruschke, J. K., 148 Lord, F. M., 37, 98
Kunkle, D., 316 Louis, T. A., 115
Kwapil, T. R., 37 Love, J., 147, 187
Lüdtke, O., 99, 115, 116, 120, 121, 122, 125, 126,
127, 144, 146, 170, 188, 193, 197, 199, 202,
L 205, 206, 207, 212, 221, 272, 276, 288, 289,
290, 295, 314, 316, 319, 321, 324, 331, 332,
Laird, N. M., 25, 35, 70, 368 338, 347, 352, 408, 409, 411, 413, 469, 474,
Lambert, P. L., 477 478
Lunn, D., 183 Muthén, B., 14, 98, 99, 112, 115, 116, 117, 120,
Ly, A., 187 122, 123, 147, 160, 182, 183, 217, 220, 222,
Lynch, S. M., 147, 157, 159, 161, 168, 172, 177, 226, 242, 253, 256, 257, 260, 263, 286, 295,
187, 209, 211, 228, 236, 243, 255, 309 297, 301, 312, 314, 321, 323, 331, 332, 344,
345, 359, 364, 379, 381, 382, 385, 391, 400,
422, 425, 429, 433, 436, 437, 439, 462, 464,
M 465, 469, 473, 481
Muthén, L. K., 99, 147, 160, 256, 382, 436, 462,
MacCallum, R., 432, 433, 436 465, 473, 481
MacKinnon, D. P., 2, 67, 350, 422, 423, 425, 428 Myin-Germeys, I., 37
Madley-Dowd, P., 35, 46 Mykland, P., 172
Magnus, J. R., 88, 89
Mallinckrodt, C. H., 31, 385
Manly, C. A., 475, 483, 484 N
Mansolf, M., 297, 433, 436
Marcoulides, G. A., 14, 15, 103, 373 Nandram, B., 236, 435, 446
Mare, R. D., 352 Neale, M. C., 99, 344, 345, 380
Marshall, A., 406 Necowitz, L. B., 433
Marsman, M., 148, 187 Neelon, B., 462
Masyn, K., 381 Nesselroade, J. R., 41
Matz, A. W., 213 Neudecker, H., 88, 89
Maydeu-Olivares, A., 438 Nicholson, J. S., 484
Mazza, G. L., 27, 440, 443 Nielsen, S. F., 284
McCoach, D. B., 331
McCulloch, R., 183, 245
McDonald, R. P., 432 O
McKelvey, R. D., 232, 239
McLachlan, G. J., 112, 115 O’Brien, S. M., 222, 256
McNeish, D., 183, 221, 301, 332 O’Hagan, A., 149
Mealli, F., 1, 4, 13, 45 Okiishi, J. C., 359
Mehta, P. D., 344, 345, 380 Olchowski, A. E., 2, 46, 268
Meng, X. L., 115, 183, 287, 294, 296, 297, 298, Olkin, I., 263
405, 436, 439 Olmsted, M. P., 428
Merkle, E. C., 183, 185, 192, 220, 474 Olsen, M. K., 262, 263, 267, 268, 299, 412
Mi, X., 248, 252 Olsson, U., 438
Micceri, T., 67 Orchard, T., 98, 112
Michiels, B., 388
Min, C. K., 172
Mislevy, R. J., 147, 149, 159, 177, 439 P
Mistler, S. A., 40, 331, 332, 340, 346, 466
Miwa, T., 248 Palomo, J., 192
Moerbeek, M., 301 Pan, Q., 285
Mohan, K., 6 Pannekoek, J., 278
Molenberghs, G., 13, 31, 65, 66, 111, 112, 301, Park, T., 14
348, 358, 360, 368, 379, 382, 385, 388 Parzen, M., 222
Monfort, A., 67 Paulson, D., 15
Montgomery, D. C., 37 Pawitan, Y., 79
Morris, T. P., 160 Paxton, P., 160
Moustaki, I., 428 Pearl, J., 6
Petrie, T., 112 Robert, C. P., 147, 159, 172, 177, 229
Peugh, J. L., 24, 474 Roberts, G. O., 159
Peyre, H., 27, 440, 446 Roberts, M. E., 69
Phelps, E., 24 Robins, J. M., 284
Pickles, A., 122 Robitzsch, A., 99, 116, 125, 127, 146, 202, 221,
Pieper, K. S., 14 276, 295, 316, 324, 338, 347, 413, 469, 474,
Pillai, N. S., 13 478, 480
Plummer, M., 474 Rockwood, N. J., 474
Polson, N. G., 222, 245, 256, 257, 260, 263, 462, Rose, N., 20, 23, 142, 217, 288, 362
464 Rosenfeld, B., 99
Poon, W. Y., 345 Rosseel, Y., 119, 146, 183, 185, 192, 220, 436,
Pornprasertmanit, S., 436 474
Potthoff, R. F., 14 Rossi, P. E., 245
Press, S. J., 275 Roth, P. L., 27, 440
Pritikin, J. N., 99, 117, 120, 122 Roy, J., 385
Puhani, P. A., 352, 355 Royston, P., 289, 406, 484
Roznowski, M., 433
Rubin, D. B., 1, 2, 3, 4, 5, 7, 10, 12, 13, 19, 20, 37,
Q 45, 46, 70, 100, 112, 114, 115, 146, 172, 177,
198, 203, 213, 216, 232, 238, 244, 250, 255,
Quartagno, M., 245, 263, 274, 301, 332, 336, 337, 259, 261, 262, 267, 268, 279, 282, 284, 285,
347, 474 286, 288, 292, 293, 294, 295, 296, 297, 299,
300, 311, 318, 319, 322, 323, 328, 329, 334,
335, 337, 342, 344, 355, 362, 371, 373, 389,
R 401, 404, 407, 409, 410, 415, 417, 419, 425,
427, 435, 436, 439, 445, 449, 451, 452, 456,
R Core Team, 474 457, 461, 462, 464, 471, 480, 481
Rabe-Hesketh, S., 122, 128, 132, 156, 344, 379, Ruehlman, L. S., 27
381 Ryan, O., 147, 187
Raftery, A. E., 172, 359, 365, 391, 396
Raghunathan, T. E., 37, 38, 39, 262, 272, 275,
276, 284, 286, 294, 295, 297, 298, 331, 405, S
411
Ram, N., 183 Sarabia, J. M., 116, 275
Rao, C. R., 79, 124 Saris, W. E., 79, 124
Raudenbush, S. W., 112, 301, 309, 311, 312, 319, Sartori, A. E., 355
335, 343, 344, 345 Satorra, A., 79, 81, 82, 125, 430, 431, 435, 436
Raykov, T., 4, 14, 15, 19, 20, 22, 103, 288, 373 Savalei, V., 13, 38, 65, 66, 67, 69, 81, 89, 97,
Reiter, J. P., 284, 286, 294, 295, 297, 298, 331, 109, 110, 111, 112, 119, 124, 133, 135,
332 136, 137, 139, 146, 285, 430, 432, 435, 437,
Reshetnyak, E., 99 447
Rhemtulla, M., 19, 37, 38, 39, 40, 41, 43, 67, 85, Sayer, A., 469
99, 136, 180, 217, 285, 403, 426, 430, 431, 432, Schafer, J. L., 1, 2, 14, 16, 19, 20, 23, 27, 31, 41,
435, 438, 447, 469 46, 114, 116, 133, 214, 217, 222, 262, 263, 267,
Richardson, S., 207 268, 282, 287, 288, 294, 297, 299, 309, 332,
Rights, J. D., 312, 329, 465 334, 337, 347, 361, 373, 381, 384, 387, 388,
Rippe, R. C., 483 389, 394, 406, 409, 412, 440, 458, 474
Ritter, C., 172 Scheuren, F., 261, 300
Rizopoulos, D., 116 Schluchter, M. D., 381
Schoemann, A. M., 436 T


Schomaker, M., 427
Schwarz, G., 358 Taljaard, M., 331
Scott, J. G., 222 Tanner, M. A., 172
Seaman, S. R., 4, 13, 125, 127, 188, 202, 276, Tate, R. F., 263
289, 299 Taylor, A. B., 425
Shavelson, R. J., 433 Taylor, B. J., 2, 37, 46, 469
Shedden, K., 112 Thijs, H., 348, 360, 388
Sheehan, K. M., 439 Thoemmes, F., 6, 20, 23, 37, 142, 217, 288, 362
Sheets, V., 423 Thomas, A., 183
Shevock, A. E., 135 Thompson, J. S., 432
Shin, Y., 301, 343, 344 Thompson, S. G., 24
Shrout, P. E., 423 Thompson, W. E., 126, 289
Sijtsma, K., 27, 440 Tian, J., 6
Silverman, R. D., 301 Tibshirani, R. J., 30, 67, 69, 423
Silvia, P. J., 37 Tierney, L., 172
Simpson, J. A., 474 Tilling, K., 35, 46
Singer, J. D., 381 Tofighi, D., 314, 321
Skrondal, A., 122, 128, 132, 344 Tong, X., 124, 427
Smith, A. F. M., 159 Tracy, D., 2
Snijders, T. A. B., 301, 312 Trognon, A., 67
Sobel, M. E., 67, 124 Tucker, L. R., 432
Solenberger, P., 262 Tudor, G. E., 14
Sörbom, D., 79, 124, 432, 436, 473
Sorensen, T., 148
Sotto, C., 358 U
Soules, G., 112
Spiegelhalter, D. J., 183, 207 Uanhoro, J., 474
Spiessens, B., 365 U.S. Census Bureau, 404
Spisic, D., 429
Spratt, M., 484
Springer, M. D., 126, 289 V
Spybrook, J., 302
Sriutaisuk, S., 297, 438, 439 Vach, W., 25
Srivastava, M. S., 69, 81, 83 van Buuren, S., 14, 25, 41, 161, 169, 261, 262,
Stapleton, L. M., 301, 344 263, 269, 272, 273, 274, 275, 276, 277, 278,
Steiger, J. H., 432 279, 282, 289, 290, 294, 300, 301, 331, 332,
Sterba, S. K., 312, 329, 360, 362, 366, 389, 391, 338, 347, 371, 406, 412, 446, 474, 475
396, 400, 457, 465 van de Schoot, R., 147, 185, 187, 301
Stern, H. S., 148, 409 van der Ark, L. A., 27, 440
Sterne, J. A., 474, 475, 483, 484 van der Voort, A., 483
Sterner, W. R., 475 van Dyk, D. A., 245
Stine, R. A., 69, 81, 83 van Ginkel, J. R., 483
Su, Y. S., 117, 269, 412, 413 Van Hoewyk, J., 262
Sugar, C. A., 127, 221 Vandenbroucke, J. P., 475
Suh, E. B., 382 Vasishth, S., 148
Switzer, D. M., 27 Vehovar, V., 409
Switzer, F. S., 27 Vera, J. D., 447
Verbeke, G., 301, 348, 358, 360, 365, 379, 385, Wirth, R. J., 121, 122, 428
388 Wood, A. M., 24, 30, 406, 474
Verhagen, J., 187 Woodbury, M. A., 98, 112
Vink, G., 278, 412 Wothke, W., 98, 115
von Davier, M., 439 Wu, M. C., 384, 483
von Hippel, P. T., 25, 27, 28, 35, 112, 115, 126, Wu, W., 40, 274, 427, 428, 466, 469
127, 189, 202, 205, 268, 269, 279, 289, 343,
404, 411, 412, 413, 459
von Oertzen, T., 40, 469 X
Vrieze, S. I., 359
Xi, N., 121
Xu, S., 379
W

Wagenmakers, E.-J., 147, 148, 187 Y


Wald, A., 79, 124, 293
Walsh, M. A., 37 Yajima, M., 269
Wang, L., 99, 121, 125, 126, 127, 188, 199, 202, Yang-Wallentin, F., 67
221, 262, 276, 288, 289, 290, 300, 349, 358, Yeo, I. K., 170, 205, 272, 408, 413, 414, 478
427 Yi, G. Y., 31
Wang, N., 284 Young, A. S., 263
Wang, S. A., 382 Yu, B., 172
Wasserman, L., 156, 157 Yuan, K. H., 11, 14, 67, 69, 81, 82, 89, 110, 124,
Weber, S., 474 133, 136, 137, 205, 269, 349, 358, 411, 432,
Wei, R., 285 468, 477, 482
Weiss, N., 112 Yuan, Y., 423
Wells, R. S., 475, 483, 484 Yucel, R. M., 205, 222, 263, 269, 301, 309, 331,
West, B. T., 19, 20, 22, 288 332, 333, 334, 335, 336, 337, 347
West, S. G., 37, 67, 99, 116, 125, 127, 128, 146,
199, 203, 211, 221, 289, 380, 423, 440, 463,
469, 478 Z
White, H., 67, 69
White, I. R., 24, 25, 35, 36, 99, 160, 188, 299, Zaslavsky, A. M., 222
406, 484 Zavoina, W., 232, 239
Whittaker, T. A., 433 Zellner, A., 172
Widaman, K. F., 106, 432 Zeng, L., 31
Wilkinson and Task Force on Statistical Zhang, Q., 125, 126, 127, 199, 202, 221, 262,
Inference, 24, 475 276, 288, 289, 290, 300
Wilks, S. S., 79, 124 Zhang, W., 67
Willett, J. B., 381 Zhang, X., 245
Williams, J., 423 Zhang, Z., 99, 121, 124, 156, 188, 349, 358, 427,
Windle, J., 222 432
Winship, C., 352 Zheng, X., 344
Winter, S. D., 147, 187 Zondervan-Zwijnenburg, M., 147, 187
Subject Index

Note. f or t following a page number indicates a figure or a table.

Agnostic imputation strategy. See also Multiple overview, 17–20, 18f


imputation preparing for missing data handling example,
item-level missing data and, 440, 446–449, 21–23, 22t, 23t
447t reporting results from a missing data
longitudinal data analyses and, 459–460 analysis, 476, 479–480
multilevel missing data and, 331 saturated correlates model and, 134–135,
overview, 262–263, 346 134f, 135f, 136f
Akaike information criterion (AIC) structural equation modeling framework and,
longitudinal data analyses examples, 388– 431
399, 392t, 393f, 395f, 397f, 398t two-stage estimation and, 136–139, 138f
missing not at random (MNAR) mechanism Available-case analysis, 388. See also Pairwise
and, 358–360 deletion
selection model analysis examples, 361–367, Average relative increase in variance, 294
363t, 364t, 366f Averaging the available items, 440
Analytic solution, 55, 58–60, 59f
Arithmetic mean imputation, 25–27, 26f, 31–36,
32t, 33f, 34f, 35t, 36t B
Auxiliary variables
Bayesian estimation with missing data and, Bayes’ theorem, 154–155
214–217, 216t Bayesian estimation. See also Bayesian
extra dependent variable model and, 136, 137f estimation for categorical variables;
factored regression models and, 139–140, 140f Bayesian estimation with missing data
inclusive analysis strategy and, 19–20 assessing convergence of the Gibbs sampler,
item-level missing data and, 448 171–180, 173f, 174f, 175f, 176f, 177f, 178f,
maximum likelihood estimates with missing 180t
data, 132–142, 134f, 135f, 136f, 137f, 138f, joint model imputation and, 266
140f, 142t linear regression and, 166–171, 169f, 171t

530 Subject Index

Bayesian estimation (cont.) Metropolis–Hastings algorithm and, 207–211,


MCMC estimation with the Gibbs sampler, 208f, 210f
159–160 multilevel missing data and, 309–313, 312t,
mediation and indirect effects and, 423–425, 313f, 318–320, 319t, 320t, 328–329, 329t,
424f 330f
missing at random (MAR) mechanisms and, multivariate normal data and, 217–220, 220t
12–13 overview, 188–189, 221
multiple imputation and, 287–288 random intercept regression models and,
multivariate normal data and, 180–185, 182f, 302–313, 303f, 307f, 312t, 313f
186t reporting results from a missing data
non-normality and, 408 analysis, 476, 477–478, 483
overview, 23–24, 45–46, 145, 147–154, 150f, Bayesian information criterion (BIC)
152f, 153f, 185–186, 261 longitudinal data analyses examples, 388–
univariate normal distribution and, 155–159, 399, 392f, 393f, 395f, 397f, 398t
158f missing not at random (MNAR) mechanism
using MCMC to estimate the mean and and, 358–360
variance, 160–165, 161f, 162f, 163t, 164f, selection model analysis examples, 361–367,
165f 363t, 364t, 366f
Bayesian estimation for categorical variables. See Bernoulli distribution, 48
also Bayesian estimation Beta distributions, 411
binary and ordinal predictor variables and, Between-cluster model
239–244, 241f, 244t joint model imputation and, 332–334, 336
latent response formulation and, 223–226, overview, 303
224f, 225f, 244–248, 247f random coefficient models and, 314–315
logistic regression and, 256–259, 259t Between-imputation interval, 267, 283–284,
nominal predictor variables and, 252–256, 294
256t Bias. See also Nonresponse bias
overview, 222–223, 260 computer simulations comparing missing
regression with a binary outcome and, 226– data methods and, 31–33
232, 227f, 230f, 231f, 232t distribution of missing values and, 202
regression with a nominal outcome and, fully conditional specification and, 340
248–252, 251f maximum likelihood estimates and, 99, 103,
regression with an ordinal outcome and, 344
232–239, 233f, 235f, 238f, 239t mediation and indirect effects and, 425–426,
Bayesian estimation with missing data. See also 426t
Bayesian estimation missing at random (MAR) mechanisms and,
auxiliary variables and, 214–217, 216t 35
choosing a missing data-handling procedure, missing not at random (MNAR) mechanism
470–473, 472f and, 36–37, 214, 348
curvilinear effects and, 211–213, 213t, 214f MNAR-by-omission and, 22
imputing an incomplete outcome variable, multiple imputation and, 331–332, 333
189–191, 191f, 192f multivariate normal data and, 125, 126, 127
inspecting imputations and, 204–206, 205f, overview, 20
206f passive imputation and, 288–289
interaction effects and, 199–204, 203t, 204f, prior distributions and, 183
451–452, 452t, 456–457 regression imputation and, 27
item-level missing data and, 444–446, 446t selection models for multiple regression, 354,
linear regression and, 192–199, 194f, 196f, 355–358, 356f
199t stochastic regression imputation and, 28–29
Bias-corrected bootstrap, 425 missing at random (MAR) mechanisms and, 8–9


Bias-reduction model, 42, 42f missing not at random (MNAR) mechanism
Binary predictor variables and, 11
Bayesian estimation for categorical variables Conditional effect, 203–204, 203t, 204f, 216–217
and, 239–244, 241f, 244t Conditional mean imputation. See Regression
choosing a missing data-handling procedure, imputation
470–473, 472f Conditionally missing at random (CMAR), 8, 45,
latent response formulation for nominal 362–363, 363t. See also Missing at random
variables and, 246–248, 247f (MAR)
Bivariate normal difference score distribution, Confidence intervals. See also Credible interval
246–248, 247f mediation and indirect effects and, 425–426,
Blimp, 473–474 426t
Bootstrap confidence intervals, 425–426, 426t multiple imputation and, 285–286
Bootstrapping, 68–69, 69t, 408, 426–428, 426t overview, 170–171
Box–Cox transformation, 412–413 Confirmatory factor analysis, 428, 431
Burn-in interval, 163, 176–177, 176f, 177f Conjugate prior distribution, 151. See also Prior
distribution
Convergence of the Gibbs sampler, 171–180,
C 173f, 174f, 175f, 176f, 177f, 178f, 180t
Correlations, 401–407, 404t, 405t, 407t
Categorical outcomes, 90–96, 91f, 92f, 93f, Covariance coverage, 38, 39t. See also Multiform
96t designs
Categorical variables, 223–226, 224f, 225f, 428. Covariance matrix
See also Bayesian estimation for categorical Bayesian analysis and, 184–185, 218
variables binary and ordinal predictor variables and, 242
Central tendency, 149 distribution of missing values and, 255
Chained equations imputation. See Fully multiple imputation and, 261
conditional specification multivariate normal data and, 88–89
Chi-square statistics, 297–299 nominal predictor variables and, 253
Cohort-sequential design, 41. See also structural equation modeling framework and,
Longitudinal designs 117–118
Commercial software, 473–474, 476, 480–481 Wald test and, 293–294
Comparative fit index (CFI), 432 Credible interval, 149, 170–171. See also
Compatibility, 275–276 Confidence intervals
Complete-case analysis, 27–28. See also Listwise Cross-level interaction, 320–321
deletion Cross-product derivative expression, 64–65
Complete-case restriction, 388 Cross-sequential design, 41, 41t. See also
Complete-data log-likelihood Longitudinal designs
expectation maximization (EM) algorithm and, Curvilinear effects
113 Bayesian estimation with missing data and,
incomplete data records and, 105 211–213, 213t, 214f
overview, 100, 112 multiple imputation and, 288–290
Complete-data model, 134 overview, 130–132, 132t, 133f
Computer simulations. See Simulation
Computer software, 473–474, 476, 480–481
Conditional distribution D
Bayesian analysis and, 168–171, 169f, 171t,
184–185, 186t, 228–230, 230f, 231f D2 statistic, 297–299
logistic regression and, 257 Data set descriptions, 485–492
Degrees of freedom, 285–286, 295–297 multilevel missing data and, 344


Deletion methods, 24–25, 25f multivariate normal data and, 103
Derivative equations overview, 70, 112–115, 115t
analytic solution and, 58–59 Expectation step (E-step), 113, 123, 144. See
information matrix and parameter covariance also Expectation maximization (EM)
matrix, 64–66 algorithm
standard errors with incomplete data and, Expected information, 65–66
108–112 Extra dependent variable model, 42, 42f, 136,
two-stage estimation and, 137–138 137f, 141
Descriptive summaries, 401–407, 404t, 405t, 407t
Diffuse MNAR. See also Missing not at random
(MNAR) F
Diggle–Kenward selection model, 382–384,
383f Factor loadings, 431
overview, 11, 349 Factored regression framework. See also Linear
pattern mixture model analysis examples, regression; Sequential specification
376–378, 377f, 378t auxiliary variables and, 139–140, 140f, 141,
pattern mixture model and, 351–352, 351f 214–215
selection models, 349–350, 350f, 358, 365– Bayesian estimation and, 193–195, 194f, 197–
367, 366f 199, 200–201, 239–244, 241f, 244t
Diffuse pattern mixture model, 351–352, 351f. categorical outcomes and, 143–145, 144t,
See also Diffuse MNAR; Pattern mixture 239–244, 241f, 244t
models choosing a missing data-handling procedure,
Diffuse selection model, 349–350, 350f. See also 470–473, 472f
Diffuse MNAR; Selection models with a count outcome, 462–464, 463f, 464t
Diggle–Kenward selection model, 382–384, curvilinear effects and, 131–132, 132t, 133f,
383f, 390–391, 392t, 393, 393f, 394, 399 211–213, 213t, 214f
Dispersion, 149 interaction effects and, 127–130, 128t, 129f,
Distribution of missing values 130t, 200–201, 449–457, 452t, 453t
Bayesian estimation and, 201–202, 242–244, item-level missing data and, 440, 441–443,
244t, 254–256, 256t 444f
interaction effects and, 201–202, 322 mediation and indirect effects and, 423
random coefficient models and, 316–318, 317f multilevel missing data and, 304–306, 316,
random intercept regression models and, 321–322, 327–328
306–308, 307f nominal predictor variables and, 252–254
Distributional assumption, 9, 475, 477–478 non-normality and, 409–410, 417
Duplication matrix, 89 overview, 116–117, 120–124, 122f, 124t
pattern mixture model for multiple
regression, 373
E selection models for multiple regression and,
356–358
EQS, 473 FIML estimation. See Full-information
Equality constraints, 441–442 maximum likelihood (FIML) estimation
Everywhere missing at random, 13. See also First derivatives
Missing at random (MAR); Missing data multivariate normal data and, 88–89
mechanisms Newton’s algorithm and, 72–74, 73f, 74t
Expectation maximization (EM) algorithm overview, 62
factored regression models and, 122–123, 144 Fisher information, 63–64
imputation procedure and, 27–28 Fixed effect imputation, 331–332
Focal analysis model Fully conditional specification


auxiliary variables and, 17–20, 18f, 133– descriptive summaries and correlations and,
134 403
Bayesian estimation and, 203–204, 203t, 204f, imputation, 338–343, 338t
242 interaction and curvilinear effects and, 289–290
inclusive analysis strategy and, 19–20 with latent variables, 340–341
interactions with scales, 449–457, 452t, longitudinal data analyses and, 461–462
453t overview, 272–279, 274f, 275f, 278t, 346
mediation and indirect effects and, 424 structural equation modeling framework and,
non-normal outcome variables, 417 433–435
pattern mixture model analysis examples,
374, 375t
preparing for missing data handling example, G
20–23, 22t, 23t
two-stage estimation and, 136–139, 138f gh distribution, 411
Focal regression model Gibbs sampler. See also Markov chain Monte
auxiliary variables and, 140–141 Carlo (MCMC) algorithms
Bayesian estimation with missing data and, assessing convergence of, 171–180, 173f, 174f,
190 175f, 176f, 177f, 178f, 180t
multilevel missing data and, 308 conditional distribution and, 184–185, 186t
Focused MNAR process. See also Missing not at fully conditional specification and, 277
random (MNAR) linear regression and, 168–171, 169f, 171t
Diggle–Kenward selection model, 382–384, overview, 158–160
383f posterior distribution and, 190
overview, 11, 349 regression with an ordinal outcome and, 236
pattern mixture model and, 351–352, 351f, Global fit assessments, 448
375–376, 376t Gradient ascent, 70–72, 71t
selection models and, 349–350, 350f, 363– Grid search, 55–56, 56t, 58
365, 364t Growth curve model
Focused pattern mixture model, 351–352, 351f. Diggle–Kenward selection model, 382–384, 383f
See also Focused MNAR process; Pattern longitudinal data analyses and, 379, 457–462,
mixture models 460t
Focused selection model, 349–350, 350f. See overview, 325–326, 457–458
also Focused MNAR process; Selection Growth models, 465–468
models
Fraction of missing information, 284–285
Frequentist paradigm, 148, 153–154, 171 H
Full conditional distributions, 161, 234–239,
235f, 238f, 239t, 248–252, 251f Hedeker–Gibbons model, 394–396, 395f, 399
Full-information maximum likelihood (FIML) Hessian matrix
estimation. See also Maximum likelihood multivariate normal data and, 89
estimates; Maximum likelihood estimates overview, 64–66
with missing data standard errors with incomplete data and,
non-normality and, 410–411, 411f, 418 108–112
overview, 98 Hierarchical data, 302–313, 303f, 307f, 312t, 313f
structural equation modeling framework and, HLM approach, 343–344
428–429, 430t Homogeneity of means and covariances, 14–15
Fully Bayesian imputation. See Model-based Hyperparameters, 156, 182–183, 182f
imputation procedure Hypothesis testing, 54
I Item response theory (IRT), 302


Item-level imputation, 447–448
Identifying correlates of incomplete variables, Item-level missing data
22–23, 23t missing questionnaire items, 439–449, 444f,
Identifying correlates of missingness, 21–22, 22t 446t, 447t
Identifying restrictions, 368, 388 scale scores and, 449–457, 452t, 453t
Ignorable missingness, 13–14 Iterative optimization algorithm
Implicit imputation, 106. See also Imputation; maximum likelihood estimates and, 55
Maximum likelihood estimates multivariate normal data and, 88–89
Imputation. See also Multiple imputation overview, 70–74, 71t, 73f, 74t
Bayesian estimation and, 204–206, 205f, 206f,
218–220, 220t, 228–230, 230f
descriptive summaries and correlations and, J
401–407, 404t, 405t, 407t
distribution of a missing regressor, 195–197, Joint model. See also Multiple imputation
196f Bayesian estimation with missing data and,
imputing missing values, 190–191, 191f, 192f, 217
195–197, 196f interaction and curvilinear effects and,
logistic regression and, 258 289–290
multilevel missing data and, 306–308, 307f item-level missing data and, 447–448
non-normality and, 408, 410–411, 411f, 417– longitudinal data analyses and, 461–462
421, 418f, 419f, 420f, 421f maximum likelihood estimates with missing
structural equation modeling framework and, data and, 116–117
433–435 multilevel missing data and, 332–337, 335t,
Inclusive analysis strategy, 19–20, 133 338t
Incomplete data records overview, 263–272, 265f, 269t, 270f, 271f, 346
maximum likelihood estimates and, 104t, structural equation modeling framework and,
105f, 106f, 107f 433–435
standard errors with, 107–112, 111t Jumping distribution, 208–209
Incomplete variables, 22–23, 23t, 189–191, 191f, Just-another-variable approach, 126–127, 288–
192f 290
Indicants, 245 Just-identified model, 433
Indirect effects, 422–428, 422f, 424f, 426t
Inestimable parameters, 371–372
Inference, 12–13 K
Information, 63–64
Information matrix, 64–66, 138–139 Kernal density plots
Interaction effects computer simulations comparing missing
Bayesian estimation with missing data and, data methods and, 33, 33f, 34f
199–204, 203t, 204f overview, 164
maximum likelihood estimates with missing using MCMC to estimate the mean and
data and, 125–130, 128t, 129f, 130t variance, 164–165, 164f, 165f
multilevel interaction effects, 320–324, 323t,
324t, 325f
multiple imputation and, 288–290 L
Inverse gamma, 157
Inverse transformation, 412–413 Lagrange multiplier. See Score test; Significance
Inverse Wishart distribution, 181–183, 182f, tests
184–185, 186t, 220 Last observation carried forward, 30–31, 31t
Latent growth curve model, 457–462, 460t Linear regression


Latent imputations, 228–232, 230f, 231f, 232t. auxiliary variables and, 141
See also Imputation Bayesian analysis and, 166–171, 169f, 171t,
Latent response variable formulation 192–199, 194f, 196f, 199t
Bayesian estimation for categorical variables maximum likelihood estimates and, 75–79,
and, 223–226, 224f, 225f 77f, 79t, 115–124, 119t, 120f, 122f, 124t
choosing a missing data-handling procedure, non-normal outcome variables, 420–421,
470–473, 472f 420t, 421f
for nominal variables, 244–248, 247f LISREL, 473
overview, 90–94, 91f, 92f, 93f, 438–439 Listwise deletion, 24–25, 25f, 31–36, 32t, 33f,
regression with a binary outcome and, 226– 34f, 35t, 36t
232, 227f, 230f, 231f, 232t Little’s MCAR test, 16–17
regression with an ordinal outcome and, Logistic regression. See also Categorical
232–239, 233f, 235f, 238f, 239t outcomes
Latent variables
  interactions with scales and, 452–455, 453t
  latent variable pattern of missing data, 2–3, 3f
  latent variable regression model, 232–239, 233f, 235f, 238f, 239t
  longitudinal data analyses and, 380
  mediation and indirect effects and, 426–428
Lavaan, 474
Learning rate, 71
Level-1 units, 302, 305. See also Multilevel missing data
Level-2 units, 302, 305. See also Multilevel missing data
Likelihood functions. See also Log-likelihood functions; Maximum likelihood estimates
  Bayesian analysis and, 151, 152f, 154–155, 156, 167, 181
  compared to probability distributions, 47–49, 49f, 50f
  multiple imputation and, 295–297
  overview, 48
  regression with a binary outcome and, 227–228
  regression with an ordinal outcome and, 234
  univariate normal distribution and, 52–54, 53f, 54f, 55f
Likelihood ratio statistic. See also Significance tests
  maximum likelihood estimates with missing data and, 124–125
  multiple imputation and, 295–297, 298–299
  overview, 79, 80–83
Linear mixed model. See Growth curve model
Logistic regression
  Bayesian estimation for categorical variables and, 256–259, 259t
  maximum likelihood estimates with missing data and, 143–145, 144t
  non-normality and, 412t
  overview, 90–96, 91f, 92f, 93f, 96t
Log-likelihood functions. See also Likelihood functions
  analytic solution and, 58, 59f
  complete-data log-likelihood, 100
  estimating standard errors, 60–61, 60f
  estimating unknown parameters, 55–58, 56t, 57f
  information matrix and parameter covariance matrix, 64–66
  iterative optimization algorithm and, 70–74, 71t, 73f, 74t
  model comparisons and, 358–360
  multiple imputation and, 296–297
  multiple regression analysis and, 75–76
  multivariate normal data and, 85–88, 86f, 87f
  Newton's algorithm and, 72–74, 73f, 74t
  observed-data log-likelihood, 100–103, 101t
  overview, 53
  probability distribution and, 94–95
  structural equation modeling framework and, 118
  univariate normal distribution and, 52–54, 53f, 54f, 55f
Longitudinal designs
  last observation carried forward, 30–31, 31t
  longitudinal data analyses, 388–399, 390f, 392t, 393f, 395f, 397f, 398t, 457–462, 460t
  longitudinal growth curve model, 379
  missing not at random (MNAR) mechanism and, 379–382, 380f
  overview, 39–41, 40t, 41t
536 Subject Index

M

Mahalanobis distance, 86
Marginal distribution, 158–159
Marginal posterior distribution, 164
Markov chain Monte Carlo (MCMC) algorithms
  assessing convergence of the Gibbs sampler, 171–180, 173f, 174f, 175f, 176f, 177f, 178f, 180t
  Bayesian estimation and, 188–189, 198–199, 199t, 222–223
  conditional distribution and, 184–185, 186t, 228–230, 230f, 231f
  distribution of missing values and, 242–244, 244t, 254–256, 256t
  fully conditional specification and, 274, 277, 279
  interaction effects and, 201–202, 451–452, 452t
  item-level missing data and, 445, 448–449
  joint model imputation and, 266–272, 269t, 270f, 271f, 336–337, 338t
  linear regression and, 168–171, 169f, 171t, 198–199, 199t
  logistic regression and, 257–259, 259t
  mediation and indirect effects and, 424–425, 424f, 427
  model-based imputation and, 292
  multilevel interaction effects and, 322–324, 323t, 324t, 325f
  multilevel missing data and, 301–301, 306, 308–313, 312t, 313f, 334–337, 335t, 338t
  multiple imputation and, 261
  non-normal outcome variables, 419
  overview, 221
  posterior distribution and, 190
  regression with a nominal outcome and, 248–252, 251f
  regression with an ordinal outcome and, 234–239, 235f, 238f, 239t
  selection models for multiple regression, 354–355, 354f
Markov chain Monte Carlo (MCMC) estimation. See also Gibbs sampler; Monte Carlo computer simulations
  Bayesian analysis and, 158–160
  overview, 112, 145, 148
  using MCMC to estimate the mean and variance, 160–165, 161f, 162f, 163t, 164f, 165f
Maximization step (M-step), 114, 123, 144. See also Expectation maximization (EM) algorithm
Maximum likelihood estimates. See also Likelihood functions; Maximum likelihood estimates with missing data
  analytic solution and, 55, 58–60, 59f
  categorical outcomes and, 90–96, 91f, 92f, 93f, 96t
  choosing a missing data-handling procedure, 470–473, 472f
  computer simulations comparing missing data methods and, 31–36, 32t, 33f, 34f, 35t, 36t
  estimating standard errors, 60–64, 60f, 61f, 67–70, 69t
  estimating unknown parameters, 55–58, 56t, 57f
  example of for multiple regression, 78–79, 79t
  incomplete data records and, 103–106, 104t, 105f, 106f, 107f
  information matrix and parameter covariance matrix, 64–66
  item-level missing data and, 446
  iterative optimization algorithm and, 70–74, 71t, 73f, 74t
  linear regression and, 75–79, 77f, 79t
  Little's MCAR test and, 16–17
  longitudinal data analyses and, 457, 460
  mediation and indirect effects and, 425–426, 426t
  missing at random (MAR) mechanisms and, 12–13
  model comparisons and, 358–360
  multilevel missing data and, 343–345, 346t
  multiple imputation and, 287–288
  multivariate normal data and, 84–90, 86f, 87f, 90t, 99–103, 101t
  non-normality and, 204–205, 408
  overview, 23–24, 45–46, 47, 96–97, 98–99, 260, 261
  power analyses for planned missingness designs, 43
  regression imputation and, 27–28
  second derivatives and, 61–64, 61f
  selection model analysis examples, 362–367, 363t, 364t, 366f
  standard errors and, 76–78, 77f
  structural equation modeling framework and, 430–439, 435t
  univariate normal distribution and, 50–54, 51f, 53f, 54f, 55f
Maximum likelihood estimates with missing data. See also Maximum likelihood estimates
  auxiliary variables and, 132–142, 134f, 135f, 136f, 137f, 138f, 140f, 142t
  categorical outcomes and, 143–145, 144t
  curvilinear effects, 130–132, 132t, 133f
  expectation maximization (EM) algorithm and, 112–115, 115t
  incomplete data records and, 103–106, 104t, 105f, 106f, 107f
  interaction effects and, 125–130, 128t, 129f, 130t
  linear regression and, 115–124, 119t, 120f, 122f, 124t
  multivariate normal data and, 99–103, 101t
  overview, 98–99, 145
  significance testing and, 124–125
  standard errors with incomplete data and, 107–112, 111t
Mdmb, 474
Mean estimation, 160–165, 161f, 162f, 163t, 164f, 165f
Mean substitution. See Arithmetic mean imputation
Mean vector
  Bayesian analysis and, 184–185, 186t, 218
  multivariate normal data and, 84, 85, 88–89
  structural equation modeling framework and, 117–118
Mediation analysis, 422–428, 422f, 424f, 426t
Metropolis–Hastings algorithm
  binary and ordinal predictor variables and, 242, 243
  curvilinear effects and, 213
  distribution of missing values and, 255
  item-level missing data and, 445
  mediation and indirect effects and, 424
  multilevel missing data and, 306, 308
  nominal predictor variables and, 253
  non-normality and, 415
  overview, 207–211, 208f, 210f
  regression with an ordinal outcome and, 236–237, 238
MICE (Multiple Imputation by Chained Equations) package
  multilevel missing data and, 338
  overview, 272, 273–274, 277, 474
  predictive mean matching and, 277–278
  structural equation modeling framework and, 434–435
Missing always at random, 13. See also Missing at random (MAR); Missing data mechanisms
Missing always completely at random, 13. See also Missing at random (MAR); Missing data mechanisms
Missing at random (MAR)
  auxiliary variables and, 17–20, 18f, 142
  choosing a missing data-handling procedure, 470–473, 472f
  computer simulations comparing missing data methods and, 35, 35t
  deletion methods and, 24–25, 25f
  evaluating, 14–17
  ignorable and nonignorable missingness and, 13–14
  imputing missing values, 191, 191f, 192f
  inference and, 12–13
  overview, 3–4, 8–10, 9f, 10f, 45–46, 348
  preparing for missing data handling example, 20–23, 22t, 23t
  reporting results from a missing data analysis, 478–479
  standard errors with incomplete data and, 111
Missing completely at random (MCAR)
  computer simulations comparing missing data methods and, 32–33, 32t, 33f, 34f
  deletion methods and, 24
  evaluating, 14–17
  overview, 3–4, 6–7, 6f, 7f, 45
  preparing for missing data handling example, 21–22
  standard errors with incomplete data and, 111
Missing data imputation, 258, 265–266. See also Imputation
Missing data mechanisms. See also Missing at random (MAR); Missing completely at random (MCAR); Missing not at random (MNAR)
  auxiliary variables and, 17–20, 18f
  choosing a missing data-handling procedure, 470–473, 472f
Missing data mechanisms (cont.)
  comparing via simulation, 31–36, 32t, 33f, 34f, 35t, 36t
  computer software and, 473–474
  diagnosing, 14–17
  ignorable and nonignorable missingness, 13–14
  inference and, 12–13
  missing data patterns, 2–3, 3f
  older missing data methods, 23–31, 25f, 26f, 28f, 30f, 31t
  overview, 1–2, 3–14, 4t, 5f, 6f, 7f, 9f, 10f, 12f, 45–46, 470, 483–484
  partitioning the data, 4–5, 4t, 5f
  planned missing data designs, 37–45, 38t, 39t, 40t, 41t, 42f, 44f, 45t
  power analyses and, 43–45, 44f, 45t
  preparing for missing data handling example, 20–23, 22t, 23t
  reporting results from a missing data analysis, 474–483, 477t
Missing data-handling methods for multilevel models, 301–302. See also Multilevel missing data
Missing not always at random, 13. See also Missing at random (MAR); Missing data mechanisms
Missing not at random (MNAR). See also Pattern mixture models; Selection models
  auxiliary variables and, 17–20, 18f, 142
  computer simulations comparing missing data methods and, 36, 36t
  Diggle–Kenward selection model, 382–384, 383f
  evaluating, 14–17
  ignorable and nonignorable missingness and, 13–14
  imputing missing values, 191
  longitudinal data analyses and, 379–382, 380f, 388–389, 390f, 392t, 393f, 395f, 397f, 398t
  major modeling frameworks for, 349–352, 350f, 351f
  MNAR-by-omission, 17–20, 18f, 22–23, 214, 362–363, 363t
  model comparisons and, 358–360
  overview, 3–4, 11–12, 12f, 45–46, 348–349, 399–400
  pattern mixture model for multiple regression, 367–379, 369f, 370f, 375t, 376t, 377f, 378t
  random coefficient pattern mixture models, 385–388, 386f
  reporting results from a missing data analysis, 478–479, 481–483
  selection models for multiple regression, 352–358, 353f, 354f, 361–367, 363t, 364t, 366f
  shared parameter (random coefficient) selection model, 384–385, 385f
Missingness, 381–384, 383f
MNAR-by-omission. See also Missing not at random (MNAR)
  auxiliary variables and, 17–19, 18f, 214
  inclusive analysis strategy and, 19–20
  preparing for missing data handling example, 22–23
  selection model analysis examples, 362–363, 363t
Model-based bootstrap, 83
Model-based imputation procedure. See also Multiple imputation; Sequential specification
  interactions with scales and, 451–452, 452t, 456–457
  item-level missing data and, 448
  longitudinal data analyses and, 459–460
  multilevel missing data and, 331
  overview, 262–263, 290–292, 292t, 346
  structural equation modeling framework and, 433
Model-based multilevel multiple imputation, 460–461, 460t
Moderated regression
  Bayesian estimation with missing data and, 199–204, 203t, 204f
  interaction effects and, 128–129, 128t, 129f
  Metropolis–Hastings algorithm and, 207–211, 208f, 210f
  model-based imputation and, 292
  non-normality and, 205
Modification index. See Score test
Monotone pattern of missing data, 2–3, 3f, 381–382
Monte Carlo computer simulations. See also Markov chain Monte Carlo (MCMC) algorithms; Simulation
  bootstrap resampling and, 68–69, 69t
  MCMC estimation with the Gibbs sampler, 160
  overview, 31
Mplus, 473
Multiform designs, 37–39, 38t, 39t. See also Planned missing data designs
Multilevel data, 301–302
Multilevel imputation, 459–460
Multilevel interaction effects, 320–324, 323t, 324t, 325f
Multilevel missing data. See also Missing data-handling methods for multilevel models
  fully conditional specification imputation and, 338–343
  joint model imputation and, 332–337, 335t, 338t
  maximum likelihood estimates and, 343–345, 346t
  multilevel interaction effects, 320–324, 323t, 324t, 325f
  multiple imputation and, 331–332
  overview, 301–302, 346
  random coefficient models and, 313–320, 315f, 317f, 319t, 320t
  random intercept regression models and, 302–313, 303f, 307f, 312t, 313f
  three-level models, 324–329, 329t, 330f
Multilevel model. See Growth curve model
Multilevel structural equation modeling, 344–345
Multinomial probit model, 244–248, 247f
Multinomial regression model, 248–256, 251f, 256t
Multiple imputation. See also Imputation
  agnostic versus model-based multiple imputation, 262–263
  analyzing multiply imputed data sets, 279–285, 280f, 281t
  Bayesian estimation with missing data, 189
  choosing a missing data-handling procedure, 470–473, 472f
  descriptive summaries and correlations and, 401–407, 404t, 405t, 407t
  different answers from when compared to other methods, 287–288
  fully conditional specification, 272–279, 274f, 275f, 278t
  interaction effects and, 288–290, 451–452, 452t, 456–457
  item-level missing data and, 439–449, 444f, 446t, 447t
  joint model imputation, 263–272, 265f, 269t, 270f, 271f
  with latent response variables, 438–439
  mediation and indirect effects and, 426–428
  missing at random (MAR) mechanisms and, 13
  model-based imputation and, 290–292, 292t
  multilevel missing data and, 331–332
  multivariate significance tests, 293–299
  nested within bootstrapping, 427
  non-normality and, 408, 415–416
  normal-theory estimation, 435–437, 435t
  overview, 23–24, 45–46, 145, 261–262, 299, 346
  pooling parameter estimates, 282
  pooling standard errors, 282–285
  reporting results from a missing data analysis, 476, 477–478, 483
  structural equation modeling framework and, 430–439, 435t
  test statistic and confidence intervals and, 285–286
  with weighted least squares, 437–438
Multiple regression analysis
  Bayesian estimation with missing data and, 197–199, 199t
  Markov chain Monte Carlo (MCMC) estimation, 170–171, 171t
  maximum likelihood estimates and, 75–79, 77f, 79t
  pattern mixture model analysis examples, 374–379, 375t, 376t, 377f, 378t
  pattern mixture model for multiple regression, 367–374, 369f, 370f
  selection models for multiple regression, 352–358, 353f, 354f, 356f
  significance testing and, 124–125
  structural equation modeling framework and, 119–120
Multiple-group imputation strategy, 402
Multiple-group model regression estimates, 129–130, 130t
Multivariate data
  factored regression models and, 120–124, 122f, 124t
  missing at random (MAR) mechanisms and, 10
Multivariate distribution to a set of incomplete variables, 263–272, 265f, 269t, 270f, 271f
Multivariate normal data
  Bayesian analysis and, 180–185, 182f, 186t
  Bayesian estimation with missing data and, 217–220, 220t
  choosing a missing data-handling procedure, 470–473, 472f
  maximum likelihood estimation for, 99–103, 101t
  overview, 84–90, 86f, 87f, 90t
Multivariate normal distribution, 276
Multivariate significance tests, 293–299. See also Significance tests

N

Neighboring-case restriction, 388
Nested models, 80–81
Newton–Raphson algorithm. See Newton's algorithm
Newton's algorithm, 72–74, 73f, 74t, 88–89, 103, 112, 114
Nominal variables
  choosing a missing data-handling procedure, 470–473, 472f
  latent response formulation for, 244–248, 247f
  nominal predictor variables, 252–256, 256t
  regression with a nominal outcome and, 248–252, 251f
Nonignorable missingness, 13–14
Non-informative prior, 150, 152. See also Prior distribution
Non-normality
  Bayesian estimation with missing data and, 204–206, 205f, 206f
  outcome variables and, 417–421, 418f, 419f, 420t, 421f
  predictor variables and, 407–416, 411f, 412t, 414f, 416f
Non-normed fit index (NNFI), 432
Nonresponse, 355–356. See also Missing not at random (MNAR)
Nonresponse bias. See also Bias
  auxiliary variables and, 16, 17, 142
  computer simulations comparing missing data methods and, 32, 33f
  diffuse processes and, 349
  inclusive analysis strategy and, 19–20, 133
  major modeling frameworks and, 349–350
  missing not at random (MNAR) mechanism and, 214, 217, 348
  multiple imputation and, 288–289
Normal distributions. See also Multivariate normal data
  Bayesian estimation with missing data and, 218–219
  choosing a missing data-handling procedure, 470–473, 472f
  imputations and, 417–418, 418f
  non-normality and, 410–411, 411f, 412t
  univariate normal distribution and, 50–54, 51f, 53f, 54f, 55f
Normalizing transformation, 411–415, 414f
Normal-theory estimation, 435–437, 435t
Not missing at random process. See Missing not at random (MNAR)
Numerical integration, 122, 144

O

Observed information matrix, 65, 110–112, 111t
Observed-data log-likelihood
  expectation maximization (EM) algorithm and, 114–115
  incomplete data records and, 103–106, 104t, 105f, 106f, 107f
  overview, 100–103, 101t, 112
  standard errors with incomplete data and, 107–108, 110–112
Optimization algorithm, 70. See also Iterative optimization algorithm
Ordinal predictor variables
  Bayesian estimation for categorical variables and, 239–244, 241f, 244t
  choosing a missing data-handling procedure, 470–473, 472f
  latent response formulation for nominal variables and, 246–248, 247f
Outcome variables, non-normal, 417–421, 418f, 419f, 420t, 421f
Outcome-dependent missingness, 381

P

Pairwise deletion, 24–25
Parameter covariance matrix, 64–66
Parameter estimates
  model-based imputation and, 292
  multiple imputation and, 282
  pooling, 282, 292
Partial data records. See Incomplete data records
Partially factored regression model, 194, 197–199, 241f, 242. See also Factored regression framework
Partially sequential specification, 194, 197–199. See also Sequential specification
Partitioning the data, 4–5, 4t, 5f
Passive imputation, 288–290
Pattern mean difference approach, 15–16, 21–22, 22t. See also Univariate pattern mean differences
Pattern mixture models. See also Missing not at random (MNAR)
  longitudinal data analyses and, 379–382, 380f
  overview, 348, 351–352, 351f, 399–400
  pattern mixture model for multiple regression, 367–379, 369f, 370f, 375t, 376t, 377f, 378t
  random coefficient pattern mixture models, 385–388, 386f
Percentile bootstrap, 425
Person mean imputation, 26–27, 440
Planned missing data designs, 2–3, 3f, 37–45, 38t, 39t, 40t, 41t, 42f, 44f, 45t, 465–468
Pólya-gamma distribution, 257
Pooling chi-square statistics, 297–299
Population mean, 117–118
Posterior distribution
  Bayesian analysis and, 149, 151–154, 153f, 157–159, 168–171, 169f, 171t, 184–185, 186t, 190
  binary and ordinal predictor variables and, 243–244, 244t
  distribution of missing values and, 255–256, 256t
  logistic regression and, 258–259, 259t
  Markov chain Monte Carlo (MCMC) estimation, 170–171, 171t
  overview, 148–149
  regression with a binary outcome and, 227–228
  regression with an ordinal outcome and, 234
  using MCMC to estimate the mean and variance, 160–165, 161f, 162f, 163t, 164f, 165f
Posterior predictive distribution, 190, 195–197, 196f
Potential scale reduction factor (PSRF)
  assessing convergence of the Gibbs sampler, 177–180, 178f, 180t
  auxiliary variables and, 216
  Bayesian estimation with missing data and, 216, 220
  overview, 177–178
  regression with an ordinal outcome and, 238
Power analysis
  for growth models with missing data, 465–468
  for planned missing data designs, 43–45, 44f, 45t
  unplanned missing data and, 467–468
Predictive mean matching, 277–278
Presenting results. See Reporting results from a missing data analysis
Prior distribution
  Bayesian analysis and, 150–151, 150f, 154, 156–157, 158f, 167, 181–183, 182f
  regression with a binary outcome and, 227–228
  regression with an ordinal outcome and, 234
Probability distributions
  Bayesian analysis and, 154–155, 156, 167, 181
  compared to likelihood functions, 47–49, 49f, 50f
  multiple regression analysis and, 75–76
  multivariate normal data and, 85–88, 86f, 87f
  overview, 48, 94–95
  univariate normal distribution and, 50–51
Probit model, 428–429, 430t
Probit regression. See also Categorical outcomes
  Bayesian estimation for categorical variables and, 222–223, 239–244, 241f, 244t
  item-level missing data and, 448
  latent response formulation and, 224–226, 225f, 244–248, 247f
  logistic regression and, 256–259, 259t
  maximum likelihood estimates with missing data and, 143–145, 144t
  nominal predictor variables and, 252–256, 256t
  overview, 90–96, 91f, 92f, 93f, 96t
  regression with a binary outcome and, 227–228, 230–232, 231f, 232t
Probit regression (cont.)
  regression with a nominal outcome and, 248–252, 251f
  regression with an ordinal outcome and, 232–239, 233f, 235f, 238f, 239t
  structural equation modeling framework and, 434–435
Proposal distribution, 208–209
Prorated scale score, 26–27
Pseudo maximum likelihood estimation, 67–69

Q

Quasi-maximum likelihood estimation, 67–69
Questionnaire items, missing, 439–449, 444f, 446t, 447t

R

R platform, 474
Random coefficient models
  missing not at random (MNAR) mechanism and, 384–385, 385f
  overview, 313–320, 315f, 317f, 319t, 320t
  random coefficient pattern mixture models, 385–388, 386f, 393–398, 395f, 397f
  random coefficient-dependent missingness, 381
Random effect, 303, 379–380
Random intercept regression models, 302–313, 303f, 307f, 312t, 313f, 334–337, 335t, 338t
Random within-cluster covariance matrices, 335–336, 337, 338t
Regression analysis. See also Regression imputation; Regression model parameters
  Bayesian estimation with missing data and, 218–220, 220t
  with a binary outcome and, 226–232, 227f, 230f, 231f, 232t
  with a count outcome, 462–464, 463f, 464t
  multilevel missing data and, 301–301
  multiple imputation and, 261
  preparing for missing data handling example, 20–23, 22t, 23t
  random intercept regression models and, 302–313, 303f, 307f, 312t, 313f
  regression with a nominal outcome and, 248–252, 251f
  regression with an ordinal outcome and, 232–239, 233f, 235f, 238f, 239t
  structural equation modeling framework and, 117, 118–119
Regression imputation. See also Regression analysis
  computer simulations comparing missing data methods and, 31–36, 32t, 33f, 34f, 35t, 36t
  overview, 27–28, 28f
  stochastic regression imputation and, 28–30, 30f
Regression model parameters. See also Regression analysis
  logistic regression and, 259, 259t
  mediation and indirect effects and, 426–428
  structural equation modeling framework and, 119
Relative fit assessments, 54
Relative increase in variance, 284–285
Relative probabilities, 51–52
Reporting results from a missing data analysis
  auxiliary variables, 476, 479–480
  Bayesian estimation and multiple imputation, 476, 477–478, 483
  distributional assumptions, 475, 477–478
  guidelines for, 475–476
  missing data process, 475, 478–479
  missing data rates, 476, 477t
  missing data-handling methods, 476, 480
  overview, 474–475
  sensitivity analyses, 476, 477–478, 481–483
  software tools and implementation details, 476, 480–481
Rescaled likelihood ratio statistic. See Satorra–Bentler chi-square
Reverse random coefficient imputation, 342–343
Robust standard errors
  multiple imputation and, 435–437, 435t
  non-normality and, 408
  overview, 67–69
  structural equation modeling framework and, 431
Robust test statistics, 81–83
Root mean square error of approximation (RMSEA), 432
S

Sampling error, 283
Sampling variance, 63–64
Sandwich estimator standard errors
  non-normality and, 408, 418
  overview, 67–69
  rescaled likelihood ratio statistic and, 81–82
  structural equation modeling framework and, 431
  two-stage estimation and, 139
SAS, 473
Satorra–Bentler chi-square, 81–82, 431–432, 436
Saturated correlates model, 134–135, 134f, 135f, 136f, 141
Saturated model, 433
Saving filled-in data sets, 276–277
Scale scores
  interaction effects and, 449–457, 452t, 453t
  missing questionnaire items, 439–449, 444f, 446t, 447t
Scale-level imputation, 447
Score test. See also Significance tests
  maximum likelihood estimates with missing data and, 124–125
  overview, 79
  structural equation modeling framework and, 432–433
Score vector, 68
Second derivatives
  estimating standard errors and, 61–64, 61f
  information matrix and parameter covariance matrix, 64–66
  multiple regression analysis and, 77–78
  multivariate normal data and, 88–89
  Newton's algorithm and, 72–74, 73f, 74t
  overview, 62
  standard errors with incomplete data and, 108–112
Selection models. See also Missing not at random (MNAR)
  coding missing data indicators, 381–382
  Diggle–Kenward selection model, 382–384, 383f
  longitudinal data analyses and, 379–382, 380f
  model comparisons and, 358–360
  overview, 348, 349–350, 350f, 399–400
  selection models for multiple regression, 352–358, 353f, 354f, 356f, 361–367, 363t, 364t, 366f
  shared parameter (random coefficient) selection model, 384–385, 385f
Semipartial correlations, 22–23, 23t, 141
SemTools, 474
Sensitivity analysis
  overview, 348
  pattern mixture model for multiple regression, 374–379, 375t, 376t, 377f, 378t
  reporting results from a missing data analysis, 476, 477–478, 481–483
  selection model analysis, 361–367, 363t, 364t, 366f
Sequential imputation chain, 267
Sequential specification. See also Factored regression framework; Linear regression; Model-based imputation procedure
  auxiliary variables and, 214–215
  Bayesian estimation and, 197–199, 241f, 242
  multilevel missing data and, 309–310
  nominal predictor variables and, 252–254
  non-normality and, 409–410
  overview, 193–195, 194f
  selection models for multiple regression and, 356–358
Shared parameter model
  longitudinal data analyses and, 379–382, 380f
  missing not at random (MNAR) mechanism and, 384–385, 385f
Shared parameter (random coefficient) selection model, 384–385, 385f, 392–393
Significance tests
  maximum likelihood estimates with missing data and, 124–125
  missing at random (MAR) mechanisms and, 13
  multiple imputation and, 285–286, 293–299
  multivariate significance tests, 293–299
  overview, 79–84
  pattern mean differences and, 15–16
  preparing for missing data handling example, 21
Simple logarithmic transformation, 412–413
Simulation. See also Monte Carlo computer simulations
  comparing missing data methods via, 31–36, 32t, 33f, 34f, 35t, 36t
  power analyses for planned missingness designs, 43–45, 44f, 45t
  selection models for multiple regression, 354
Single imputation. See also Imputation
  arithmetic mean imputation, 25–27, 26f
  listwise and pairwise deletion, 24–25, 25f
  overview, 24
Single-level multiple imputation, 461–462
Six-form design, 38, 38t. See also Multiform designs
Slopes, 58, 59f, 61–64, 61f
Software, 473–474, 476, 480–481
Split-chain method, 238
SPSS, 473
Square root transformation, 412–413
Standard errors
  alternative approaches to estimating, 67–70, 69t
  based on expected information, 65–66
  with incomplete data, 107–112, 111t
  maximum likelihood estimates and, 60–64, 60f, 61f, 76–78, 77f
  missing at random (MAR) mechanisms and, 13
  model-based imputation and, 292
  multiple imputation and, 282–285
  multivariate normal data and, 88–89
  pooling, 282–285, 292
  second derivatives and, 61–64, 61f
  structural equation modeling framework and, 118
  two-stage estimation and, 137, 139
Stata, 473
Statistical significance tests. See Significance tests
Stochastic regression imputation, 28–30, 30f, 31–36, 32t, 33f, 34f, 35t, 36t
Structural equation modeling
  auxiliary variables and, 133–134
  multilevel missing data and, 344–345
  overview, 116–120, 119t, 120f, 428–439, 429f, 430t, 435t
Subgroups, 401–407, 404t, 405t, 407t
Substantive model-compatible imputation. See Model-based imputation procedure
Synthetic parameter values, 160

T

t distribution, 411
Target distribution or target function, 207, 208f, 209–210, 210f
Tests, significance. See Significance tests
Thinning interval, 267
Three-form design, 37–38, 38t, 43–45, 44f, 45t. See also Multiform designs
Three-level models, 324–329, 329t, 330f
Trace plots
  assessing convergence of the Gibbs sampler, 172–174, 173f, 174f, 175f
  item-level missing data and, 445
  overview, 172
  regression with an ordinal outcome and, 237–238, 238f
Transformations, 411–415, 414f
Truncated normal distribution, 229
t-statistic, 285–286, 293
Tucker–Lewis Index (TLI), 432
Two-method measurement designs, 41–43, 42f
Two-stage estimation, 133–134, 136–139, 138f

U

Uncongenial scenarios, 287–288
Underidentified pattern of missing data, 2–3, 3f
Univariate analysis, 121, 148
Univariate normal distribution, 50–54, 51f, 53f, 54f, 55f, 155–159, 158f
Univariate pattern mean differences, 15–16. See also Pattern mean difference approach
Univariate pattern of missing data, 2–3, 3f
Unknown parameters, 55–58, 56t, 57f
Unplanned missing data
  multiform designs and, 39
  power analyses for planned missingness designs, 43–45, 44f, 45t
  power analysis and, 467–468
U-shaped function, 63
Utilities, 245
V

Variance estimation, 160–165, 161f, 162f, 163t, 164f, 165f, 284–285
Variance–covariance matrix
  Bayesian analysis and, 184–185, 186t
  multiple regression analysis and, 78
  multivariate normal data and, 84, 89–90
  overview, 65
  standard errors with incomplete data and, 111–112, 111t
  structural equation modeling framework and, 117–118, 119, 433–434
  two-stage estimation and, 137
  Wald test and, 294–295

W

Wald test. See also Significance tests
  maximum likelihood estimates with missing data and, 124–125
  multiple imputation and, 293–295, 298–299
  overview, 79–80, 81–83, 95–96
Wave missing data designs, 40–41, 40t, 43–45, 44f, 45t. See also Longitudinal designs
Weibull distributions, 411
Weighted least squares estimation, 437–438
Wishart distribution, inverse, 181–183, 182f, 184–185, 186t, 220
Within-cluster regression model
  joint model imputation and, 332–333, 335–336, 337, 338t
  overview, 302–304, 303f
  random coefficient models and, 315, 315f
Within-imputation variance, 283–285, 293–294
Within-subject mean difference, 284

Y

Yeo–Johnson power transformation
  non-normal outcome variables, 417, 418–421, 419f, 420t, 421f
  non-normality and, 408, 409, 412t, 413–416, 414f, 416f
  overview, 205–206, 206f

Z

z-statistic, 80. See also Wald test
About the Author

Craig K. Enders, PhD, is Professor and Area Chair in Quantitative Psychology in the
Department of Psychology at the University of California, Los Angeles. His primary
research focus is on analytic issues related to missing data analyses, and he leads the
research team responsible for developing the Blimp software application for missing
data analyses. Dr. Enders also conducts research in the areas of multilevel modeling and
structural equation modeling, and is an active member of the Society of Multivariate
Experimental Psychology, the American Psychological Association, and the American
Educational Research Association.

NOTATION GUIDE

i = observation index
C = number of MCMC chains
J = number of clusters (multilevel model); j is an index
E = expectation or average
G = number of groups; g is an index
H = rows per person in augmented data for factored regression (Chapter 3)
N = sample size
B = number of bootstrap samples; b is an index
T = number of iterations; t is an index
K = number of predictor variables; k is an index
V = number of variables; v is an index
P = number of unique parameters; p is an index
Q = number of hypothesized parameters tested or degrees of freedom;
q is an index
