
Introduction

It is both a pleasure and honor to introduce the fifth edition of this book. Like the previous editions, structural equation modeling (SEM) is presented in an accessible way for readers without strong quantitative backgrounds. Included in this edition are many new examples of SEM applications in disciplines that include health, political science, international studies, cognitive neuroscience, developmental psychology, sport and exercise, and psychology, among others. Some examples were selected due to technical problems in the analysis, but such examples provide a context for discussing how to deal with challenges that can and do occur in SEM, especially in samples that are not large. So not all applications of SEM described in this book are trouble free, but neither are actual research problems.

WHAT'S NEW

The many changes in this edition are intended to enhance the pedagogical presentation and cover recent developments. The biggest changes are summarized next:

1. The fourth edition of this book was one of the first introductory works to incorporate Judea Pearl's nonparametric approach to SEM, also called the structural causal model (SCM), into the larger SEM family that dates to the development of path analysis by Sewall Wright in the 1920s–1930s and to the publication of LISREL III in 1976 as the first widely available computer program for covariance structure analysis, also called covariance-based SEM. In the same tradition, this fifth edition includes composite SEM, also referred to as partial least squares path modeling or variance-based SEM, as the third full member of the SEM family. Composite SEM has developed from a set of methods seen in the 1980s–1990s as more suitable for exploratory research that emphasized prediction over explanation to a suite of full-fledged modeling techniques for exploratory or confirmatory analyses, including theory testing. Both the SCM and composite SEM offer unique perspectives on causal modeling that can benefit researchers more familiar with traditional, covariance-based SEM. This means that researchers acquainted with all three members of the SEM family can test a wider range of hypotheses about measurement and causation. I try to make good on this promise throughout the fifth edition.

2. Traditional SEM and composite SEM are described within Edward Rigdon's concept proxy framework that links data with theoretical concepts through proxies, which approximate concepts based on
correspondence rules—also called auxiliary theory—about presumed causal directionality between concepts and data. This point refers to the distinction between reflective measurement, where proxies for latent variables are common factors, and formative measurement, where proxies for emergent variables are composites of observed variables. The choice between the two measurement models just mentioned should be based on theory, not by default due to the researcher's lack of awareness about SEM techniques for analyzing composites.

3. There are additional new chapters on SEM analyses in small samples and recent developments in mediation analysis. Surveyed works about mediation analysis concern research designs and definitions of mediated effects, including natural direct and indirect effects and interventional direct and indirect effects estimated in clinical trials, among other topics. There is also coverage of new reporting standards for SEM studies by the American Psychological Association (APA) and the technique of piecewise SEM, which is based on concepts from Pearl's SCM. There are also extended tutorials on modern techniques for dealing with missing data, including multiple imputation and full information maximum likelihood (FIML), and also about instrumental variable methods as a way to deal with the confounding of target causal effects.

4. The topics of specification and identification versus analysis were described in separate chapters in the fourth edition. They are now combined into individual chapters for each technique described in the fifth edition. I believe this more closely integrated presentation helps readers to more quickly and easily develop a sense of mastery for a particular kind of SEM technique.

5. There is greater emphasis on freely available software for SEM analyses in this new edition. For example, the R package lavaan was used in most analyses described in this book. It is a full-featured computer program for both basic and advanced SEM analyses. It has the capability to analyze both common factors and composites as proxies for theoretical concepts. The syntax in lavaan is both straightforward and used in some other R packages, including cSEM for composite SEM, to specify structural equation models, so it has application beyond lavaan. Other R packages used for detailed examples in the fifth edition include semTools, piecewiseSEM, MBESS, MIIVsem, psych, WebPower, systemfit, sem, bmem, CauseAndCorrelation, dagitty, and ggm. Together with the lavaan package, a wide variety of analyses for nonparametric, parametric, and composite models in SEM is demonstrated, all with no-cost software. Commercial software for SEM is still described, including Mplus, which can feature state-of-the-art analyses before they appear in other computer tools, but free SEM software is now nearly as capable as commercial products. Also, I would guess that free software could be used in the large majority of published SEM studies.

6. Extended presentations on regression fundamentals, significance testing, and measurement and psychometrics beloved by readers of the fourth edition are freely available in updated form as primers on the book's website. This change was necessary to include the new material in the fifth edition. The topics just mentioned are still covered in the new edition but in a more concise way. New to the fifth edition in the main text is a self-test of knowledge about background concepts in statistics and measurement. There is a scoring key, too, so readers can check their understanding of fundamentals. Readers with higher scores could directly proceed to substantive chapters on SEM analyses, and readers with lower scores can consult any of the primers on the website for more information and exercises.

BOOK WEBSITE

The address for the book's website is https://www.guilford.com/kline-materials. From the site, you can freely access the computer files—data, syntax, and output files—for all detailed examples in this book. The website promotes a learning-by-doing approach. The availability of both syntax and data files means that readers can reproduce the analyses in this book by using the corresponding R packages. Even without doing so, readers can still open the output file on their own computers for a particular analysis and view the results. This is because all computer files are simple text files that can be opened with any basic text editor, such as Notepad (Windows), Emacs (Linux/UNIX), or TextEdit (macOS), among others. Syntax files are annotated with extensive comments. Even if readers use a different computer tool, such as LISREL, it is still worthwhile to review the files on the website generated in the R environment. This is because it can be helpful to view the same analysis from somewhat different perspectives. Some of the
exercises for this book involve extensions of the analyses for these examples, so there are plenty of opportunities for practice with real data sets.

PEDAGOGICAL APPROACH

Something that has not changed in the fifth edition is pedagogical style: I still speak to readers (through my author's voice) as one researcher to another, not as statistician to the quantitatively naïve. For example, the instructional language of statisticians is matrix algebra, which conveys a lot of information in a short amount of space, but readers must already be versed in linear algebra to understand the message. There are other, more advanced works about SEM that emphasize matrix presentations (Bollen, 1989; Kaplan, 2009; Mulaik, 2009b), and these works can be consulted when you are ready. Instead, fundamental concepts about SEM are presented here in the language of applied researchers: words, tabular summaries, and data graphics, not matrix equations. I will not shelter you from some of the more technical aspects of SEM, but I aim to cover fundamental concepts in accessible ways that promote continued learning.

PRINCIPLES > SOFTWARE

You may be relieved to know that you are not at a disadvantage at present if you have no experience using an SEM computer tool. This is because the coverage of topics in this book is not based on the symbolism, syntax, or user interface associated with a particular software package. In contrast, there are many books linked to specific SEM computer programs. They can be invaluable for users of a particular program, but perhaps less so for others. Instead, key principles of SEM that users of any computer tool must understand are emphasized here. In this way, this book is more like a guide to writing style than a handbook about how to use a particular word processor. Besides, becoming proficient with a particular software package is just a matter of practice. But without strong conceptual knowledge, the output from a computer tool for statistical analyses—including SEM—may be meaningless or, even worse, misleading.

SYMBOLS AND NOTATION

Advanced works on SEM often rely on the symbols and notation associated with the original matrix-based syntax for LISREL, which features a profusion of doubly subscripted lowercase Greek letters for individual model parameters, uppercase Greek letters for matrices of parameters for the whole model, and two-letter acronyms in syntax for matrices. For example, the symbols λ12(x), Λx, and LX refer in LISREL notation to, respectively, a specific loading on an exogenous (explanatory) factor, the parameter matrix of loadings for all such factors, and LISREL syntax that designates the matrix (Lambda-X). Although I use here and there some symbols from LISREL notation, I do not oblige readers to memorize LISREL notation to get something out of the book. This is appropriate because LISREL symbolism can be confusing unless one has learned the whole system by rote.

ENJOY THE RIDE

Learning a new set of statistical techniques is not everyone's idea of fun. (If doing so is fun for you, that's okay, I understand and agree.) But I hope the combination of accessible language that respects your intelligence, examples of SEM analyses in various disciplines, free access to background tutorials (i.e., the primers) and computer files for detailed examples, and the occasional bit of sage advice offered in this book will help to make the experience a little easier, perhaps even enjoyable. It might also help to think of this book as a kind of travel guide about language and customs, what to know and pitfalls to avoid, and what lies just over the horizon in SEM land.

PLAN OF THE BOOK

Part I introduces fundamental concepts, reporting standards, preparation of the data, and computer tools. Chapter 1 lays out both the promise of SEM and widespread problems in its application. Concepts in regression, significance testing, and psychometrics that are especially relevant for SEM are reviewed in Chapter 2, which also includes the self-test in these areas. Basic
steps in SEM and reporting standards are introduced in Chapter 3 along with an example from a recent empirical study. How to prepare the data for analysis in SEM and options for dealing with common problems, including missing data, are covered in Chapter 4, and computer tools for SEM, both commercial and free, are described in Chapter 5.

Part II deals with the fundamentals of hypothesis testing in SEM for classical path models, which in the analysis phase feature a single observed measure for each theoretical variable, also called single-indicator measurement. It begins in Chapter 6, which introduces nonparametric SEM as described by Judea Pearl (i.e., the SCM). The SCM is graphical in nature; specifically, causal hypotheses are represented as directed graphs where theoretical variables are depicted with no commitment to any distributional assumptions or specific operational definitions for any variable. Graphs in nonparametric SEM can be analyzed by special computer tools without data. This capability allows researchers to test their ideas before collecting the data. For example, the analysis of a directed graph may indicate that a particular causal effect cannot be estimated unless additional variables are measured. After the data are collected, it is a parametric model that is typically analyzed, and such models and their assumptions are described in Chapter 7. The technique of piecewise SEM, which connects the two perspectives, nonparametric and parametric, through novel techniques for analyzing path models, is covered in Chapter 8.

Chapters 9–12 are perhaps the most important ones in the book. They concern how to test hypotheses and evaluate models in complete and transparent ways that respect both reporting standards for SEM and best practices. These presentations are intended as counterexamples to widespread dubious practices that plague many, if not most, published SEM studies. That is, the state of SEM practice is generally poor, and one of my goals is to help readers distinguish their work above this din of mediocrity. Accordingly, Chapter 9 outlines methods for simultaneous estimation of parameters in structural equation models and explains how to analyze means along with covariances. Chapter 10 deals with the critical issue of how to properly assess model fit after estimates of its parameters are in hand. A critical point is that model fit should be routinely adjudged from at least two perspectives: global or overall fit, and local fit at the level of residuals, which in SEM concerns differences between sample and predicted associations for each pair of measured variables. Chapters 11–12 extend these ideas to, respectively, the comparison of alternative models all fit to the same data and the simultaneous analysis of a model over data from multiple groups, also called multiple-group SEM.

Part III deals with the analysis of models where at least some theoretical concepts are approximated with multiple observed variables, or multiple-indicator measurement. Such models are often referred to as "latent variable models," but for reasons explained in Chapter 13, our models include only proxies for latent variables, not latent variables themselves. These proxies are of two general types: common factors based on reflective measurement models and composites based on formative measurement models. The analysis of pure reflective measurement models in the technique of confirmatory factor analysis (CFA) is described in Chapter 14, and Chapter 15 deals with the analysis of structural regression (SR) models—also called latent variable path models—where causal effects between observed variables or common factors are estimated. Chapter 16 is about composite SEM, which analyzes causal models with multiple-indicator measurement based on formative, not reflective, measurement and where proxies for conceptual variables are composites, not common factors. Application of the technique of confirmatory composite analysis (CCA), the composite analog to CFA, is demonstrated.

Part IV is about advanced techniques. How to deal with SEM analyses in small samples is addressed in Chapter 17, and Chapter 18 concerns the analysis of categorical data in CFA. Chapter 19 explains how to analyze nonrecursive models with causal loops that involve two or more endogenous (outcome) variables assumed to influence each other, and Chapter 20 surveys recent developments that enhance, improve, and extend ways to assess hypotheses of causal mediation, or indirect causal effects that involve at least one intervening variable. The state of mediation analysis in the literature is problematic, but some of the newer approaches and methods described in this chapter seem promising. The analysis of latent growth models for longitudinal data is the subject of Chapter 21, and the application of multiple-group CFA to test hypotheses of measurement invariance is dealt with in Chapter 22. The capstone of the book is the summary of best practices in SEM in Chapter 23. Also mentioned in this chapter are common mistakes with the aim of helping you to avoid them.



Part I

Concepts, Standards, and Tools

1

Promise and Problems

This book is your guide to the principles, practices, strengths, limitations, and applications of structural equa-
tion modeling (SEM) for researchers and students without extensive quantitative backgrounds. Accordingly,
the presentation is conceptually rather than mathematically oriented, the use of formulas and symbols is kept
to a minimum, and many examples are offered of the application of SEM to research problems in disciplines
that include psychology, education, health sciences, cognitive assessment, and political science, among oth-
ers. When you finish reading this book, I hope that you will have acquired the skills to begin to use SEM in
your own research in an informed, principled way. Here is my four-point plan to get you there:

1. Review fundamental concepts in regression, hypothesis testing, and measurement beginning in the
next chapter. A self-test of knowledge in each area just mentioned is provided with a scoring key.
Additional resources for review of background concepts are freely available on this book’s website.
2. Convey what should be communicated in complete, transparent, and verifiable reporting of the
results in SEM studies. Formal reporting standards for SEM by the American Psychological Associa-
tion (APA) are both explained and modeled by example.
3. Describe all three major members of the SEM family. Each offers unique perspectives that can help
you in every stage of an SEM study, from planning through operationalization to data collection and
then finally to the analysis of a model that faithfully represents your specific hypotheses.
4. Emphasize best analysis practices and provide warnings about questionable practices that are seen
in too many published SEM studies.

To summarize, the main goal is to help you produce research that is distinguished by its accuracy, complete-
ness, and quality by following best practices, that is, SEM done right.

PREPARING TO LEARN SEM

Listed next are suggestions for the best ways to get ready to learn about SEM. I offer these suggestions in the spirit of giving you a healthy perspective at the beginning of our task, one that empowers your sense of being a researcher.

Know Your Area

Strong familiarity with the theoretical and empirical literature in your research area is the single most important thing you could bring to SEM. This is because everything—from the specification of your initial model to modification of that model in
subsequent reanalyses to interpretation of the results—must be guided by your domain knowledge. So, you need, first and foremost, to be a researcher, not a statistician or a computer nerd. This is true for most kinds of statistical analysis in that the value of the product (numerical results) depends on the quality of the ideas (your hypotheses) on which the analysis is based.

Know Your Measures

Kühnel (2001) reminded us that learning about SEM has the by-product that researchers must address fundamental issues of measurement. Specifically, analyzing measures with strong psychometric characteristics, such as good score reliability and positive evidence for validity, is essential in SEM. For example, it is impossible to approximate hypothetical constructs without thinking about how to operationalize and measure those constructs. When you have just a single measure of a construct, it is critical for this single indicator to have good psychometric properties. Similarly, the analysis of measures with deficient psychometrics could bias the results.

Eyes on the Prize

The point of SEM is to test a theory by specifying a model that represents predictions of that theory among plausible constructs measured with appropriate observed variables (Hayduk et al., 2007). If such a model does not ultimately fit the data, this outcome is interesting because there is value in reporting models that challenge or discredit theories. Beginners sometimes mistakenly believe that the point of SEM is to find a model that fits the data, but this, by itself, is not impressive. This is because any model, even one that is grossly wrong (misspecified), can be made to fit the data by making it more complicated (adding parameters). In fact, if a structural equation model is specified to be as complex as possible, it will perfectly represent the data, not only in a particular sample, but also in any other sample using the same variables.

There is also real scientific value in learning about just how and why things went wrong with a model faithfully based on a particular theory. Schreiber (2017) put it like this:

One component about SEM analyses that is rarely talked about is when no model is retained. This is an overlooked area because researchers fear a lack of being able to publish it. A well written and argued "no model retained" manuscript is a joy to read. I have reviewed many in my career and have always argued for publication. If the models and data are not working well, that is worthy of announcing it to the world as students and I have done. (p. 642)

This quote is from a dissertation defense where some committee members were critical of an SEM analysis where no model was retained. Asked by the committee chair for a response, Schreiber (personal communication, December 1, 2020) replied that retaining no model is a beautiful thing when it is apparent that predictions based on theory did not work out in practice. In contrast, whether or not a scientifically trivial model fits the data is irrelevant (Millsap, 2007).

Use the Best Research Computer in the World . . .

Which is the human brain; specifically—yours. At the end of the analysis in SEM—or any other type of statistical analysis—it is you as the researcher who must evaluate the degree of support for the hypotheses, explain any unexpected findings, relate the results to those from prior studies, and consider the implications of the findings for future research. Without your content expertise, a statistician or computer nerd could help you to select data analysis tools or write program syntax, but could not help with the other things just mentioned. As aptly noted by Pedhazur and Schmelkin (1991), "no amount of proficiency will do you any good, if you do not think" (p. 2).

Get a Computer Tool for SEM

Obviously, you need a computer program to conduct the analysis. In SEM, there are now many choices of computer tools, many for no cost. Examples of free computer software include various SEM packages such as lavaan, semTools, cSEM, and OpenMx for R, which is an open-source language and environment for statistical computing and graphics. Other options are Ωnyx, a graphical environment for creating and testing structural equation models, and JASP, an open-source computer program with a graphical user interface (GUI) and capabilities for both frequentist and Bayesian statistical procedures. It also has modules for SEM, including mediation analysis and latent growth modeling. Commercial options for SEM include Amos, Adanco, EQS, LISREL, Mplus, and SmartPLS, among others.
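To give a concrete sense of what working with a free SEM tool looks like, here is a minimal sketch in lavaan, the R package emphasized throughout this book. The model and the variable names (x, m, y) are hypothetical illustrations, not one of the book's detailed examples, and the sketch assumes a data frame named dat with those three continuous variables.

    # Minimal lavaan sketch: a small path model in which x affects y directly
    # and indirectly through m. Assumes a data frame `dat` with columns x, m, y.
    library(lavaan)

    model <- "
      m ~ a*x          # m regressed on x (path labeled a)
      y ~ b*m + c*x    # y regressed on m and x (paths labeled b and c)
      ab := a*b        # user-defined parameter: indirect effect of x on y via m
    "

    fit <- sem(model, data = dat)
    summary(fit, fit.measures = TRUE, standardized = TRUE)

Because the model statement is plain text, it can be saved as a syntax file, shared, and rerun by anyone with R, which is the learning-by-doing workflow promoted on this book's website.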

In the open science spirit of free access to research tools and resources, greater emphasis in this book is placed on free computer programs for SEM than on commercial options. Specifically, the lavaan package for R is used in most of the detailed analysis examples. It has both basic and advanced options for a wide range of SEM analyses, and its capabilities rival those of commercial software. Several other R packages that supplement or extend lavaan analyses, such as semTools, are also used in examples. Syntax, data, and output files can be freely downloaded from the website for this book—see the Introduction—which is also an exemplar of open access.

Join the Community

An electronic mail network called SEMNET is dedicated to SEM.1 It serves as an open forum for discussion and the whole range of issues associated with SEM. It also provides a place to ask questions about analyses or more general issues, including philosophical ones (e.g., the nature of causality or causal inference). Subscribers to SEMNET come from various disciplines, and they range from novices to seasoned veterans, including authors of many works cited in this book. Sometimes the discussion gets lively (sparks can fly), but so it goes in scientific discourse. An archive of prior discussions on SEMNET can be searched for particular topics. A special interest group for SEM is available for members of the American Educational Research Association (AERA).2 There is even a theme song for SEM, the hilarious Ballad of the Casual Modeler, by David Rogosa (1988).3 You can blame me if the song gets stuck in your head.

1 https://listserv.ua.edu/cgi-bin/wa?A0=semnet
2 https://www.aera.net/SIG118/Structural-Equation-Modeling-SIG-118
3 https://web.stanford.edu/class/ed260/ballad.mp3

DEFINITION OF SEM

The term structural equation modeling (SEM) refers to a set of statistical techniques for estimating the magnitudes and directions of presumed causal effects in quantitative studies based on cross-sectional, longitudinal, experimental, or other kinds of research designs. Its application can range from more exploratory and data-driven research, where preliminary causal models are generated, to more confirmatory research, where one or more extant models based on a priori hypotheses are tested or compared. While the data in SEM come from measured variables, they can be treated in the analysis as approximating hypothetical constructs. Thus, by analyzing manifest variables as indicators for target constructs, it is also possible to estimate causal relations among those constructs.

Pearl (2012, 2023) defined SEM as a causal inference method that takes three inputs (I) and generates three outputs (O). The inputs are

I-1. A set of causal hypotheses based on theory or results of empirical studies that are represented in the structural equation model. The hypotheses are typically based on assumptions, only some of which can actually be tested in the data.
I-2. A set of queries or questions about causal relations among variables of interest such as, what is the magnitude of the direct causal effect of X on Y (represented as X → Y), controlling for other presumed causes of Y? All queries follow from model specification.
I-3. Most applications of SEM are in observational studies, or nonexperimental designs, but data from experimental or quasi-experimental studies can be analyzed, too—see Breitsohl (2019) for examples.

The outputs of SEM are

O-1. Quantitative estimates of model parameters for hypothesized effects including, for example, X → Y, given the data.
O-2. A set of logical implications of the model that may not directly correspond to a specific parameter but can still be tested in the data. For example, a model may imply that variables W and Y are unrelated after controlling for certain other variables in the model.
O-3. The degree to which the testable implications of the model are supported by the data.
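As a small illustration of outputs O-2 and O-3, the sketch below uses the dagitty package (one of the free R packages listed in the Introduction) to derive testable implications from a causal graph before any data are collected. The graph and the variable names W, X, M, and Y are hypothetical and are not taken from any example in this book.

    # Hypothetical causal graph: W is a common cause of X and Y, and X affects Y
    # both directly and indirectly through M.
    library(dagitty)

    g <- dagitty("dag {
      W -> X
      W -> Y
      X -> M
      M -> Y
      X -> Y
    }")

    # Conditional independencies implied by the graph; each one is a testable
    # implication that can later be checked against the data
    impliedConditionalIndependencies(g)

    # Covariate sets that, under this graph, would control confounding of the
    # causal effect of X on Y
    adjustmentSets(g, exposure = "X", outcome = "Y")

Listing the implied independencies and the required covariates at the planning stage is one way the nonparametric family described later in this chapter helps prevent unpleasant surprises after the data are collected.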
BASIC DATA ANALYZED IN SEM

When unstandardized variables are analyzed in SEM, the basic datum for continuous variables is the
covariance, which is defined for observed variables X and Y as follows:

covXY = rXY SDX SDY        (1.1)

where rXY is the Pearson correlation and SDX and SDY are their standard deviations.4 A covariance estimates the strength of the linear relation between X and Y in their original (raw score) units, albeit with a single number. Because the covariance is an unstandardized statistic, its value has no fixed lower or upper bound. For example, covariances of, say, –1,025.45 or 19.77 are possible, given the scale of the original scores. The statistic covXY conveys more information than rXY, which says something about association but in a standardized metric only. There are times in SEM when it is appropriate to analyze standardized rather than unstandardized variables. If so, then rXY is the basic datum for variables X and Y. Reasons to analyze unstandardized versus standardized variables in SEM are described at various points in the book.

4 The covariance of a variable with itself is just its variance, such as covXX = sX².
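Equation 1.1 is easy to verify numerically. The short sketch below uses base R with made-up values (the correlation and standard deviations are hypothetical, not taken from any data set in this book) to show that rescaling a correlation by the two standard deviations yields the covariance, and that the reverse operation recovers the correlation.

    # Hypothetical correlation and standard deviations for X and Y
    r_xy <- .40
    sd_x <- 2.5
    sd_y <- 8.0

    # Equation 1.1: the covariance is the correlation rescaled into raw-score units
    cov_xy <- r_xy * sd_x * sd_y
    cov_xy                    # 8.0

    # Dividing by both standard deviations standardizes the covariance again
    cov_xy / (sd_x * sd_y)    # .40

The same relation is why the correlation is described here as conveying association in a standardized metric only, whereas the covariance preserves the original scales of the scores.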
Some researchers, especially those who use ANOVA (analysis of variance) as their main analytical tool, have the impression that SEM is concerned only with covariances or correlations. This view is too narrow because means can be analyzed in SEM, too. But what really distinguishes SEM is that means of latent variables can also be estimated. In contrast, ANOVA is concerned with means of observed variables only. It is also possible to estimate effects in SEM traditionally associated with ANOVA, including between-group and within-group (e.g., repeated measures or longitudinal data) mean contrasts. For example, in SEM one can estimate the magnitude of group mean differences on latent variables, which is not feasible in standard ANOVA. Means are not analyzed in probably most published SEM studies, but the option to do so provides additional flexibility. Several examples of the analysis of means are described later in the book.

Just as in regression analysis, it is possible in SEM to estimate curvilinear relations for continuous variables or analyze noncontinuous variables, including nominal or ordered-categorical (ordinal) variables, among other variations, as either presumed causes or outcomes. Interactive effects can also be analyzed in SEM. An outdated view is that only linear effects of continuous variables that do not interact can be estimated (Bollen & Pearl, 2013), but the reality is that SEM is extraordinarily flexible in terms of effects and types of variables that can be analyzed as presumed causes or outcomes. There are other differences between SEM and regression techniques: Associations for observed variables only are estimated in standard regression analysis, but relations for latent variables can also be estimated in SEM. In regression, the roles of predictor and criterion are theoretically interchangeable. For instance, there is no special problem in bivariate regression with specifying X as a predictor of Y in one analysis (Y is regressed on X) and then in a second analysis with regressing X on Y. There is no such ambiguity in SEM, where the specification that X affects Y is a causal link that reflects theory and also depends on other assumptions in the analysis. Thus, both the semantics and interpretation of the results in regression versus SEM are distinct (Bollen & Pearl, 2013). Yes, there are situations where standard regression techniques can be used to estimate presumed causal effects, but the context for SEM is causal modeling, not mere prediction.

FAMILY MATTERS

The method of SEM consists of three distinct families of techniques or approaches to causal inference. All originated in the pioneering work by the geneticist Sewall Wright. His method of path coefficients (Wright, 1934)—or path analysis as it is now called—featured the estimation of causal effects based on hypotheses represented in a statistical model. Wright's path models included both observed and latent variables, and in graphical form they closely resemble today's model diagrams in SEM (e.g., Pearl, 2009, p. 415). Presumed causal effects were estimated in sample data, and predictions based on the model were compared with patterns of observed associations in samples. In hindsight, Wright's innovations were remarkable, and his work continues to have influence to this day.

Although all three SEM families date to key innovations in the 1970s–1980s, each was further developed in relatively distinct sets of research disciplines or areas. Consequently, other than a handful of scholars knowledgeable of the histories and intricacies in at least two different SEM families, many researchers trained in a particular SEM family were only partially aware of other possibilities or approaches to causal modeling. But things are changing, in part due to the increasing influence of multidisciplinary research that includes several disciplines in an integrated way under the same
subject or area of study. The promise is that different viewpoints or methods over disciplines will complement each other and provide a more comprehensive understanding of the problem or novel approaches to dealing with it (Choudhary, 2015).

The three SEM families are listed next and described afterward:

1. Researchers in psychology and related disciplines are probably most familiar with the SEM family called covariance structure analysis, covariance structure modeling, or covariance-based SEM. All techniques of this type estimate parameters of causal models made up of observed variables or proxies for latent variables by minimizing the difference between the sample covariance matrix and the predicted covariance for the same measured variables, given the model, which represents hypotheses about how and why model variables should be related (covary). Latent variables are approximated with common factors of the type analyzed in classical factor analysis techniques that date to the beginning of the 1900s (e.g., Spearman, 1904). The technique of confirmatory factor analysis (CFA) is a member of this family. Although this choice reflects my own background in psychology, I will use the simpler term "traditional SEM" to refer to the covariance-based SEM family.5

5 Yes, what is considered as tradition by a person with a particular background may be seen as novelty to another person from a different background. That's life and multidisciplinary research.

2. The second member of the SEM family is better known among researchers in disciplines such as marketing, organizational research, business research, or information systems, among others. It is called variance-based SEM, composite SEM, or partial least squares path modeling (PLS-PM). The term "PLS-PM" also refers to an estimation algorithm based on regression techniques that underlie many, but not all, applications of composite SEM. It analyzes composites, or weighted combinations of observed variables, to approximate hypothetical constructs. The term "variance-based" signals that the goal is not strictly to explain the sample covariance matrix, although this goal can also be pursued in a composite analysis. Instead, these techniques analyze total variation among observed variables when estimating causal effects, given the model. The classical general linear model (GLM) of multivariate statistics that includes multiple regression, MANOVA (i.e., multivariate ANOVA), and canonical variate analysis (canonical correlation), among others, analyzes composites, too, but the focus of composite SEM is aimed more at causal modeling. The technique of confirmatory composite analysis (CCA), a member of this family, is the composite-based analogue to CFA. The name "composite SEM" is used from this point forward.

3. The third family member more familiar to researchers in epidemiology, computer science, and medicine is the structural causal model (SCM) or nonparametric SEM, which originated in Judea Pearl's work in the 1970s–1980s on Bayesian probability networks and later extended to the more general problem of causal inference (Pearl, 2009). In the SCM, causal hypotheses are represented in a directed acyclic graph (DAG) when unidirectional causation is assumed or in a directed cyclic graph (DCG) when certain variables are hypothesized to affect each other, or reciprocal causation. The method is nonparametric because the specification of a causal graph requires no commitment to any particular operational definition, distributional assumption, or specific functional form of statistical association, such as linear versus curvilinear, for any pair of variables. Unlike model diagrams in the other two SEM families, which are basically static entities that require data to be analyzed, there are special methods and computer programs for analyzing a causal graph with no data. This capability permits the researcher to analyze alternative causal models in the planning stage of a study. The results of such analyses can inform the researcher about how to select covariates that control possible confounding of target causal effects, among other possible ways to deal with the problem. This approach has also motivated the development of novel approaches to mediation analysis that are described later in the book. From now onward, this family is called "nonparametric SEM."

Traditional SEM

This SEM family is a synthesis of two frameworks: the path analytic method by Wright, based on regression techniques to estimate causal effects; and the factor analytic approach to estimate latent variables from observed or manifest variables. It dates to (1) the introduction of path analysis to the social sciences in the 1950s–1960s by Blalock (1961) and others and (2) the subsequent integration of regression techniques and factor-analytic methods into a unified framework
in the 1970s–1980s called the JWK model (Bentler, 1980). The acronym refers to the work of three authors: K. G. Jöreskog, J. W. Keesling, and D. Wiley. The first widely available computer program to fit causal models to sample covariance matrices was LISREL III by Jöreskog and Sörbom (1976), which was the progenitor for later versions of LISREL and other computer programs that include Amos, EQS, lavaan, and Mplus, among others.

This family of traditional SEM techniques is probably the most widely used. It offers the potential benefits summarized next (Bagozzi & Yi, 2012): By integrating regression techniques with factor analytic methods, it is possible to estimate causal effects for any combination of observed or latent variables specified as presumed causes or outcomes. The explicit distinction between observed and latent variables can take direct account of measurement error, or score unreliability. This characteristic lends a more realistic sense to the analysis in that researchers in the behavioral sciences often analyze scores that are subject to measurement error. As mentioned, traditional SEM supports a range of applications from exploratory to more confirmatory, depending on the researcher's hypotheses and aims. There is a rich variety of models that can be analyzed for longitudinal data, more so than in composite SEM.

There are also limitations of traditional SEM: It requires large samples, which makes it challenging to apply the method in research areas where it is difficult to collect large numbers of cases, such as in studies of rare disorders. Exactly what is meant by "large samples" is addressed in a later section of this chapter. There are certain types of hypotheses that are difficult to test in this SEM family. An example is when hypothetical constructs are defined in ways that contradict the estimation of latent variables with common factors as one would do in the technique of CFA. In such cases, the composite SEM family—which uses composites to approximate concepts, not common factors—may serve as a better alternative for reasons explained in Chapters 3 and 16. Other limitations are due to the misuse of traditional SEM, which is unfortunately widespread. These problems are not due to the method per se, but I would guess that many, if not most, published SEM studies have one flaw so severe that the results may have little or no interpretative significance. This critical issue is also elaborated later in the chapter. See Bollen, Fisher, Lilly, et al. (2022), who described the 50-year history of traditional SEM since 1972, reviewing strengths and vulnerabilities, and outlining future directions.

Traditional SEM itself is part of an extended family of methods for analyzing latent variable models that is briefly outlined next; see the sources cited for more information. For example, latent variables in traditional SEM are assumed to be continuous. There are other techniques for analyzing models with categorical latent variables. The levels of a categorical latent variable are called classes, and they represent a mixture of subpopulations where membership is not known but is inferred from the data. Thus, a goal of the analysis is to estimate the nature and number of latent classes. The technique of latent class analysis can be viewed as a kind of factor analysis but one where classes are approximated from observed variables that could be either categorical or continuous. Lanza and Rhoades (2013) described applications of latent class analysis to identify subgroups in treatment outcome studies.

Muthén (2001) described the analysis of mixture models—also called mixture modeling—with latent variables that may be continuous or categorical. When both are present in the same model, the analysis is basically traditional SEM but conducted across inferred subpopulations. The work just cited is part of the larger ongoing effort to express all latent variable models within a common mathematical framework (Bartholomew, 2002). The Mplus computer program is especially adept at analyzing a variety of latent variable models. This is because it can analyze all basic kinds of traditional SEM models and mixture models, too. Both kinds of analyses just mentioned can also be combined in Mplus with multilevel modeling—also called linear mixed modeling or hierarchical linear modeling—for analyzing data with repeated measurements or that are organized in hierarchical levels, such as children within families, where scores within the same level are probably not independent (Nezlek, 2008). Computer programs like Mplus blur the distinction between traditional SEM and techniques such as latent class analysis, mixture modeling, and multilevel modeling.

Composite SEM

Composite SEM was developed in the 1970s–1980s by Herman O. A. Wold (1982). It analyzes composites as proxies for hypothetical variables, not common factors. Statistical methods for composites are generally simpler than methods for common factors, and there are fewer distributional assumptions and other requirements for composite methods. This approach was once described as a "soft modeling" alternative to
"heavyweight" traditional SEM with its complex estimation algorithms, potential for technical problems to scuttle the analysis, and the need for large samples. In its early years, the emphasis in composite SEM was on prediction of target constructs by other variables, including covariates or other composites for different theoretical variables. Because estimators were based on standard regression techniques, these methods generally maximized R2, or the proportion of explained variance, for outcome variables. In contrast, traditional SEM techniques do not necessarily maximize R2 for individual outcomes. Thus, composite SEM at the time was generally thought of as a prediction method versus traditional SEM, which was seen as emphasizing explanation through maximizing the similarity of sample and model-implied data matrices.

In addition to the distinction of prediction versus explanation, authors of works about composite SEM published roughly in the 1990s–2010s (e.g., Hair et al., 2012) highlighted these relative advantages: For the same number of observed variables in the model, composite SEM generally required smaller sample sizes than traditional SEM, and the power of significance tests was described as generally higher in composite SEM for the same sample size. Its relative lack of distributional assumptions was rightly touted as an advantage, as was its more exploratory nature compared with traditional SEM. Drawbacks included its lack of a direct way to control for measurement error or test the overall fit of the model to the data, which are standard practices in traditional SEM. The inability to assess overall fit in composite SEM limited its role in testing theories; specifically, it offered no direct way to determine how well the model as a whole explained the data. At the time, composite SEM could be described as "SEM-lite" for testing hypotheses about latent variables compared with its older brother, traditional SEM.

Oh how times change. Technical and conceptual advances in composite SEM since about 2010 have been a turning point, and this is no hype. (I am not big on hype; keep it real, please.) Just a few key developments are summarized next; see Henseler (2021) and Chapter 16 in this book for more detailed accounts: It is now possible to apply composite SEM across the whole range of studies from more exploratory to more confirmatory. One reason is that new methods allow the researcher to test the fit of the whole model to the data, just as in traditional SEM. The same basic stages of the analysis, from specification of the model through its analysis and possible modification based on the data, can now be followed in both composite SEM and traditional SEM. Very recent developments (e.g., Schuberth, 2021) extend the capabilities of standard SEM computer programs like lavaan, Mplus, LISREL, and others to analyze composite models, which brings to the analysis all the advantages formerly associated with traditional SEM. Composite SEM can also test kinds of hypotheses about measurement that are difficult to evaluate in traditional SEM, and in such cases composite SEM is the preferred technique. Relatively new estimators can directly control for measurement error in composites. All these developments explain why I think it is important for you to know about this member of the SEM family, too.

Nonparametric SEM

Pearl (2009) described his graph-theoretic approach, the SCM, as unifying two frameworks for causal inference, traditional SEM and the potential outcomes model (POM), also called the Neyman–Rubin model after Jerzy Neyman and Donald Rubin (Rubin, 2005). Briefly, the POM elaborates on the role of counterfactuals in causal inference. A counterfactual is a hypothetical or conditional statement that expresses not what has happened but what could or might happen under different circumstances (e.g., "I would not have been late, if I had correctly set the alarm"). In treatment outcome studies, for example, there are two basic counterfactuals: (1) what would the outcome of control cases be, if they were treated; and (2) what would the outcome of treated cases be, if they were not treated? If each case was either treated or not treated, these potential outcomes would not be observed. This means that the observed data—outcomes for treatment versus control—is a subset of all possible combinations.

The POM is concerned with conditions under which causal effects may be estimated from data that do not include all potential outcomes. In experimental designs, random assignment helps with the replicability of studies in terms of the equivalence of treated and control groups. Thus, any observed difference in outcome can be attributed to the treatment, provided there is sufficient replication. But things are more complicated in quasi-experimental or nonexperimental designs where the assignment mechanism is both nonrandom and unknown. In this case, the average observed difference may be a confounded estimator of treatment effects versus selection factors. In such designs, the POM distinguishes between equations for the observed data
versus those for causal parameters. This feature helps to clarify how estimators based on the data may differ from estimators based on the causal model (MacKinnon, 2008). The POM has been applied many times in randomized clinical trials and in mediation analysis, a topic covered later in this book.

Some authors describe the POM as a more disciplined method for causal inference than traditional SEM (Rubin, 2009), but such claims are problematic for two reasons (Bollen & Pearl, 2013). First, it is possible to express counterfactuals in SEM as predicted values for outcome variables, once we fix values of its causal variables to constants that represent the conditions in counterfactual statements (Kenny, 2021). Second, the POM and SEM are logically equivalent in that a theorem in one framework can be expressed as a theorem in the other. That the two systems encode causal hypotheses in different ways—in SEM as functional relations among observed or latent variables and in the POM as statistical relations among counterfactual (latent) variables—is just a superficial difference (Pearl, 2009).

Thus, the SCM as outlined by Pearl is a framework for causal inference that extends the capabilities of both traditional SEM and the POM. This is why Hayduk et al. (2003) described the SCM as the future of SEM and also why I introduce it to readers both in this edition and in the previous (4th) edition of this book. Grace et al. (2012) described the SCM as the third generation of SEM. The first generation of SEM dates to Sewall Wright, who invented path analysis and corresponding path diagrams as the graphical expression of causal hypotheses. The second generation of SEM is the synthesis of path analysis and factor analysis in traditional SEM.6 Earlier I described the possibility in this approach of analyzing the graphical model before the data are collected as a way to help plan the study. These methods can also locate testable implications implied by the graph. No special software is needed to analyze the data. This is because testable implications for continuous variables can be estimated with standard computer tools for statistical analysis, such as IBM SPSS, with no need for special SEM software. This particular approach is called piecewise SEM, which is described in Chapter 8.

6 Muthén (2001) described second-generation SEM as the capability in traditional SEM to analyze continuous or categorical outcomes in models with fixed or random effects, such as latent growth models.

PEDAGOGY AND SEM FAMILIES

Next, I explain my approach to teaching you about SEM, given the descriptions of the three SEM families just considered. First, beginning in Chapter 6, I describe the logic of causal inference for nonparametric structural equation models where only theoretical variables are represented in a causal graph. These variables are not yet operationalized. Of course, theoretical variables must eventually be measured. There could be a variety of ways to operationalize a construct. For example, there is single-indicator measurement, where a single observed variable is the proxy for the corresponding construct. Scores on that indicator could be treated as ordinal data that measure relative standing without assuming equal intervals or treated as continuous data if intervals are assumed to be more or less equal, among other possibilities. An alternative is multiple-indicator measurement, where a set of ≥ 2 observed variables is used to approximate the same theoretical variable. That is, instead of placing all of one's measurement eggs in the basket of a single observed variable, scores from sets of multiple variables are combined when approximating theoretical variables.

But exactly how target constructs should be measured is a decision that should come after working out your basic causal hypotheses by specifying and analyzing a causal graph in nonparametric SEM. Concepts about ways to control confounding and determine whether it is possible to estimate specific causal effects in the graph are reviewed. If analysis of the causal graph indicates that a particular effect cannot be estimated, then something needs to be done, such as adding covariates to address confounding. If a causal effect can be estimated, analysis of the graph may indicate that more than one estimate of that effect could be generated in the data. Such knowledge helps the researcher to plan the analysis in ways that prevent unpleasant surprises, such as discovering after collecting the data that certain hypothesized effects cannot be estimated without adding variables to the model (for which it might be too late).

Next, I describe in Chapter 7 the specification of parametric causal models that correspond to actual measured variables. To keep things simpler, the measurement approach is single indicator and the corresponding parametric models are manifest variable path models analyzed in the classical technique of path analysis, the most senior member of the traditional SEM family. I outline in Chapters 8–12 the
fundamental principles of estimating model parameters, evaluating model fit to the data, testing hypotheses about the model, respecifying the model if its account of the data is rejected, comparing alternative models fit to the same data, and analyzing the same model over multiple groups or conditions. These topics are the core of traditional SEM. They all generalize to latent variable models based on multiple-indicator measurement of the kind analyzed in the technique of CFA, but I want you to understand these fundamentals before we add the complexities of multiple-indicator measurement to the mix. That is, even if you are most interested in latent variable modeling, a strong knowledge of path analysis with observed variables will help you get there.

Coverage of latent variable models begins in Chapter 13 with a conceptual treatment of multiple-indicator measurement. I should caution you that this chapter has a different structure compared with most other chapters in the book. Specifically, there are no data analysis examples nor equations. Instead, we deal with concepts about measurement with multiple indicators that the researcher should understand before applying specific analysis techniques. The goal is to help you specify models where sets of multiple indicators approximate hypothetical constructs in ways consistent with your hypotheses.

Described in Chapters 14–15 are, respectively, the traditional SEM technique of CFA for analyzing measurement models and the analysis of so-called latent variable path models where single or multiple indicators are used to approximate theoretical variables. These types of models represent the apex in traditional SEM that can be extended in many ways to test additional, more advanced hypotheses that are covered in Chapters 17–22. Chapter 16 is devoted to composite SEM with an emphasis on explaining the kinds of hypotheses about measurement that are difficult to test in traditional SEM but can be evaluated with relative ease in composite SEM. I do not describe traditional SEM and composite SEM as competitors; instead, I emphasize their unique roles and capabilities as complementary approaches.

SAMPLE SIZE REQUIREMENTS

Attempts to adapt SEM techniques to work in smaller samples are described in Chapter 17, but it is still generally true that SEM (more so in traditional than in composite SEM) is a large-sample technique. Implications of this characteristic are considered throughout the book, but I can say now that certain types of estimates in SEM, such as standard errors for effects of latent variables, may be inaccurate when the sample size is not large. The risk for technical problems in the analysis is greater, too.

Because sample size is such an important issue, let us now consider the bottom-line question: What is a "large enough" sample size in SEM? It is impossible to give a single answer because the factors that are summarized next can affect sample size requirements:

1. More complex models, or those with more parameters, require bigger sample sizes than simpler models with fewer parameters. This is because models with more parameters require more estimates, and larger samples are necessary in order for the computer to estimate the additional parameters with reasonable precision.

2. Analyses in which all outcome variables are continuous and normally distributed, all effects are linear, and there are no interactive effects require smaller sample sizes. This is in comparison to analyses in which some outcomes are not continuous or have severely nonnormal distributions or in which there are curvilinear or interactive effects. Sample size limitations also curtail the use of estimation methods available in SEM, some of which need very large samples because of the assumptions they make—or do not make—about the data.

3. Larger sample sizes are needed if score reliability is relatively low; that is, less precise data require larger samples in order to offset the potential distorting effects of measurement error. Latent variable models can control measurement error better than observed variable models, so fewer cases may be needed when there are multiple indicators for constructs of interest. The amount of missing data also affects sample size requirements. As expected, higher levels of missing data require larger sample sizes in order to compensate for loss of information.

4. There are also special sample size considerations for particular kinds of structural equation models. In factor analysis, for example, larger samples may be needed if there are relatively few indicators per factor, the factors explain unequal proportions of the variance across the indicators, some indicators covary
Pt1Kline5E.indd 15 3/22/2023 3:52:36 PM


16 Concepts, Standards, and Tools

ciably with multiple factors, the number of factors is sample size would be 20q, or N = 200. Less ideal would
increased, or covariances between factors are relatively be an N:q ratio of 10:1, which for the example just given
low. for q = 10 would be a minimum sample size of 10q, or
N = 100. As the N:q ratio falls below 10:1 (e.g., N = 50
Given all of these influences, there is no simple rule for q = 10 for a 5:1 ratio), so does the trustworthiness
of thumb about sample size that works across all stud- of the results. The risk for technical problems in the
ies. Also, sample size requirements in SEM can be analysis is also greater.
considered from at least two different perspectives, (1) It is even more difficult to suggest a meaningful
the number of cases required in order for the results to absolute minimum sample size, but it helps to consider
have adequate statistical precision versus (2) minimum typical sample sizes in SEM studies. A median sample
sample sizes needed in order for significance tests in size may be about 200 cases based on reviews of stud-
SEM to have reasonable power. Recall that power is the ies in different research areas, including operations
probability of rejecting the null hypothesis in signifi- management (Shah & Goldstein, 2006) and education
cance testing when the alternative hypothesis is true in and psychology (MacCallum & Austin, 2000). But
the population. Depending on the model and analysis, N = 200 may be too small when analyzing a complex
sample size requirements needed for power to equal or model or outcomes with nonnormal distributions, using
exceed, say, .95, can be much greater than those needed an estimation method other than maximum likelihood,
for statistical precision. or finding that there are missing data. With N < 100,
Results of a computer simulation (Monte Carlo) almost any type of SEM may be untenable unless a very
study by Wolf et al. (2013) illustrate the difficulty with simple model is analyzed, but models so basic may be
“one-size-fits-all” heuristics about sample size require- uninteresting. Barrett (2007) suggested that reviewers
ments in SEM. These authors studied a relatively small of journal submissions routinely reject for publication
range of structural equation models, including factor any SEM analysis where N < 200 unless the popula-
analysis models, manifest-variable versus latent-vari- tion studied is restricted in size. This recommendation
able models of mediation, and single-indicator versus is not standard practice, but it highlights the fact that
multiple-indicator measurement models. Minimum analyzing small samples in SEM is problematic.
sample sizes for both precision and power varied widely Most published SEM studies are probably based on
across the different models and extent of missing data. samples that are too small. For example, Loehlin and
For example, minimum sample sizes for factor analysis Beaujean (2017) noted that results of power analyses
models ranged from 30 to 460 cases, depending on the in SEM are “frequently sobering” because researchers
number of factors (1–3), the number of indicators per often learn that their sample sizes are far too small for
factor (3–8), the average correlation between indicators adequate statistical power. Westland (2010) reviewed
and factors (.50–.80), the magnitude of factor correla- a total of 74 SEM studies published in four different
tions (.30–.50), and the extent of missing data (2–20% journals on management information systems. He esti-
per indicator). mated that (1) the average sample size across these
I describe in Chapter 10 various methods to estimate studies, about N = 375, was only 50% of the minimum
target sample sizes in traditional SEM, but here I can size needed to support the conclusions; (2) the median
suggest at least a few rough guidelines about sample sample size, about N = 260, was only 38% of the mini-
size requirements for statistical precision: For latent mum required and also reflected substantial negative
variable models where all outcomes are continuous and skew in undersampling; and (3) results in about 80% of
normally distributed and where the estimation method all studies were based on insufficient sample sizes. We
is maximum likelihood—the default method in most will revisit sample size requirements in later chapters,
SEM computer tools—Jackson (2003) described the but many, and probably most, published SEM studies
N:q rule. In this heuristic, Jackson (2003) suggested are based on samples that are too small.
that researchers think about minimum sample sizes
in terms of the ratio of the number of cases (N) to the
number of model parameters that require statistical BIG NUMBERS, LOW QUALITY
estimates (q). A recommended sample-size-to-param-
eters ratio would be 20:1. For example, if a total of q There is no denying that SEM is increasingly “popular”
= 10 parameters requires estimates, then a minimum among researchers. Thousands of SEM studies have

Pt1Kline5E.indd 16 3/22/2023 3:52:36 PM


Promise and Problems 17

been published at an accelerating rate since the 1990s Many problems were apparent. For instance, an explicit
or so (Hair et al., 2012; Thelwall & Wilson, 2016). The justification for using SEM was provided in only
increasing availability of computer programs, both about 40% of the studies; distributional assumptions
commercial and free, has made SEM in its various were addressed in about 20%; a specific rationale for
forms ever more accessible to applied researchers. It the sample size was given in about 30%; and specific
is not hard to understand the enthusiasm for SEM. As details about the correspondence between model and
described by David Kenny in the Series Editor Note in data were reported in 20% of reviewed studies. These
the second edition of this book, researchers love SEM results are disheartening, but they are hardly atypical.
because it addresses questions they want answered Nothing in SEM protects against equivalent models,
and it “thinks” about research problems the way that which explain the data just as well as the researcher’s
researchers do. But there is evidence that many—if preferred model, but an equivalent model makes dif-
not most—published reports of the application of SEM fering causal claims. The problem of equivalent mod-
have serious flaws as described next. els is relevant in probably most applications of SEM,
MacCallum and Austin (2000) reviewed about 500 but most authors of SEM studies do not even mention
SEM studies in 16 different psychology research jour- it (MacCallum & Austin, 2000). Ignoring equivalent
nals, and they found problems with the reporting in models is a serious kind of confirmation bias whereby
most cases. For example, in about 50% of the articles, researchers test a single model, give an overly posi-
the reporting of parameter estimates was incomplete tive evaluation of the model, and fail to consider other
(e.g., unstandardized estimates omitted); in about 25% explanations of the data (Shah & Goldstein, 2006). The
the type of data matrix analyzed (e.g., correlation vs. potential for confirmation bias is further strengthened
covariance matrix) was not described; and in about by the lack of replication, a point considered next.
10% the model specified or the indicators of factors It is rare when SEM analyses are replicated across
were not clearly specified. Shah and Goldstein (2006) independent samples either by the same researchers
reviewed 93 articles in four operations management who collected the original data (internal replication)
research journals. In most articles, they found that it or by other researchers who did not (external replica-
was hard to determine the model actually tested or the tion). The need for large samples in SEM complicates
complete set of observed variables. They also found replication, but most of the SEM research literature is
that the estimation method used was not mentioned in made up of one-shot studies that are never replicated.
about half of the articles, and in 31 out of 143 studies, It is critical to eventually replicate a structural equation
the model described in the text did not match the statis- model if it is ever to represent anything beyond a mere
tical results reported in text or tables. statistical exercise. Thus, the SEM research literature
Zhang et al. (2021) reviewed 144 SEM studies pub- is part of the broader replication crisis in psychology
lished in 12 top organizational and management jour- and other disciplines. Kaplan (2009) noted that despite
nals in 2011–2016. Each article was evaluated against over 40 years of application of SEM in the behavioral
criteria that included sciences, it is rare that results from SEM analyses are
used for policy or clinically relevant prediction studies.
1. The clarity of the rationale for using SEM over The ultimate goal of SEM—or any other type of
alternative methods. method for statistical modeling—should be to attain
2. Whether the statistical models analyzed were what I call statistical beauty, which means that the
described in sufficient detail. final retained model (if any)
3. The completeness of the reporting on data integ-
rity, including the rationale for the sample size and 1. Has a clear theoretical rationale (i.e., it makes sense).
information about distributional assumptions, miss- 2. Differentiates between what is known and what is
ing data, and reliability of the scores analyzed. unknown—that is, what is the model’s range of con-
4. Whether hypotheses were tested in a clear, specific venience, or limits to its generality?
order. 3. Sets conditions for posing new questions.
5. Whether the statistical results were described in
adequate detail so that readers could evaluate the That most applications of SEM fall short of these goals
trustworthiness of the conclusions. should be taken as an incentive by all of us to do better.

Pt1Kline5E.indd 17 3/22/2023 3:52:36 PM


18 Concepts, Standards, and Tools

LIMITS OF THIS BOOK ing the data, and the potential to distinguish between
observed and latent variables and to test a wide range
Many advanced applications in SEM are described in of hypotheses about measurement and causation. More
Chapters 17–22, but it is impossible (and undesirable, and more researchers are using SEM, but in too many
too) to cover the whole range of extended or special- studies there are serious problems with the way it is
ized analyses in a single volume. Just a few of these applied or with how analysis results are reported. How
topics are mentioned next with citations for sources to avoid getting into trouble with SEM is a major theme
that provide more information. There are special struc- in later chapters of this book. The ideas introduced in
tural equation models for imaging data, such as func- this chapter set the stage for review in the next chap-
tional magnetic resonance imaging (fMRI) (Cooper et ter of fundamental principles in statistics that underlie
al., 2019), and also for genetic data (Luo et al., 2019). SEM.
Bayesian SEM combines the methods of Bayesian esti-
mation with the analysis of structural equations models
(Depaoli, 2021). Bayesian options for SEM are men- LEARN MORE
tioned at various points in the book, but their applica-
tion requires strong knowledge of Bayesian statistics. Bollen and Pearl (2013) describe myths about traditional
Two other advanced topics not covered in this book SEM, Hair (2021) outlines the history of composite SEM, and
Wolfle (2003) traces the introduction of path analysis to the
include multilevel SEM (Castanho Silva et al., 2020)
social sciences.
and the analysis of interactive effects of latent variables
(Cortina et al., 2021), although methods for observed
Bollen, K. A., & Pearl, J. (2013). Eight myths about causal-
variables are described in Chapters 12 and 20. ity and structural equation models. In S. L. Morgan
(Ed.), Handbook of causal analysis for social research
(pp. 301–328). Springer.
SUMMARY
Hair, J. F. (2021). Reflections on SEM: An introspective, idio-
The SEM families of techniques have their origins syncratic journey to composite-based structural equation
in regression-based analyses of observed variables, modeling. SIGMIS Database, 52(SI), 101–113.
factor-analysis-based evaluations of common factor Wolfle, L. M. (2003). The introduction of path analysis
models, composite-based analyses of proxies for theo- to the social sciences, and some emergent themes: An
retical variables, and methods from computer science annotated bibliography. Structural Equation Modeling,
for analyzing causal graphs. Essential features include 10(1), 1–34.
the capabilities to analyze causal models before collect-

Pt1Kline5E.indd 18 3/22/2023 3:52:36 PM


2

Background Concepts and Self‑Test

Newcomers to SEM should have good statistical knowledge in at least three areas: (1) regression analysis;
(2) correct interpretation of results from statistical significance testing including the role of bootstrapping;
and (3) psychometrics, or statistical measures of the properties of scores from psychological tests including
evidence for their reliability or validity. Some estimates in SEM are interpreted exactly as regression coef-
ficients, and these interpretations depend on many of the same assumptions as in regression analysis. The
potential for bias due to omitted predictors that covary with measured predictors or due to measurement
error in predictors or the criterion is similar in both regression and SEM when (1) there is a single observed
measure of each construct, and (2) score reliability is not explicitly represented nor controlled in the analysis.
Results of significance testing both for the whole model and for estimates of its individual parameters are
widely reported in SEM studies, although statistical significance is hardly the sole basis for inference in SEM
for reasons explained throughout the book. It is also true that results in significance testing are widely mis-
understood in perhaps most analyses, including SEM, and you need to know how to avoid making common
mistakes. If predictor or outcome variables are measured with psychological tests, their score reliabilities
should be routinely assessed in the researcher’s sample. Knowledge of psychometrics is also essential when
selecting among alternative measures of the same target concept.
This chapter serves as a check on your understanding of regression fundamentals, significance test-
ing, and psychometrics, and as a gateway to additional resources for self-study. Specifically, after a review
of potential obstacles to a strong command of these topics and a summary of background topics from the
perspective of learning about SEM, a self-test of comprehension with a scoring system is provided. Relatively
low scores (e.g., < 50% correct) in a particular area, such as psychometrics, would signal the need for further
reading. To that end, a total of three supplementary chapters, the Regression Primer, Significance Testing
Primer, and Psychometrics Primer, are freely available on the website for this book. Some advice: Even if you
think that you already know these topics, you should take the self-test. Many readers tell me that they learned
something new after hearing about the issues outlined next.

UNEVEN BACKGROUND States and Canada about training in statistics, measure-


PREPARATION ment, and research methods. They replicated a compa-
rable survey from almost 20 years before on the same
There is evidence that not all of the background top- topic. In 2008, the median for training in statistics and
ics just mentioned are adequately covered in graduate measurement was 1.6 years associated mainly with a
school. For example, Aiken et al. (2008) surveyed just 1-year introductory statistics sequence. About 80% of
over 200 psychology doctoral programs in the United doctoral programs offered in-depth training in both the

19

Pt1Kline5E.indd 19 3/22/2023 3:52:36 PM


20 Concepts, Standards, and Tools

“old standards” (e.g., ANOVA) and multiple regres- Reasons for this critical assessment include the likeli-
sion (MR) with the result that students could generally hood that many, if not most, p values reported in the
perform such analyses themselves (Aiken et al., 2008). literature are wrong, the myriad of false beliefs associ-
Significance testing in ANOVA, MR, and other tech- ated with significance testing, and the concern about p
niques would also typically be covered in introductory hacking—see Topic Box 2.1 for elaboration.
graduate statistics courses (Kline, 2020b). But whether Instruction in measurement was available in about
graduate students—or even experienced researchers— 60% of psychology doctoral programs surveyed by
understand significance testing or related concepts, Aiken et al. (2008), but coverage of core topics such
such as confidence intervals, is questionable, a topic as classical test theory, item response theory (IRT),
addressed soon. and test construction was typically brief with a median
Aiken et al. (2008) also described critical gaps in length of just 4.5 weeks for all topics. This is not enough
statistics or measurement training. For instance, only time to develop any real expertise about measurement.
about 30% of doctoral programs offered in-depth In fact, I bet that an undergraduate student who has
training in regression diagnostics, or techniques for taken a one-semester (e.g., 3-credit) course in psycho-
assessing whether assumptions are tenable or if there metrics knows more about psychological measurement
are scores, such as outliers, with undue influence on the and test construction than a graduate student without
results. Misunderstanding about assumptions in MR is this background. The meagre amount of doctoral-level
common. Examples include being unaware that per- measurement training in 2008 was actually a slight
fect score reliability for predictor variables is assumed, improvement over results for the same area reported 20
believing falsely that measurement error necessarily years earlier.
causes underestimation of individual regression coef-
ficients, and the myth that the requirement for normal
distributions applies to the observed scores instead POTENTIAL OBSTACLES TO LEARNING
of the residuals (Williams et al., 2013). Over 90% of ABOUT SEM
nearly 900 articles in clinical psychology research jour-
nals reviewed by Ernst and Albers (2017) were unclear I do not know whether the results just summarized
about assumption checks in MR analyses. Another gap about psychology graduate training in statistics, mea-
is that logistic regression for dichotomous outcomes is surement, and research design apply generally to other
covered in depth in only about 10% of doctoral pro- disciplines. But in my experience, based on working
grams (Aiken et al., 2008). This result suggests that with SEM novices from disciplines such as education,
training in other methods for binary data, such as the biology, health sciences, operations management, mar-
probit regression model, or methods for categorical keting, and commerce, researchers face the following
outcomes with three or more levels, such as ordinal potential obstacles to learning about SEM:
regression for ordered categories or multinomial logis-
tic regression for unordered categories, is infrequent at 1. The emphasis in graduate training in regression
the graduate level. techniques is mainly on continuous outcomes, so new
Another gap is that controversies in significance test- researchers may be relatively unprepared to apply esti-
ing are thoroughly covered in only about 30% of psy- mators for categorical outcomes in SEM.
chology doctoral programs (Aiken et al., 2008). This is
unfortunate because ongoing debate about the proper 2. Overemphasis of statistical significance can blind
role of significance testing, including none, is part of a researchers to other aspects of the results that are just
larger credibility crisis about psychology research that as critical, if not even more so, than p values. Antona-
includes concerns about replication, reporting, mea- kis (2017) referred to this particular stumbling block
surement, and the wasting of perhaps most research as significosis, or an ordinate focus on statistical sig-
funds in some areas (Szucs & Ioannidis, 2017). How to nificance. A related distortion is dichotomania, or the
properly report results from SEM analyses is covered compulsion to dichotomize continuous p values against
in the next chapter, but traditional significance test- an arbitrary standard, such as p < .05 for results touted
ing is seen as lacking in more and more disciplines; as “significant” versus p ≥ .05 for other results that are
expert and novice researchers alike are unaware of discounted, ignored, or lamented because they are “not
that decline, which is a serious problem (Kmetz, 2019). significant.” This binary interpretation is not supported

Pt1Kline5E.indd 20 3/22/2023 3:52:36 PM


Background Concepts and Self‑Test 21

TOPIC BOX 2.1

Cautionary Tales About Significance Testing:


Inaccuracies, Errors, and Hacking
For two reasons, most p values reported in the research literature could be incorrect:

1. Assumptions of significance testing—­random sampling from population distributions with known


properties (e.g., normality, homoscedasticity) and the absence of all other sources of error other
than sampling error—are generally implausible. Most samples in human studies are ad hoc, or
samples of convenience selected because they happen to be available. Convenience sampling
may have little, if anything, to do with random sampling. Few distributions in real data sets are
normally distributed or homoscedastic, and even slight departures from distributional assumptions
in small, unrepresentative samples can grossly distort p values (Erceg-Hurn & Mirosevich, 2008).
2. There are mistakes in too many journal articles in the reporting of p values. Nuijten et al. (2016)
found that over 50% of reviewed articles published in psychology research journals during the
years 1985–2013 contained at least one incorrect p value, given reported test statistics and
degrees of freedom.

There is ample evidence that most researchers, including those with the highest levels of statistical
training, do not fully understand what p values mean (McShane & Gal, 2016). Psychology professors in
two different surveys endorsed false belief about p values at rates generally no lower than among under-
graduate students (Haller & Krauss, 2002; Oakes, 1986), and about 90% of both groups endorsed at
least one incorrect interpretation. Summarized next are what I call the big five misinterpretations
of p values in significance testing. They are described for the case p < .05 when testing at the .05 level:

1. Most researchers endorse the local Type I error fallacy that the likelihood that the decision just
taken to reject the null hypothesis is a Type I error is less than 5%. This belief is wrong because any
particular decision to reject the null hypothesis is either correct or incorrect, so no probability (other
than 0 or 1.0) is associated with it. Also, it is only with sufficient replication that we could determine
whether or not the decision to reject the null hypothesis in a particular study was correct.
2. Probably most researchers endorse the odds-­against-­chance fallacy, or the false belief that
the probability that a particular result is due to chance (i.e., arose by sampling error alone) is
less than 5%. This belief is wrong because p is calculated by the computer assuming that the null
hypothesis is true, so the probability that sampling error is the only explanation is already taken
to be 1.0. Thus, it is illogical to view p as somehow measuring the likelihood of chance. Besides,
the probability that sample results are affected by error of some kind—­sampling, measurement,
­implementation, or specification error, among others—­is virtually 1.0. From this perspective, basi-
cally all sample results are wrong in that they generate incorrect point estimates of the target
parameter, and significance testing in primary studies does nothing to change this reality (Ioan-
nidis, 2005).
3. Just under half of researchers believe that the probability is less than 5% that the null hypothesis
is true, which is the inverse probability error, also called the fallacy of the transposed
conditional (Ziliak & McCloskey, 2008), and the permanent illusion due to its persistence
over time and disciplines (Gigerenzer & Murray, 1987). This error stems from forgetting that
(continued)

Pt1Kline5E.indd 21 3/22/2023 3:52:36 PM


22 Concepts, Standards, and Tools

p values are conditional probabilities of data under the null hypothesis, not the other way around.
There are direct methods in Bayesian statistics to estimate conditional probabilities of hypotheses,
but not in classical significance testing.
4. The false belief that 1 – p is the probability of finding another significant result in a future sample
is the replication (replicability) fallacy, which is endorsed by about half of researchers. For
example, if p < .05, then the likelihood of replication is believed to exceed .95 under this myth.
Knowing the probability of replication in hypothetical future samples would be very useful, but the
quantity 1 – p is just the probability of a result or one even less extreme under the null hypothesis.
In general, replication is a matter of experimental design, sampling, and whether some effect
actually exists in the population.
5. The validity (valid research hypothesis) fallacy is the myth that 1 – p is the likelihood
that the alternative hypothesis is true, a false hope endorsed by about half of researchers. As
mentioned, probabilities of hypotheses are not estimated in significance testing, and both p and
its complement, 1 – p, are conditional probabilities of data, not hypotheses.

There are many other cognitive errors about significance testing outcomes. For example, the filter
myth is the false belief that p values sort results into two categories: those “significant” findings that are
not due to chance versus the “not significant” results that are due to chance. This myth is basically an exten-
sion of the odds-­against-­chance fallacy applied to results with higher (not significant) p values. Authors
in just over 50% of nearly 800 published empirical studies reviewed by Amrhein et al. (2019) committed
the zero fallacy—also called the slippery slope of nonsignificance (Cumming & Calin-­Jageman,
2017)—by falsely interpreting the absence of statistical significance as evidence for a zero population
effect size.
The misinterpretations just described plus others—­see the Significance Testing Primer—­can promote
a cognitive style where researchers report low p values with great confidence or even bravado but with
little real comprehension (Lambdin, 2012). The same distortions can also lead to confirmation bias where
statistically significant results are uncritically accepted as proving the researcher’s hypotheses—­see Calin-­
Jageman and Cumming (2019) for examples. False beliefs also pose a challenge to instructors of a first
graduate statistics course: Students’ heads are filled with myth about significance testing that can interfere
with further learning unless those fallacies are identified and remediated (Kline, 2020b).
Confidence intervals are described as ways to summarize results in a more comprehensive way than
is possible in significance testing, and statisticians generally prefer interval estimation over point estima-
tion, or the reporting of results with no margins of error (or error bars in graphical form) (Cumming &
Calin-­Jagerman, 2017). Because significance testing and confidence intervals are based on essentially the
same concepts and assumptions, it is not surprising that similar kinds of errors can affect both. Abelson
(1997) described the law of diffusion of idiocy, which states that every misstep in significant testing
has a counterpart with confidence intervals. There is also evidence that confidence intervals, too, are
widely misinterpreted (Hoekstra et al., 2014). Listed next are some common false beliefs (Morey et al.,
2016):

1. Fundamental confidence fallacy, or the myth that confidence level, such as 95%, indicates
the likelihood (i.e., .95) that the interval contains the value of the target parameter.
2. Precision fallacy, or the misunderstanding that narrower confidence intervals automatically
signal greater precision of knowledge about the parameter.
(continued)

Pt1Kline5E.indd 22 3/22/2023 3:52:36 PM


Background Concepts and Self‑Test 23

3. Likelihood fallacy, or the untruth that a particular confidence interval includes equally likely
values for the parameter.

The phenomenon of p hacking involves the possibility of presenting any result as “significant”
through decisions, such as covariate selection, transformations of scores, or the treatment of missing data,
that are not always disclosed (Simmons et al., 2011). Most instances of hacking involve lowering p val-
ues so that key results are significant, but p values can also be increased when results that are not sig-
nificant favor the researcher’s goals. Perhaps the most infamous example of hacking to increase p is the
Vioxx calamity, where analyses of data from a clinical trial were manipulated to render nonsignificant an
increased risk of heart attack in the treatment group (Ziliak & McCloskey, 2008). Both p hacking and other
dubious practices that favor the researcher’s hypotheses are probably widespread (John et al., 2012). The
parallel in SEM is model hacking, where significance test results are manipulated to increase the odds
of retaining the model, which represents the researcher’s hypotheses. How to avoid model hacking in SEM
begins with complete and transparent reporting of the results, including explaining the bases for deciding
whether to retain any model.

by what can be trivial differences in continuous p val- SIGNIFICANCE TESTING


ues; that is, a result where p = .04 does not appreciably
differ from another result where p = .06 when testing at Many effects can be tested for statistical significance in
the .05 level. SEM, ranging from things such as the variance for a sin-
gle variable up to entire models evaluated across mul-
3. Relatively little formal training in measurement tiple samples. There are four reasons, however, why the
or psychometrics can put researchers in a difficult role for significance testing in SEM should be smaller
position when selecting measures for their research or compared with more standard techniques like ANOVA:
evaluating the quality of data from psychological tests
in their own samples. It can also hinder accurate and 1. The capability to evaluate an entire model at
complete reporting about psychometrics. once brings a higher-level perspective to the analysis.
Although statistical tests of individual effects in the
Whatever researchers lack in their training about model may be of interest, at some point the researcher
statistics, measurement, or research design when must make a decision about the whole model: Should
approaching what is, for them, a new technique such as it be rejected?—modified?—if so, how? This deci-
SEM can be addressed through self-study or participa- sion should not be based on significance testing alone
tion in seminars, summer schools, or other continuing because other factors, such as whether statistical results
education experiences. The need to periodically update are meaningful, given the research context and hypoth-
one’s data analysis skills is also a normal part of being eses, are equally if not more important than statistical
a researcher. That is, lifelong learning as self-initiated significance. There is also a sense in SEM that the view
education focused on personal or professional devel- of the entire model takes precedence over that of spe-
opment is a healthy perspective not just for research- cific details (individual effects).
ers, but for everyone. So let’s get on with knocking the 2. The technique of SEM generally requires large
rust off what you already know or adding to your skill samples, but in significance testing, effects with low p
set. Described next from the perspective of learning values in very large samples are sometimes of trivial
about SEM are special issues in significance testing, magnitude. By the same token, virtually all effects that
measurement, and regression—see the corresponding are not zero could be significant in a sufficiently large
primers for more detailed presentations. sample. Just the opposite can happen in smaller sam-

Pt1Kline5E.indd 23 3/22/2023 3:52:36 PM


24 Concepts, Standards, and Tools

ples: Effects of appreciable magnitude that are actually in the unstandardized solution, but standard errors for
true in the population may fail to be significant due to standardized estimates might be part of optional out-
low statistical power. put that must be requested by the researcher. Because
the unstandardized and standardized estimates for the
3. Researchers should be more concerned with esti-
same parameter have their own standard errors, it can
mating effect size and evaluating the substantive signif-
happen that, say, the unstandardized estimate is signifi-
icance of their results than with statistical significance,
cant but the standardized estimate is not significant at
which has little to do with scientific or practical impor-
the same level, or vice versa. This outcome is neither a
tance. In particular, researchers should not believe that
computer error nor contradictory because unstandard-
an effect or association exists just because a result is
ized and standardized estimates each have their own
significant, especially if an arbitrary threshold, such
distinct sampling distributions. Confusion about this
as p < .05, is applied to dichotomize continuous p val-
issue can be minimized by not dichotomizing p values,
ues. Likewise, researchers should not conclude that an
such as in NFSA-style reporting of the results.
effect is absent just because it is not significant (Was-
serstein et al., 2019); that is, avoid committing the zero
fallacy—see Topic Box 2.1.
MEASUREMENT AND PSYCHOMETRICS
4. Standard errors for the effects of latent vari-
ables are estimated by the computer, and those stan- Given limited training about measurement in many
dard errors are the denominators of significance tests graduate programs, it is not surprising that reporting
for those effects. The value of the standard error could on psychometrics in the research literature is too often
change if, say, a different estimation method is used or deficient. This is especially true for reliability coeffi-
sometimes even across different software packages for cients, which indicate the degree to which test scores
the same analysis and data. Thus, it can happen that an are consistent over variations in testing situations,
effect for a latent variable is “significant” in one analy- times, test administrators or scorers, forms, or selec-
sis but is “not significant” in another analysis with a tions of items from the same domain. Lower values of
different estimator or computer tool. Now, differences reliability coefficients indicate less precise scores. If
in p values across different computer programs for a reliability coefficient, designated here as rXX, equals
the same effect and estimation method are not usually zero, it means that the scores are basically random
great, but slight differences in p can make big differ- numbers, and random numbers measure nothing. The
ences in significance testing, such as p = .051 versus result rXX = 1.0 indicates flawless consistency, but such
.049 for the same effect when testing at the .05 level. a result in real data would be pretty extraordinary (i.e.,
don’t hold your breath waiting for perfection).
You may be surprised to know there is no require- Reporting values of reliability coefficients should be
ment to dichotomize p values at all. This means that routine whenever scores from psychological tests are
exact p values are simply reported but are not com- analyzed, but the reality is unfortunately different. For
pared against any bright-line rule or threshold, such as example, Vacha-Haase and Thompson (2011) reviewed
.05, or any other standard whatsoever. This means that nearly 50 meta-analyses of results from about 13,000
the stale, musty, and worn-out term “significant” is not primary studies in which test scores were analyzed.
used for results with low p values, so there are no aster- They found that about 55% of authors mentioned noth-
isks or any other symbol in text or tables that designate ing about score reliability. In 15% of reviewed studies,
“significant” results just as there is no special demarca- authors merely reported values of reliability coefficients
tion of other effects with higher p values as “not sig- from other sources, such as test manuals. Inferring
nificant.” Hurlbert and Lombardi (2009) referred to this from reliability coefficients derived in other samples,
perfectly legitimate reform of traditional significance such as a test’s normative sample, to a different popu-
testing as neo-Fisherian significance assessments lation is called reliability induction. Such reasoning
(NFSA), and it is consistent with the call to researchers about the generalizability of reliability coefficients
to stop using the term “significant” in point 3 just listed. needs explicit justification. But authors of reviewed
You should also know that SEM computer tools gen- studies rarely compared characteristics of their samples
erally print by default standard errors for the estimates with those from cited studies of score reliability. For

Pt1Kline5E.indd 24 3/22/2023 3:52:36 PM


Background Concepts and Self‑Test 25

example, scores from a computer-based task of reaction REGRESSION ANALYSIS


time developed in samples of young adults may not be
as precise for elderly adults, who may be unfamiliar Two topics are addressed next: (1) Effects of measure-
with this method of testing. ment error on the results in MR analyses and (2) the
A better practice is for researchers to report val- limited usefulness of significance testing when test-
ues of reliability coefficients in their own samples. It ing hypotheses about incremental validity in the pres-
is critical to do so because reliability is a property ence of even moderate amounts of measurement error.
of scores in a particular sample, not an immutable Perfect score reliability (rXX = 1.0) is assumed in MR
characteristic of tests. This is because the precision for the predictors but not for the criterion, also called
of scores from the same test varies over samples from the response, outcome, or dependent variable. This is
the same population due to sampling error. Variation because only the criterion has residuals, or differences
in score reliability may be even greater over samples between actual and predicted scores on the response
taken from a population different from the target pop- variable. The residuals give random measurement error
ulation for the test. Urbina (2014) used the term rela- a place to “go” for the criterion; that is, error is mani-
tivity of reliability to emphasize that (1) the quality of fested in the regression residuals. But predictors have
reliability belongs to scores, not tests, and (2) scores no residuals to “absorb” their measurement error; thus,
for individual cases are more or less reliable due to it must be assumed that their scores are perfectly con-
unique characteristics of examinees, such as motiva- sistent.
tion or fatigue, and also to examiner qualifications, Measurement error in the criterion only does not
such as experience in administering or scoring the bias unstandardized regression coefficients, if that
test, and conditions where testing takes place. Some error is unrelated to the predictors. But there is down-
of these factors are unrelated to the test itself, but they ward bias in the standardized coefficients, which also
can still influence the consistency and precision of makes the value of R2 decrease as error in the crite-
scores. Along these lines, researchers should also cite rion increases under the same condition (Williams et
reliability coefficients in published sources (reliabil- al., 2013). Things are more complicated when there is
ity induction) but with comments on the similarities error in the predictors: Measurement error in a single
between samples described in those sources and the predictor that is not shared with any other variable will
researcher’s sample. attenuate the absolute value of the regression coeffi-
Appelbaum et al. (2018) described revised jour- cient for that predictor. But regression coefficients for
nal article reporting standards for quantitative studies the other, error-free predictors can also be affected.
(JARS-Quant) for journals published by the Ameri- Whether those other coefficients are attenuated, exag-
can Psychological Association. Revised guidelines for gerated, or unbiased depends on the values and signs
reporting on psychometrics (including SEM studies) of (1) correlations between the underlying latent vari-
call on authors to able imperfectly measured by a predictor and all other
explanatory variables, and (2) the effect of the concep-
1. Report values of reliability coefficients for the tual variable just mentioned on the criterion (Bollen,
scores analyzed, if possible. 1989; Kenny, 1979). Values of these correlations and
their potential distorting effects are rarely known in
2. Describe the specifics of those reliability coeffi-
practice, so it difficult to anticipate the magnitudes and
cients, such as the length of the retest interval for
directions of this propagation of measurement error
test–retest reliabilities, characteristics of scorers
in a single predictor.
and their training for interrater reliabilities, or the
The effects of measurement error in multiple predic-
specific type of internal consistency coefficients for
tors are also hard to anticipate because there are two
composite (multi-item) scales.
sources of distortion for each predictor: Attenuation
3. Report the characteristics of external samples if bias due to error in a particular predictor, plus a term
reporting reliabilities from those samples, such as the value of which is influenced by the degree of unreli-
original test normative samples. ability error in other predictors (Kenny, 1979). It is pos-
4. Provide estimates of convergent and discriminant sible that these two sources of bias could cancel each
validity where relevant. other out, but that outcome could be rather unlikely.

Pt1Kline5E.indd 25 3/22/2023 3:52:36 PM


26 Concepts, Standards, and Tools

Whether bias is positive or negative in a particular coef- about .50, the error rate in significance testing is just
ficient is also generally unknown, but overall distortion greater than 65%. In even larger samples, the error
tends to increase as score reliabilities among the pre- rate can approach 1.0. These counterintuitive results
dictors are lower. happen because (1) there is a curvilinear relation such
Another issue is the presence of measurement error that Type I error rates generally peak over the range
that is shared over ≥ 2 predictors or with the criterion. rXX = .30–.70, and (2) Type I errors also increase with
Correlated measurement error can arise when mul- sample size, holding all else constant. In larger sam-
tiple variables are assessed with a common method, ples, significance tests have greater power to detect a
variables share common informants or stimuli, or the true relation between a predictor and the criterion, but
same variable is measured on ≥ 2 occasions among the measurement error causes the significance test to con-
same cases, among other possibilities. The signs and flate a common effect of multiple predictors with the
magnitudes of the error correlations also affect the unique contribution for a single predictor (Westfall &
degree and direction of bias in regression coefficients Yarkoni, 2016). Thus, it is possible that inferences in
(Bollen, 1989; Williams et al., 2013)—see Trafimow MR analyses about incremental validity based mainly
(2021) for more information and examples. There are on p values are suspect in many, if not most, published
ways to adjust individual regression coefficients for regression studies.
attenuation, but they assume that measurement error is You will learn that a relative strength of SEM over
not correlated and, as mentioned, measurement error standard regression techniques is the capability to
can attenuate or inflate regression coefficients. Whether explicitly represent in the model the score reliability
such simple corrections give accurate results in real- for any observed variable or the presence of correlated
world data is questionable (Williams et. al, 2013). measurement errors (if any) for particular pairs of vari-
Incremental validity concerns the relative contribu- ables. All other results, including coefficients for puta-
tions of different variables in predicting some criterion. tive causal variables, are estimated given the informa-
That contribution is estimated by the regression coeffi- tion about psychometrics specified by the researcher.
cients, which statistically control for effects of all other The potential advantage of SEM for testing hypotheses
predictors in the equation. In this way, the coefficients about incremental validity is even greater when theo-
indicate the relative contribution of each predictor retical concepts are specified as measured by multiple
above and beyond the rest.1 Many researchers interpret observed variables, or indicators. Thus, hypotheses
a statistically significant regression coefficient as evi- about incremental validity are better supported by
dence for the incremental validity of the corresponding methods like SEM that can explicitly model unreli-
predictor. There are thousands of published regression ability than in standard regression analysis (Westfall &
studies in which hypotheses about incremental valid- Yarkoni, 2016).
ity are tested in this way (Hunsley & Meyer, 2003), so
the issue of whether inferences based on p values for
regression coefficients are trustworthy is very relevant. SUMMARY
Based on rational analysis and computer simula-
tion results, Westfall and Yarkoni (2016) demonstrated It helps to begin learning about SEM with a good
that effective Type I error rates for significance tests comprehension of basic concepts in regression, sig-
of regression coefficients in MR are surprisingly high nificance testing including bootstrapping, and psycho-
even in reasonably large samples (e.g., N = 300) with metrics. Not all of these topics may be covered with
at least moderate levels of scores reliability (e.g., rXX equal depth in graduate programs, so researchers often
need to supplement their knowledge. There are also
= .80) in analyses with just two predictors. Under the
widespread myths about significance testing and psy-
conditions just stated and assuming that the bivari-
chometrics that can interfere with learning about SEM.
ate correlation of each predictor with the criterion is
For example, there are many false interpretations of
1 Other metrics or statistics for measuring incremental validity p values from significance testing that altogether can
include partial correlations, semipartial correlations (also called make researchers overly confident in their results and
part correlations), and sequential increases in R2 when predictors distract them from other aspects of their data, such as
are added to the equation in a particular order—see Grömping effect size or precision. The black-box view that reli-
(2015). ability is a property of tests rather than of scores in a

Pt1Kline5E.indd 26 3/22/2023 3:52:36 PM


Background Concepts and Self‑Test 27

particular sample runs counter to the best practice of Regression


reporting psychometrics, including values of score reli-
Questions 1–4 concern predictors X and W and crite-
abilities for the data analyzed. Measurement error in
rion Y. All variables are continuous. For the same data,
a single predictor in standard regression analysis can
the unstandardized and standardized regression equa-
bias the coefficients for other predictors, too, and its
tions are listed next as, respectively,
effects do not always involve attenuation of regression
coefficients, especially if measurement error is shared Ŷ = 2.15X + 1.30W + 2.34
over multiple predictors or with the criterion. Presented
next is a self-test of knowledge about concepts in sig- and
nificance testing, regression, and psychometrics. After ẑY = .59z X + .34zW
responding to the questions, score your answers using
the criteria that follow the self-test. Refer to the primers 1. Interpret each unstandardized coefficient. (5)
available on the book’s website for additional informa- 2. Interpret each standardized coefficient. (5)
tion about each background topic. 3. Which variable, X or W, contributes the most to pre-
diction, and why? (3)
4. If X, W, and Y were all measured in a different sam-
SELF‑TEST
ple and new regression analyses conducted, which
estimates—unstandardized or standardized—are
Sorry, no multiple-choice items here. I believe that
preferred for comparing results for each predictor
essay questions are better able to detect strengths
over the samples, and why? (4)
or weaknesses in knowledge about statistics (Kline,
2020b). Write a concise response to each question. The 5. What is the least squares criterion in standard
number of possible points for each question is indicated ordinary least squares (OLS) regression? What are
in parentheses, and maximum possible total scores are advantages and disadvantages? (4)
listed after the questions for each topic. The scoring cri- 6. Describe R2 as an estimator of r2, the population
teria for each item are given in the next section. proportion of explained variance. (4)
7. What is a corrected (adjusted, shrunken, shrinkage-
Significance Testing corrected) R2, and what is a potential complication
in its interpretation? (6)
1. What is a sampling distribution? (4)
8. What is the problem of overfitting (overparameter-
2. For the same continuous variable, interpret SD = ization) in regression analysis? (4)
15.00 versus SE = 3.00 (respectively, standard devi-
9. Describe the effects of omitting a predictor that
ation vs. standard error of the mean). (4) covaries with the criterion above and beyond all the
3. What does p = .03 in significance testing mean? (4) included (measured) predictors (also called omit-
4. What is a in significance testing, and how does it ted-variable bias or left-out variables error). (5)
differ from p? (4) Maximum score = 40
5. Define power and b in significance testing. (5)
6. List the factors that affect power. (6) Psychometrics
7. State the combination of factors in question 6 that
leads to the highest power. (6) All questions assume classical test theory.
8. Interpret these results (i.e., each numerical value) 1. What is the difference between reliability and valid-
for two independent samples: t(48) = 2.25, p = .029 ity? (4)
for a one-tailed hypothesis. (6)
2. Give a general interpretation for rXX = .80. (4)
9. Given t = 6.75/3.00 = 2.25 for two independent samples, interpret the numerator and denominator, and describe how group size affects each value. (7)

Maximum score = 46

3. Given rXX = 1.0 for test–retest reliability, explain what is wrong with this statement: Each case obtained exactly the same score at both occasions. (3)

4. What is the relation between the (Cronbach's) alpha coefficient and the split-half reliability coefficient for scores from the same test? (6)

5. What does the alpha coefficient measure, that is, what determines its value? Comment on this statement: Alpha = .90; thus, the items are unidimensional (they measure a single factor). (5)

6. What is required to evaluate alternate-forms reliability? What is a purpose for alternate forms? What does rXX = .75 mean for immediate versus delayed administration of the forms? (6)

7. What is the standard error of measurement (SEm)? How does rXX affect its value? What does SEm = 7.50 mean? What is a practical application of SEm? (6)

8. Why is it a misconception that the value of the Pearson correlation rXY always has a range from –1.0 to 1.0? What is a risk of this myth? (4)

9. What is convergent validity versus discriminant validity, and what is the concern about common method variance when evaluating either type of validity? (4)

Maximum score = 42

SCORING CRITERIA

Award 1 point for each underlined definition or phrase mentioned in your answer. Calculate the total score for each area and then calculate the proportions of the maximum possible total score.

Significance Testing

1. A sampling distribution is the probability distribution for a statistic over many random samples all the same size (N) and drawn from the same population.

2. The term SD = 15.00 is the square root of the average squared distance between the scores and the mean,2 and it estimates the population standard deviation, σ. The quantity SE = 3.00 is the estimated standard deviation in the sampling distribution of random means each based on N = 25 cases (i.e., SE = SD/√N). Thus, the estimated square root of the average squared distance between sample means and the population mean is 3.00.

2 A common but incorrect response is that SD is the average absolute distance between the scores and the mean—see Huck (2016).

3. If the null hypothesis is true and all distributional assumptions are correct, the results in 3% of random samples would be as extreme as the sample result or even more so. Whether this result is "significant" is unknown because no criterion level for statistical significance was stated. In NFSA, there is no criterion level (i.e., p values are not dichotomized), so the terms "significant" or "not significant" do not apply.

4. The quantity α is the criterion level of statistical significance specified before the data are collected. It is the conditional probability of rejecting the null hypothesis over random samples when the null hypothesis is true, or a Type I error. The value of p is the conditional probability of a sample result or one even more extreme assuming that the null hypothesis is true, sampling is random, and all distributional assumptions hold.

5. Power is the probability of correctly rejecting the null hypothesis over random samples when the alternative hypothesis is true. Its complement, or 1 – power, is β, or the probability of failing to reject (retaining) the null hypothesis when the alternative hypothesis is true.3

3 It is just as correct to say that 1 – β = power.

6. Power is determined by sample size, the magnitude of the true effect in the population, the level of α, the reliability of the scores, whether the effect is between-subjects or within-subjects, and whether the test statistic is parametric or nonparametric.

7. Bigger samples, a larger population effect size, higher (less strict) levels of α (e.g., .05 instead of .01), higher score reliability, a within-subject effect, and a parametric test statistic are all associated with higher power.

8. The degrees of freedom are 48, which means that the total sample size is N = 50, but the size of each group is unknown.4 The mean difference is 2.25 standard errors in magnitude. Assuming random sampling from two different populations with normal distributions and equal variances, the probability of observing a mean difference as large as 2.25 standard errors or even greater is .029.

4 Equal group sizes for the independent-samples t test are not required.
9. The group mean difference is 6.75, and its value is not affected by group size. The standard error of the mean difference is 3.00, and it estimates the standard deviation in a sampling distribution of random mean differences, where samples in each pair were selected from two different populations with normal distributions and equal variances. Its value is affected by group size: Larger groups lead to smaller standard errors, keeping all else constant.
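To double-check a couple of the numerical results above (answers 2, 8, and 9), here is a minimal sketch in base R; it assumes nothing beyond the values stated in the questions:

  # Answer 2: standard error of the mean for SD = 15.00 and N = 25
  15 / sqrt(25)           # 3.00

  # Answers 8 and 9: two-tailed p value for t = 6.75/3.00 = 2.25 with df = 48
  2 * pt(-2.25, df = 48)  # approximately .029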
Regression

1. All answers for this question refer to the original (raw score) metric of each variable: A 1-point increase in X predicts an increase in Y of 2.15 points, while controlling for W; a 1-point increase in W predicts an increase in Y of 1.30 points, controlling for X; and the predicted score on Y is 2.34 when X = W = 0.

2. Given an increase in X of 1 standard deviation, the expected increase in Y is .59 standard deviations, controlling for W; and Y is predicted to increase by .34 standard deviations, given an increase in W of one standard deviation while we are controlling for X. In the standardized solution, the intercept equals zero because the mean for all variables is zero.

3. Unstandardized coefficients for X and W—respectively, 2.15 and 1.30—are not directly comparable unless the two variables have the same raw score metric. Because both variables do have the same metric in the standardized solution (i.e., mean = 0, standard deviation = 1.0), their standardized coefficients can be directly compared: The relative contribution of predictor X is largest, or about 1.75 times that of W (i.e., .59/.34), in a standard deviation scale.

4. It is the unstandardized coefficients that are generally preferred. If two independent samples differ in their variances for the same measures, the basis for the standardized estimates is not comparable over samples.5 Standardized coefficients are better for comparing the relative contributions of the predictors within the same sample—see Grace and Bollen (2005).

5 Here is an example: Suppose that the same multiple-choice test is administered to each of two different classes. Scores in each class are reported as the proportion correct, but relative to the highest raw score in each group, not the total number of items. Although the proportions in each class are standardized and have the same range (0–1.0), they are not directly comparable across the classes if the highest scores in each group are unequal. Only the raw scores (total number of items correct) would have the same meaning over classes.

5. The least squares criterion defines a unique solution for the regression coefficients, including the intercept, such that the sum of squared residuals is minimized; that is, Σ(Y – Ŷ)2 is as small as possible. This solution is the best possible solution in a particular sample, one that maximizes the value of R2. A drawback is capitalization on chance: The "best" solution in one sample will not be so in a different sample unless the covariance matrices and means are identical in both samples.

6. The statistic R2 is a positively biased estimator of ρ2 (i.e., R2 generally overestimates ρ2). Bias increases as the sample size decreases for a fixed number of predictors, or bias increases as predictors are added for a fixed number of cases. In very large samples, the expected value of R2 is essentially equal to that of ρ2.

7. An adjusted R2 estimates ρ2 by reducing the value of R2 as a function of sample size (N) and the number of predictors (k). In general, there is greater reduction in smaller samples than in larger samples just as there is greater reduction as the ratio k/N increases (e.g., predictors are added while sample size is constant). Also, there is generally greater reduction for smaller than for larger values of R2. It can happen that adjusted R2 values are < 0 (negative); if so, then it is usually treated as though the corrected R2 is zero (Cohen et al., 2003).

8. Overfitting refers to a regression equation with too many predictors relative to the sample size. It inflates R2 values to the point where they start to describe random variation more so than any real relations among variables. Overfitting also reduces generalizability of the results to another sample; that is, the results may not replicate because they so heavily reflect idiosyncratic variation in the original sample that is not repeated in other samples. For example, if N = 100 and there are k = 99 predictors, then the value of R2 must equal 1.0 because there is no error variance with so many predictors, and this is true even if the data are random numbers (a small simulation after this list makes the point concrete).

9. Omitting a relevant predictor increases standard errors for measured predictors, reduces the value of R2, and biases the intercept unless the mean on the omitted predictor is zero. If the omitted variable is uncorrelated with all measured predictors, the regression coefficients for the measured predictors are unbiased; otherwise, their coefficients will be biased (Mauro, 1990). This bias can be positive or negative, and it generally increases as correlations between omitted and included predictors increase.
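As mentioned in answer 8, a small simulation in base R (illustrative only; the data are pure random numbers) shows how overfitting forces R2 to 1.0 when the number of predictors nearly equals the number of cases:

  set.seed(1)
  y <- rnorm(100)                         # 100 cases on a random criterion
  X <- matrix(rnorm(100 * 99), 100, 99)   # 99 random predictors
  summary(lm(y ~ X))$r.squared            # equals 1.0 despite no real relations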
Psychometrics

1. Reliability concerns the precision or consistency of scores whereas validity concerns the accuracy of interpretations for those scores in a particular context of use. Reliability is a prerequisite for validity—very inconsistent scores measure nothing and thus have no meaningful interpretation—but reliability does not guarantee validity.6 That is, just because scores are consistent and repeatable does not mean that they actually measure the target construct.

6 Saying that reliability is a necessary but insufficient condition for validity means the same thing.

2. A total of 1 – .80 = .20, or 20%, of the observed variation in scores is due to the type of random error measured by the method used to derive the coefficient (e.g., test–retest, alternate-form, interrater, internal consistency). The remaining 80% is systematic, or not due to the kind of measurement error estimated by the coefficient. Some of that remaining variance could be due to sources of error not estimated by the coefficient. If so, then additional reliability studies would be needed (i.e., those that depend on a different method).

3. Assuming there are individual differences among cases, the pattern where each case obtains exactly the same score at each occasion would generate the outcome rXX = 1.0. But so would any other pattern where the magnitudes of differences between the cases are perfectly preserved over the occasions even though perhaps no case obtained the same score both times. For example, if the values in parentheses listed next represent scores from each case at the first and second occasions, or

(12, 16), (14, 18), (15, 19), (18, 22), and (20, 24)

then rXX = 1.0. Thus, test–retest coefficients measure whether individual differences are perfectly maintained over the two occasions, not just whether the scores are the same.

4. Both methods are based on a single administration of the test. Both are measures of internal consistency but at different levels: Alpha measures consistency at the item level, but the split-half coefficient measures consistency at the level of test halves, or consistency in total scores over two halves of the test corrected for overall test length. There is a single value of the alpha coefficient, but there are as many possible split-half coefficients as there are ways to split test items into two halves. If all items have the same variance, the alpha coefficient equals the average of all possible split-half coefficients.

5. Alpha measures the combined effect of test length and the interrelatedness of responses over test items, which can also be expressed as the ratio of variability in examinees' responses across all test items over the variance of test total scores. Increasing either factor just mentioned will increase the value of alpha. Coefficient alpha assumes that the items are unidimensional (homogeneous), so a high value of alpha does not confirm that hypothesis. It is possible to obtain a relatively high value of alpha even when the items are multidimensional (Schmitt, 1996; Streiner, 2003; Urbina, 2014).

6. The alternate-forms method requires at least two different complete versions of the test that should be comparable in terms of length, difficulty, and content. The forms are meant to be used interchangeably and for the same purpose. In situations where people are required to take the same test on multiple occasions, administration of alternate forms may reduce practice effects. For immediate administration, 1 – .75 = .25, or 25%, of observed variation is due to content sampling error, or idiosyncratic effects of item wording, participant familiarity with item content, or inadvertent selection of items from different domains that lead to inconsistent scores over the forms. For delayed administration, a total of 25% of the variance is due to the combined effects of content sampling error and time sampling error, but with no additional information, the two sources of error are confounded.

7. The statistic SEm estimates the standard deviation of observed scores (X) around a hypothetical true (error-free) score T, where observed scores have a random error term, or X = T + E, such that the error component of the scores, E, has an average of zero and is normally distributed. One context is a distribution of many repeated measures from a hypothetical case around their true score, and another is a distribution of observed scores for all cases in the population with the same true score. The result SEm = 7.50 says that the square root of the average squared distance between the observed scores and the true score is 7.50. As rXX increases, the value of SEm decreases, and if rXX = 1.0, then SEm = 0. The term SEm is used in constructing confidence intervals for true scores, given observed scores for individual cases.

8. The value of rXY can be perfect (i.e., –1.0 or 1.0) only if the distributions of X and Y have exactly the same shape, their association is strictly linear, and their score reliabilities are perfect (rXX = rYY = 1.0); otherwise, the maximum absolute value for rXY is < 1.0. A risk is that a researcher is unable to accurately judge the strength of association if they are unaware of the range of possible values for rXY. For example, what is thought to be a relatively low value of rXY, such as .30, may actually be close to its maximum value, given the score reliabilities for variables X and Y (Huck, 2016).

9. Convergent validity involves the hypothesis that scores from ≥ 2 tests claimed to measure the same construct should appreciably covary, and discriminant validity concerns the prediction that scores from ≥ 2 tests claimed to measure different constructs should not appreciably covary. Method variance is due to use of a particular measurement or source of information, and common method variance can inflate validity coefficients for two tests based on the same method. Thus, it is better to evaluate convergent or discriminant validity when no two tests are based on the same measurement method.

Well, how did you do? Don't fret if any of your three scores expressed as a percentage are < 50%. A score in this range is not uncommon, especially if your statistics knowledge is rusty—see Kline (2020b). Scores around 70% suggest at least a basic understanding, so your review of the corresponding primer could be more focused on areas of relative weakness. Aced it (> 90%)? Nice, but don't get cocky: You should still browse the primers to check for any missing tidbits in your knowledge.
edge.



3

Steps and Reporting

Described in this chapter are the basic steps in SEM and journal article reporting standards for SEM analyses
recently published by the American Psychological Association. The first step in SEM, model specification, is
the most important step of all. This statement is true because results from all later steps assume that the model
analyzed is reasonably correct; otherwise, there may be little meaningful interpretation of the results. The
state of reporting in SEM studies is poor—no, that’s too mild—it is actually in a state of crisis. This is because
too little information about model specification, correspondence between model and data, or possible limita-
tions to conclusions is given in many—and, I would guess, unfortunately, most—published SEM studies. With-
out complete reporting, readers are unable to judge whether the findings are trustworthy, including whether
the model actually fits the data, if a model was retained. Helping you to distinguish your own reporting so
that common mistakes and omissions are avoided is thus the major goal of this chapter.

BASIC STEPS

Six basic steps comprise most SEM analyses, and in a perfect world two additional optional steps would be carried out in every analysis. Review of the basic steps will help you to understand the relation of specification to later steps and to recognize the utmost importance of specification. The basic steps are actually iterative because problems at a particular step may require a return to an earlier step. Note that a possible outcome is the decision to retain no model, which is a legitimate conclusion to an SEM analysis. Remember that the goal in SEM is not to retain a model at any cost—it is instead to test a theory to the best of the researcher's ability. Basic steps are listed next and discussed afterward. Later chapters elaborate on specific issues at each step beyond specification for particular SEM techniques:

1. Specify the model.
2. Evaluate whether the model is identified (if not, go back to step 1).
3. Select the measures (operationalize the concepts) and collect, prepare, and screen the data.
4. Analyze the model:
   a. Evaluate the fit of the model; if it is poor, respecify the model, provided that doing so is justified by theory (skip to step 5); otherwise, retain no model (skip to step 6).
   b. Assuming that a model is retained, interpret the parameter estimates.
   c. Consider equivalent or near-equivalent models (skip to step 6).
5. Respecify the model, which is assumed to be identified (return to step 4).
6. Report the results.

Specification

In SEM, specification involves the representation of the researcher's hypotheses as a series of equations or as a model diagram (or both) that define the expected

relations among observed or unobserved variables. Depending on the theory, these relations could be specified as causal or noncausal. The latter (noncausal) refers to statistical relations that arise due to spurious associations, such as when two variables are affected by a common cause, or confounder, but are not causally related to each other. The model as a whole should thus represent all the ways the variables are expected to relate to one another.

Outcome (dependent) variables in SEM are referred to as endogenous variables, each of which has at least one presumed cause among other variables in the model. Endogenous variables usually have error terms, which represent variation that is not explained by the causes of those variables. Given the hypotheses, an endogenous variable could be specified as a cause of a different endogenous variable. Endogenous variables as just described are intervening (intermediate) variables that act as a causal link between other variables. That is, intermediates are specified as affected by causally prior variables, and in turn they affect other variables further "downstream" in the causal pathway. For reasons explained in later chapters, an intervening variable is not synonymous with a mediator—also called a mediating variable—but a mediator is always an intervening variable. Other causes of some endogenous variables in the model are independent variables, called exogenous variables in SEM, which themselves are strictly causal. This is because whatever causes exogenous variables are not represented in the model; that is, their causes are unknown as far as the model is concerned.

Whether a variable is endogenous or exogenous is determined by the theory being tested, not by analysis. This means that (1) the model is specified before the data are collected, and the model represents the total set of hypotheses to be evaluated in the analysis; and (2) the technique of SEM is not generally a method for causal discovery, in which a true causal model would be learned from the data and SEM then applied to estimate the directions and magnitudes of the causal effects represented in that model. This is not how SEM is typically used: Instead, a causal model is hypothesized, and the model is fitted to sample data assuming that it is correct. Therefore, specification is the most important step.

There are three general contexts for model specification in SEM (Jöreskog, 1993). In a strictly confirmatory application, the researcher has a single model that is either retained or rejected based on its correspondence to the data, and that's it. But this scope of model testing is so narrow that it occurs only on relatively few occasions. A second, somewhat less restrictive context involves the testing of alternative models, and it refers to situations in which ≥ 2 a priori models are available. Alternative models usually include the same observed variables but represent different patterns of hypothesized relations among them. This context requires sufficient bases to specify more than one model, such as at least two competing theories that make different predictions for the same variables. Another example is in relatively new research areas when there is uncertainty about expected patterns of relations. In this more exploratory case, the researchers might test a range of models from simpler to more complex instead of comparing models based on different theories (MacCallum, 1995). In either case, the particular model with the best acceptable fit to the data may be retained, but the rest will be rejected.

Probably the most frequent specification context in SEM is model generation, where an initial model is fitted to the data. If model fit is found to be unsatisfactory, it is respecified, usually by adding effects, or parameters, to the original model, which makes the model more complex and also generally improves its fit. But if the initial model has acceptable fit, that model might be simplified by dropping parameters, or making the model simpler, which generally worsens fit. In both scenarios just described, the respecified model is tested again with the same data. The goal of model generation is to "discover" a model with three attributes: It makes theoretical sense, it is reasonably parsimonious, and it has acceptably close correspondence to the data.

Because the initial model in model generation is not always retained, I suggest that researchers make, in the specification step, a list of possible modifications that would be justified according to theory. This means to prioritize the hypotheses, representing just the very most important ones in the model, and leave the rest for a "backup list," if needed. This is because it is often necessary in SEM to respecify models (step 5), and respecification should respect the same principles as specification. Preregistration of the analysis plan would make a strong statement that changes to the initial model were not made after examining the data (Nosek et al., 2018); that is, the basis for changing the model was a priori, not post hoc in order to get a model—at worst, any arbitrary model—to fit the data.
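To make this concrete, here is a minimal sketch of how a few directional hypotheses could be written in the model syntax of lavaan, a freely available R package for SEM (computer tools are described in Chapter 5); the variable names X, M, and Y are placeholders, not a model from this chapter:

  library(lavaan)
  model <- '
    M ~ X       # X is specified as a cause of M; M is endogenous and gets an error term by default
    Y ~ M + X   # M and X are specified as causes of Y; M is an intervening variable
  '
  # fit <- sem(model, data = mydata)   # mydata is a hypothetical data frame
  # summary(fit)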

Identification

Although graphical models are useful heuristics for organizing knowledge and representing hypotheses, they must eventually be translated into a statistical model that can be analyzed using a computer program. A statistical model in SEM is a set of simultaneous equations, where the computer must eventually derive a single set of estimates that resolve unknown values, or model parameters, over all equations, given the data.

Statistical models must generally respect certain rules or restrictions. One requirement is that of identification, or whether every model parameter can be expressed as a function of the variances, covariances, or means in a hypothetical data matrix. Meeting this requirement means it is possible to find a unique value for that parameter, given a particular estimation method and its statistical criteria. But if a parameter is not identified, any particular value might not be unique. This means that the researcher could not distinguish true versus false values for that parameter even with access to population data (Bollen, 2002).

Mathematical proof for identification generally involves solving the equations for a parameter in terms of symbols for elements in a hypothetical data matrix, and these symbols generally do not refer to specific numerical values in any given sample. For example, Dunn (2005) described a symbolic proof of the claim that least squares estimators of regression coefficients and intercepts minimize the sum of squared residuals (i.e., the least squares criterion) in standard regression analysis. Note that identification is an inherent property of a parameter in a particular model. This means that a parameter that is not identified remains so regardless of both the data matrix and the sample size (N = 100, 1,000, etc.). Attempts to analyze a model with at least one parameter that is not identified may be fruitless; therefore, a model that is not identified should be respecified (return to Step 1). Presented in Topic Box 3.1 is an intuitive account of model identification. You can use an online simultaneous equations calculator to work through the examples.1

1 https://www.symbolab.com/solver/simultaneous-equations-calculator

Structural equation models are often sufficiently complex that it is impractical for applied researchers without strong backgrounds in linear algebra to inspect model equations or parameters for identification. Instead, there are graphical methods and identification heuristics, or rules of thumb, that can determine whether some, but not all, kinds of models are identified. There are also computer tools that analyze graphical representations of some, but not all, kinds of path models for identification (Textor et al., 2021), which is both convenient and less subject to error compared with manual application of heuristics. Dealing with identification is one of the biggest challenges for newcomers to SEM (Kenny & Milan, 2012); accordingly, it is a recurrent theme throughout this book.

Suppose that a researcher specifies a structural equation model that is true to a particular theory, but the resulting model is not identified. In this case, there is little choice in SEM but to respecify the model so that it is identified, but respecifying the original model can be akin to making an intentional specification error from the perspective of the theory. There are two basic options: (1) Apply graphical identification methods to determine which parameters in the original model are identified, and then estimate only those parameters; that is, analyze what is possible and skip the rest (i.e., the underidentified parameters). This first option means that the whole model is not analyzed. (2) Respecify the model by adding variables, such as covariates or instrumental variables, so that all parameters are identified while still respecting the original theory. This second option is thus a balancing act between identification and fidelity to theory. But the main point is that identification should be evaluated while planning the study and before the data are collected. Otherwise, it may be difficult—if not impossible—to add variables to the model to fix an identification problem, if the data are already collected.
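To illustrate the graphical tools just mentioned, here is a minimal sketch with the dagitty package for R (the tool associated with Textor and colleagues); the three-variable diagram is hypothetical, not an example from this chapter:

  library(dagitty)
  # W is a common cause (confounder) of X and Y, and X is specified as a cause of Y
  g <- dagitty("dag { W -> X ; W -> Y ; X -> Y }")
  adjustmentSets(g, exposure = "X", outcome = "Y")   # returns { W }: controlling for W identifies the effect of X on Y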
Measure Selection and Data Collection

The various activities for this stage—select good measures, collect the data, and screen them—are discussed in the next chapter on data preparation and in the Psychometrics Primer available on the website for this book.

Analysis

This step involves using an SEM computer program to conduct the analysis. Computer tools for SEM, both commercial and those freely available, are described in Chapter 5. Here is what takes place during this step:

TOPIC BOX 3.1

Conceptual Explanation of Identification


Consider the following formula

a+b=6 (3.1)

where a and b are unknowns that require estimates. Because there are more unknowns than formulas in
Equation 3.1, it is impossible to find a unique set of estimates. In fact, there are an infinite number of solu-
tions for (a, b), including

(4, 2), (8, –2), and (2.5, 3.5)

and so on. Thus, Equation 3.1 is not identified; specifically, it is underidentified or underdetermined,
which here signals the excess of unknowns over the number of formulas or, respectively, 2 > 1. A similar
thing happens when a computer tries to derive unique estimates for an underidentified set of equations: It
is not possible to do so, and the attempt fails.
The next example shows that having an equal number of unknowns and formulas does not guarantee
identification. Consider the following set of formulas:

a+b=6 (3.2)
3a + 3b = 18

Although Equation 3.2 has 2 unknowns and 2 formulas, it does not have a unique solution. Actually, an
infinite number of solutions satisfy Equation 3.2, such as (4, 2), and so on. This happens due to an inherent
characteristic: The second formula in Equation 3.2 is linearly dependent on the first formula (i.e., multiply
the first formula by the constant 3), so it cannot narrow the range of solutions that satisfy the first formula.
Thus, Equation 3.2 is underidentified because the effective number of formulas is 1, not 2.
Now consider the following set of 2 formulas with 2 unknowns where the second formula is not lin-
early dependent on the first:

a+b=6 (3.3)
2a + b = 10

Equation 3.3 has a unique solution; it is (4, 2), and none other. Thus, Equation 3.3 is just-identified, just-determined, or saturated with equal numbers of unknowns and distinct (i.e., not wholly dependent) formulas, or 2 for each. That unique solution (4, 2) also perfectly reproduces the constants in Equation 3.3 (6, 10).
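You can verify this numerically with a minimal check in base R (an added illustration, not part of the original example):

  A <- matrix(c(1, 1,
                2, 1), nrow = 2, byrow = TRUE)   # coefficients of a and b in Equation 3.3
  solve(A, c(6, 10))                              # returns a = 4, b = 2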
Let’s see what happens when there are more formulas than unknowns. Consider the following set of
formulas with 3 equations and 2 unknowns:

a+b=6 (3.4)
2a + b = 10
3a + b = 12
There is no single solution that satisfies all three formulas in Equation 3.4. For example, the solution (4, 2)
works only for the first two formulas in Equation 3.4, and the solution (2, 6) works only for the last two
formulas. But there is a way to find a unique solution: Impose a statistical criterion that leads to an overi-
dentified or overdetermined set of equations with more formulas than unknowns. An example for
Equation 3.4 is the least squares criterion from regression analysis but with no constant (intercept) in the
prediction equation. Expressed in words:

Find values of a and b in Equation 3.4 that yield total scores such that the sum of squared
differences between the constants 6, 10, and 12 and these total scores is as small as possible.

Applying the criterion just stated to the estimation of a and b yields a solution that not only gives the smallest possible sum of squared differences (.67) but that also is unique, or (3, 3.33). This solution does not perfectly reproduce the constants in Equation 3.4 (6, 10, 12): The total scores generated by the solution just stated are 6.33, 9.33, and 12.33, but no other solution comes closer, given the least squares criterion applied in this example.
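The same least squares solution can be verified with a minimal sketch in base R (again, an added illustration under the assumptions just stated):

  X <- matrix(c(1, 1,
                2, 1,
                3, 1), nrow = 3, byrow = TRUE)   # coefficients of a and b in Equation 3.4
  y <- c(6, 10, 12)
  fit <- lm.fit(X, y)            # least squares with no added intercept
  fit$coefficients               # a = 3.00, b = 3.33
  drop(X %*% fit$coefficients)   # reproduced constants: 6.33, 9.33, 12.33
  sum(fit$residuals^2)           # .67, the minimized sum of squared differences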
Definitions of underidentified, just-identified, and overidentified models in SEM are elaborated on in later chapters, but some implications can be stated now: An unidentified model has at least one parameter that cannot be algebraically expressed as a unique function of the covariances, variances, or means among the observed variables. Such models may have other parameters that are identified, and these identified parameters could potentially be estimated with sample data. But a model with ≥ 1 underidentified parameter(s) cannot be analyzed as a whole unless it is respecified to reduce the excess number of unknowns (parameters) or certain constraints are imposed by the researcher in the analysis; otherwise, underidentified models are too complex to be analyzed with available data.

The unique solution for a just-identified model can perfectly reproduce the data, but that feat is not very impressive because model and data are equally complex. Any respecification of a just-identified model (within requirements about identification) will also perfectly explain the same data even though such respecifications often make opposing causal claims (i.e., the models are equivalent). Also, a just-identified model will perfectly fit any arbitrary sample data matrix for the same variables. For all these reasons, (1) just-identified models are not disconfirmable, and (2) such models usually have little scientific merit (MacCallum, 1995), but Pearl (2009) discussed some exceptions.
Overidentified models typically do not perfectly explain the data because such models are simpler than the data. The possibility for imperfect fit makes such models disconfirmable, and a question in the analysis is whether the degree of model–data discrepancy warrants rejecting the model (i.e., go to the respecification step). The principle of disconfirmability also implies a preference for models that are not highly parameterized, or have so many unknowns relative to the data that they can hardly "disagree" with those data (MacCallum, 1995). Box (1976, p. 792) put it like this: "Just as the ability to devise simple
but evocative models is the signature of the great scientist so overelaboration and overparameterization
is often the mark of mediocrity.” Other aspects of parameter identification in SEM are elaborated in later
chapters.

1. Evaluate model fit, which means determine how well the model explains the data. Perhaps more often than not, an initial model does not fit the data very well. Not if, but when this happens to you, skip the rest of this step and consider the question, "Can a respecification of the original model be justified, given relevant theory and results of prior empirical studies?"

2. Assuming the answer is "yes" and given satisfactory fit of the respecified model, then interpret the parameter estimates.

3. Next, consider equivalent or near-equivalent models. Recall that equivalent models explain the same data just as well as the researcher's preferred model but with a contradictory pattern of causal effects among the same variables. For a given model, there may be many—and in some cases infinitely many—equivalent versions. Thus, the researcher should explain why their favorite model should not be rejected in favor of equivalent ones. There may also be near-equivalent models that fit the same data nearly as well as the researcher's preferred model, but not exactly so. Near-equivalent models are often just as critical a validity threat as equivalent models, if not even more so.

When testing alternative models, it is not the model with the best relative fit to the data that would be automatically retained. This is because the best-fitting model among alternatives may not itself have acceptable fit to the data when viewed from a more absolute standard. Also, the researcher in this context should not be solely concerned with model fit. This is because the parameter estimates for a candidate model should also make theoretical sense. As noted by MacCallum (1995), a model that fits well but generates nonsensical parameter estimates is of little scientific value. This principle is also true in strictly confirmatory and model generation applications of SEM.

Respecification

A researcher usually arrives at this step because the fit of their initial model is poor. In the context of model generation, now is the time to refer to that backup list of theoretically acceptable changes I suggested when you specified the initial model. If there is no such list—or if the researcher has exhausted their backup list and yet the respecified model still has poor fit—it may be better (and more honest, too) to retain no model (Hayduk, 2014). This is because there are risks to respecifying models based more on improving fit to the data in a particular sample than on substantive considerations (MacCallum, 1995):

1. Data-driven respecification can so greatly capitalize on sampling error that any retained model is unlikely to replicate, especially when complex models are analyzed in small samples.

2. Estimates of some parameters in purely data-driven respecified models may have little or no substantive meaning.

3. Because the model is evaluated based on the same data used to modify it, any retained model should be validated with data from a new sample; that is, evidence for model validation should come from a different sample. Thus, a respecified model should not be treated as confirmed without fitting it to new data.

MacCallum (1995) noted that it would be appropriate for journal editors to reject manuscripts for SEM studies based on model generation if the concerns just listed are not addressed. This advice strikes me as reasonable in that the direct acknowledgment of study limitations is part of reporting, which is considered next.

Reporting

The last basic step is an accurate and complete description of the analysis in a written report. The fact that so many published articles that concern SEM are flawed in this regard was discussed earlier. These blatant shortcomings are surprising considering that there have been, for some time, published guidelines for reporting SEM results. For example, recent APA standards for SEM analyses (Appelbaum et al., 2018) described later in this chapter were based on earlier SEM standards for the journal Archives of Scientific Psychology by Hoyle and Isherwood (2013). McDonald and Ho (2002), Jackson et al. (2009), and Boomsma et al. (2012) discussed general principles for reporting results in SEM, and there are works on how to apply SEM in various disciplines, including travel and tourism, social and administrative pharmacy, and marketing, among others (respectively, Nunkoo et al., 2013; Schreiber, 2008; Richter et al., 2016).

OPTIONAL STEPS

Two optional steps could be added to the six basic ones just described (i.e., steps 1–6):

7. Replicate the results.
8. Apply the results.

The requirement for large samples in SEM complicates replication. This is because it may be difficult enough to collect a single sample large enough for SEM analyses, much less collect twice as many cases or so to also have a replication sample. Poor reporting of results in SEM studies may also contribute to the problem: If the sample, measures, model, and analysis are not all clearly described, replication efforts could be stymied. The fact that many, and perhaps most, claims in SEM studies are made with little evidence for their generalizability outside the original sample(s) should give us all pause, though.

A bright spot about replication in SEM concerns the evaluation of measurement invariance, or whether a set of indicators measures the same theoretical concepts over different populations, times, methods of administration, and so on. Measurement invariance can be estimated in CFA by testing whether a measurement model has similar fit over data collected from samples drawn from different populations, such as women and men, among other variations described later in this book (Chapter 22). There are now many CFA studies of measurement invariance (Dong & Dumas, 2020), and most involve replication over different samples collected by study authors.

There are fewer examples of replication over different samples collected by different researchers, or external replication. One is Tunca (2019), who conducted a pre-registered replication of an earlier SEM study by Hollebeek et al. (2014). The original analysis involved a statistical model of customer brand engagement conceptualized as three dimensions—cognitive processing, brand-related affection, and behavioral interaction—presumed to affect self–brand connection and brand usage intent (Hollebeek et al., 2014, p. 157). Tunca (2019) fitted the same model to data from a new sample. Although some of the findings were consistent across the original and replication samples, evidence for the discriminant validity of customer brand engagement dimensions was more problematic. Gender differences in consumer brand engagement were noted by Tunca (2019) as a possible explanation for differences in results compared with those reported by Hollebeek et al. (2014).

Issues concerning the external replication of SEM studies are summarized next; see Porte (2012) for more information about general issues in replication: Recall that unstandardized estimates of the same parameter should generally be compared over independent samples, not standardized results. How to simultaneously fit the same structural equation model to the data from multiple samples is described in Chapter 12. This process involves specifying equality constraints that should be imposed on unstandardized estimates for the same parameter over different samples as a way to test for group differences. This method (among others) is used in CFA studies of measurement invariance, and it could also be applied by a researcher who collects new data to test the same model from a prior study by a different researcher. There are formal measures of the similarity of factor solutions over different samples, such as the Tucker coefficient of congruence, a standardized measure of proportionality in factor loadings over different groups, that can be applied in CFA—see Moss et al. (2015) for an example.
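To make the idea of equality constraints concrete, here is a minimal sketch of a multiple-group CFA in lavaan, using that package's built-in Holzinger–Swineford example and its school grouping variable (a stand-in illustration, not one of the studies cited above):

  library(lavaan)
  model <- ' visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9 '
  fit_free  <- cfa(model, data = HolzingerSwineford1939, group = "school")
  fit_equal <- cfa(model, data = HolzingerSwineford1939, group = "school",
                   group.equal = c("loadings", "intercepts"))
  anova(fit_free, fit_equal)   # chi-square difference test of the equality constraints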
There are hundreds, probably thousands, of studies where SEM has been applied to test theories (Thelwall & Wilson, 2016). One has to look much harder to find practical applications of SEM, but such studies exist. For example, the latest revision of the Stanford–Binet Intelligence Scales, Fifth Edition (Roid, 2003) was influenced by results of factor analytic studies, including CFA findings, about its internal structure (DiStefano & Dombrowski, 2006). Rebueno et al. (2017) used the technique of path analysis to evaluate the attributes of a pre-graduate clinical training program in a randomly selected sample of nursing students. Verdam et al. (2017) described effect sizes that can be computed in SEM analyses on different types of response shifts in reported outcomes over time among patients in treatment. Effect sizes are derived from decomposing observed change into elements that include three different types of response shift—(1) change in patients' internal standards (recalibration), (2) change in patients' values about the importance of particular outcomes (reprioritization), and (3) change in the meaning of the outcome to patients (reconceptualization)—versus estimated true change in the underlying construct.

REPORTING STANDARDS

Reporting standards are envisioned as sets of best practices established by authority figures, such as experienced researchers, journal editors (who also tend to be experienced researchers), and methodologists, presented with a rationale that also allows for discretion in their application (Cooper, 2011). They are not intended as requirements, or mandatory practices that must be adopted by all, but they are also not merely recommendations, or endorsements of good reporting practices with no obligation to actually carry them out. Instead, the goal of reporting standards is to improve the quality of journal articles through clear, accurate, and transparent reporting, while simultaneously not interfering too much with researcher creativity and the normal workings of science (Appelbaum et al., 2018). The hope is that better reporting will help readers to more clearly understand the bases for claims made in articles and facilitate study replication.

A drawback to reporting standards is that there is not always consensus about exactly what to report in certain research areas. This is true in SEM in which there is longstanding debate about the adequacy of various statistics about model fit, especially about the usefulness of significance tests of global model fit versus continuous measures of fit (Vernon & Eysenck, 2007). When experts in the field disagree, no reporting standards can resolve the issue. The approach taken in this book is to (1) address directly the areas of disagreement beginning especially in Chapter 10 about model testing and (2) offer what I believe are principled arguments for best practices, including reporting. I also believe that the existence of reporting standards for SEM helps to focus the conversation among both experienced practitioners and students.

The original APA journal article reporting standards (JARS) did not include SEM analyses (APA Publications and Communications Board Working Group on Journal Article Reporting Standards, 2008). Hoyle and Isherwood (2013) developed SEM reporting standards for the journal Archives of Scientific Psychology in the form of questions to be addressed by authors or reviewers. Their guidelines were reworded as statements, slightly edited for consistency, and then incorporated into the revised APA standards, JARS-Quant for quantitative studies (Appelbaum et al., 2018). Hundreds of reporting standards for health-related research are described on the Enhancing the Quality and Transparency of Health Research (EQUATOR) network, which offers a searchable database.2

2 https://www.equator-network.org/

Summarized in Table 3.1 are the APA JARS-Quant standards for SEM analyses. Authors should explain the theoretical bases for model specification, including directionalities of presumed effects. The analysis context—for example, whether an initial model will be respecified (model generation)—should also be stated. Describe the model in complete detail, including its parameters, associations between observed and latent variables, and whether means were analyzed along with covariances. Verify that the model is identified, and explicitly tabulate the model degrees of freedom (dfM). How to do so for different kinds of models is described in later chapters, but this basic reporting is problematic in too many studies: Cortina et al. (2017) reviewed nearly 800 models described in 75 published SEM studies. They found that the information needed to calculate dfM was reported about 75% of the time, but reported dfM was matched by the article information only 62% of the time. Shah and Goldstein (2006) found a similar problem: In 31 out of a total of 143 reviewed articles detailing SEM studies, the model described in the text failed to match the statistical results presented in tables or text. Such discrepancies raise questions about the model described versus the model actually analyzed.
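For a model in which only covariances are analyzed (no mean structure), the tally that this standard asks for can be done in a few lines; the numbers below are hypothetical, not from a model in this chapter:

  v <- 6                            # number of observed variables
  observations <- v * (v + 1) / 2   # 21 unique variances and covariances
  free_parameters <- 13             # free parameters counted from the model diagram (hypothetical)
  observations - free_parameters    # dfM = 8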
State whether samples are empirical or generated, such as in a computer simulation study, and describe any loss of samples due to technical problems, such as inadmissible solutions (i.e., illogical estimates) or failure of iterative estimation to converge—see Table 3.1. Give the rationale for the sample size, including the parameters for a power analysis. Power analysis and other methods to specify target sample size in SEM are described in Chapter 10. Describe all measures in detail, including psychometrics, for scores from psychological tests analyzed in the researcher's sample.

Describe the data screening in complete detail, including (1) the extent, patterns, and assumptions about missing data; (2) the status of distributional assumptions; (3) whether any other problems, such as outliers, were detected; and (4) corrective actions taken, if any (Table 3.1). A strong demonstration of transparency occurs when both the syntax for the analysis and the data file are accessible to other researchers. Access to original data is part of the open-science movement in which all components of research, including data, methods, peer review, and publications, are freely available. Increasing numbers of journals encourage the sharing of the data, and some journals award badges to authors for open-science practices (Chambers, 2018).

TABLE 3.1. Summary of JARS-Quant Reporting Standards for Structural Equation
Modeling Analyses
Section Topic and description
Abstract If a model is retained, report values of at least two global fit statistics with brief comment on local fit, and
state whether the initial model was respecified.
Introduction Justify directionality assumptions (e.g., X → Y instead of the reverse).
State whether respecification is planned, if the initial model is rejected.
Specification Describe context (i.e., model generation, model comparison, strictly confirmatory).
Give a full account for model specification, including latent variables and indicators (i.e., measurement
model), status of parameters (i.e., free, fixed, or constrained), or how the model or analysis deals with
score non-independence.
Describe the mean structure, if means are analyzed.
Justify error correlations in the model, if present.
Verify that tested models are identified.
Explicitly tabulate numbers of observations, free parameters, and dfM.
Methods State whether data were simulated or collected from actual units of study (i.e., cases).
Describe measures (e.g., taken from a single vs. multiple questionnaires), specify whether they are items or
scales (i.e., total scores), and report psychometrics.
Outline how sample size was determined (e.g., power analysis, accuracy in parameter estimation, resource
limitations), give details for power analysis (e.g., H0, H1, α, effect size).
Indicate whether results or samples were discarded due to nonconvergence or inadmissible solutions.
Data screening Describe data loss patterns (i.e., MCAR, MAR, MNAR) and corrective actions (e.g., single or multiple
and summary imputation, FIML).
State distributional assumptions (e.g., multivariate normality), report evidence that assumptions were met or
describe actions taken to address violations.
Report sufficient summary statistics that allow secondary analysis (e.g., covariance matrix), or make the raw
data file available.
Estimation State the computer tool (including version) used in the analysis and the estimation method used, or make the
syntax available.
Describe default criteria that were changed to obtain a converged solution.
Report whether the solution is admissible (e.g., negative error variances) and describe corrective actions.
Model fit Interpret values of global fit statistics according to evidence-based criteria.
Describe local fit, or the residuals (e.g., standardized, normalized, correlation).
Explain the decision to retain or reject any model.
State criteria for evaluating parameter estimates, including whether results are compared over groups.
If comparing alternative models, state criteria for selecting a preferred model.
Respecification Indicate whether respecifications were a priori or post hoc (i.e., arrived at before or after examining the data).
Give a theoretical rationale for any parameters fixed or freed in respecification.
Estimates Report unstandardized and standardized estimates with standard errors for all free parameters, if possible.
Report standardized and unstandardized indirect effects with standard errors, outline analysis strategy.
Report estimates of interactions with standard errors, describe follow-up analyses.
Discussion Summarize changes to the initial model and rationale, if a model is retained.
Justify the preference for retained models over equivalent models or near equivalent models that explain the
same data just as well or nearly so.

Note. JARS-Quant, journal article reporting standards for quantitative studies; MCAR, missing completely at random; MAR, missing at random; MNAR, missing not at random; FIML, full information maximum likelihood; dfM, model degrees of freedom. Underlining emphasizes the most essential parts for each standard. From "Journal Article Reporting Standards for Quantitative Research in Psychology: The APA Publications and Communications Board Task Force Report," by M. Appelbaum, H. Cooper, R. B. Kline, E. Mayo-Wilson, A. M. Nezu, and S. M. Rao, 2018, American Psychologist, 73(1), 3–25 (https://doi.org/10.1037/amp0000191). Copyright © 2018 American Psychological Association. All rights reserved.

A fact of life in SEM is that computer analyses can, and sometimes do, go wrong. How to deal with various technical problems in the analysis is addressed throughout the book, but the nature of such problems and corrective steps taken to resolve them should be reported—see Table 3.1. Reassure readers that the solution is admissible, that is, the estimates do not include anomalous results that would invalidate the solution.

Model fit should be described at two different levels, global and local. Global fit concerns the overall or average correspondence between the model and the sample data matrix. Just as averages do not indicate variability, it can and does happen in SEM that models with apparently satisfactory global fit can have problematic local fit, which is measured by residuals computed for every pair of observed variables in the model. Residuals are the differences between observed (sample) and predicted (by the model) covariances or correlations, and as absolute residuals increase, local fit becomes worse for those pairs of variables. Evidence for problematic local fit goes against retaining the model.

The analogy in regression analysis is the difference between R2, or overall explanatory power (global fit), and regression residuals, or differences between observed and predicted scores at the case level, not at the variable level as in SEM. Aberrant patterns of regression residuals, such as severely nonnormal distributions or heteroscedasticity, indicate a problem even if the value of R2 is reasonably high—see Fox (2020). Just as reports about regression analyses without description of the residuals are incomplete, so too are reports in SEM in which only global fit is described. This is a common shortcoming in many, if not most, published SEM studies. For example, only 17 out of 144, or 12%, of SEM studies published in organizational or management research journals reviewed by Zhang et al. (2021) contained information about local fit. The remedy is straightforward: Present a matrix of residuals in the article, or at least describe the pattern of residuals in the article and make the residuals available in supplementary materials. Without the details about local fit, readers cannot understand the whole picture about model–data correspondence.
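To show what reporting both levels of fit can look like in practice, here is a minimal sketch with the lavaan package, using its built-in Holzinger–Swineford example (a stand-in model, not one analyzed in this chapter):

  library(lavaan)
  model <- ' visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9 '
  fit <- cfa(model, data = HolzingerSwineford1939)
  fitMeasures(fit, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))   # global fit
  residuals(fit, type = "cor")$cov   # correlation residuals (local fit)
  lavResiduals(fit)$cov.z            # standardized residuals (local fit)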
Explain the basis for a decision to either reject or retain a model. That is, state the statistical criteria, degree of correspondence of empirical estimates with theoretical predictions, or other reasons for the decision (Table 3.1). If a model is retained, a common mistake is to report only the standardized solution. Although standardized estimates are easier to interpret, they are generally less useful for comparing results for the same model and variables but fitted to data in a different sample, especially if samples, such as men versus women, differ in their variances (Grace & Bollen, 2005). Thus, also report the unstandardized solution with standard errors. If analyzing indirect effects or interaction effects, then describe the analysis strategy, including follow-up analyses. Another common mistake is the failure to acknowledge the existence of equivalent versions of any retained model with an explanation about why the researcher's version would be preferred.

As a reviewer of SEM studies submitted to research journals, I see many examples of deficient, poor reporting, so much so that over the years I developed a set of boilerplates, or standardized text, that I can adapt to a particular manuscript. Presented in Topic Box 3.2 are examples of some of my review templates. I hope that you can learn a few things about better, more complete reporting from these cautionary tales on inadequate reporting.

REPORTING EXAMPLE

Several colleagues and I wrote an SEM study with the (at the time) upcoming APA reporting standards for SEM in mind. The study, data, research aim, and model are introduced next. Do not worry at this point if you do not understand everything about the model or its specification because we focus on issues related to reporting next. Analysis results for this example are described in Chapter 15, so you'll get to see this model again along with the data described here.

In a sample of 193 patients with first-episode psychosis (FEP) (schizophrenia, schizoaffective disorder) or recurrent multiple-episode psychosis (MEP), Sauvé et al. (2019) administered the CogState Schizophrenia Battery (CSB) (Pietrzak et al., 2009), a computerized test of cognitive ability intended to measure domains affected by psychosis (e.g., working memory, attention-vigilance, social reasoning), and the Scale to Assess Unawareness of Mental Disorder (SUMD) (Amador et al., 1993), an interview-based measure of the relative absence of understanding about one's own mental condition, also called lack of illness insight or anosognosia. Higher scores on the SUMD indicate greater

TOPIC BOX 3.2

Review Boilerplates for Common Problems in SEM Manuscripts


I have reviewed hundreds of manuscripts of SEM studies for about 50 different journals or publishers. After
seeing repetitions of similar problems in reporting, I wrote a set of more-or-less standardized comments
that could be adapted to the specifics of a particular work. These boilerplates cover incomplete reporting
about model fit, the interpretation of values of global fit statistics in ways that are not supported by recent
evidence, or the lack of a rationale for retaining a model (see Table 3.1), among other problems. Some of
these comments refer to global fit statistics that are defined in Chapter 10 on model testing and indexing,
but issues in their misuse are raised next:

1. Failure to describe local fit. It is inappropriate to base model evaluation solely on values of global
fit statistics without also considering local fit, or the residuals, which provide additional details
about model fit. It can and does happen that models can generate favorable values of global fit
statistics even though there is evidence for appreciable model–data discrepancy at the level of the
residuals. Recent reporting standards for SEM call on authors to describe both global and local
fit—see Appelbaum et al. (2018), Greiff and Heene (2017), and Vernon and Eysenck (2007) for
more information.
2. Failure to report unstandardized estimates. It is a mistake to omit the reporting of unstandardized parameter estimates, which are generally the basis for comparing results from the same model and variables over different samples—see Grace and Bollen (2005). Please also report the unstandardized estimates with their standard errors.
3. Incorrect model degrees of freedom. The reported value of df M is impossible, given the variables
and effects represented in the model diagram. I suspect something is wrong here, ranging from
a simple typographical error to something more serious, including the analysis of a model that
differs from the model described in the manuscript. Please explicitly tally the numbers of observa-
tions, freely estimated model parameters, and df M. Next, resolve any discrepancies in the presen-
tation.
4. Ignored failed significance tests for the whole model. The model failed the chi-square test, which signals covariance evidence against the model, but this outcome is discounted by the incorrect statement that the chi-square test is "biased" by sample size. There are two problems here: The sample size in this study is not large for SEM analyses, and the chi-square test is affected by sample size only when the model is incorrect. This problem is compounded by the failure to describe the residuals, which could indicate poor local fit for certain pairs of variables—see Hayduk (2014) and Ropovik (2015).

5. No rationale for the decision to retain a model. Why the respecified model is retained was not explained. Instead, the author(s) report(s) values of global fit statistics with no interpretation. That is, how do the author(s) get from these results in the text of the Results section, or

χ²(272) = 587.52, p < .01, CFI = .97, TLI = .946, RMSEA = .057

to the unjustified conclusion that the model has "acceptable" or "excellent" fit? There is no connection, argument, or logic that links the two. It is possible that the model does not actually fit the data at the level of the residuals but, as mentioned, the reader is told nothing about this area.


The main goal of Sauvé et al. (2019) was to estimate the association between level of general cognitive ability and symptom unawareness where both concepts were modeled as latent variables. The context is model generation.

Presented in Figure 3.1 is the final model retained by Sauvé et al. (2019). A total of six tasks from the CSB are represented as indicators of a general cognitive ability factor. Each measured variable is represented in the figure with a rectangle, a standard graphical convention in SEM. The lines with single arrowheads that point from the cognitive ability factor to the indicators signal the assumption of reflective measurement, or the hypothesis that general ability as a theoretical variable affects task performance. The factor as a theoretical—and thus unmeasured—variable is represented in the figure with an oval, another oft-seen graphical symbol. Each indicator has an error term designated in the figure by a line with a single arrowhead oriented at a 45° angle that points to that indicator, and the error term represents the unique variation in each indicator, such as that due to measurement error, not explained by their common ability factor.

The errors for two pairs of indicators are connected by the symbol for a covariation, or curved lines with arrowheads at each end. These symbols represent correlated errors, or the hypothesis that the corresponding indicators share something unique to that pair. For instance, the International Shopping List Test is a list of items read aloud to examinees. Two scores from this task were analyzed, immediate versus delayed recall of list items or, respectively, ISL and ISLR in Figure 3.1. Because they come from the same task, indicators ISL and ISLR may covary even after controlling for their common ability factor. The same could be true for the two scores from the Groton Maze Chase Test, GML and GMR (see Figure 3.1).

The SUMD is represented in Figure 3.1 as the single indicator of a symptom unawareness factor. This specification acknowledges that (1) an observed variable is not identical to a theoretical concept and (2) scores on the SUMD are assumed to be affected by measurement error (i.e., the score reliability coefficient is < 1.0). The constant in the figure that appears next to the symbol for the error term of the SUMD, .360, is a value related to empirical reliability coefficients reported in the literature for this measure; how the particular value of ".360" was computed for the SUMD is explained in Chapter 15. The line with the single arrowhead that connects the factors in the figure represents the presumed causal effect of cognitive ability on symptom unawareness.

FIGURE 3.1. Final model of cognitive capacity and symptom unawareness. ISL, International Shopping List; ISLR, International Shopping List Immediate Recall; GML, Groton Maze Learning task; GMR, Groton Maze Learning task Delayed Recall; OCL, One-Card Learning task; CPAL, Continuous Paired Associate Learning task; SUMD, Scale to Assess Unawareness of Mental Disorder. From "Cognitive Capacity Similarly Predicts Insight into Symptoms in First- and Multiple-Episode Psychosis," by G. Sauvé et al., 2019, Schizophrenia Research, 206, p. 239. Copyright © 2019 Elsevier B.V. Adapted with permission.


Its value is estimated controlling for measurement error not only in the SUMD variable but also in all six indicators of the cognitive ability factor. The outcome variable in this analysis, the symptom unawareness factor, itself has an error term called a disturbance that represents variation not explained by cognitive ability—see Figure 3.1.

The features of reporting in Sauvé et al. (2019) that respected SEM reporting standards are listed next. Some of these details are reported in the main text; others are available in the supplementary materials for the article:

1. The methodology for preparing the data for analysis under the assumption of normality is described, including normalizing transformations applied to scores from individual cognitive tasks. Distributional characteristics are verified.

2. The rules by which the initial single-factor measurement model for cognitive capacity was respecified are stated. Briefly, each indicator was required to share at least 30% of its variance with the common factor, and it was predicted that error terms for two scores from the same task, such as immediate versus delayed recall of the same stimuli (e.g., ISL, ISLR in Figure 3.1), might covary.

3. The syntax, data, and complete output files for all SEM analyses in lavaan are available to readers. Thus, readers can reproduce the original analyses or fit models to the same data not considered by Sauvé et al. (2019).

4. The fact that all solutions are admissible is reported. Model fit is described at both the global and local levels. Full matrices of residuals are available so readers have full access to complete information about fit.

5. Unstandardized estimates with standard errors are reported for all model parameters, and standardized estimates are reported, too.

6. An equivalent model is directly acknowledged.

Other aspects of reporting by Sauvé et al. (2019) are not ideal. For example, we did not preregister the analysis plan. The model degrees of freedom are not explicitly tabulated (they are df M = 12 for Figure 3.1), and the sample size was not determined based on a priori considerations, such as power analysis. The sample size is relatively small for SEM, N = 193, but the population base rate of psychotic disorders is relatively low, about 1–2% or so, so even this sample size is reasonably large among comparable studies of insight in psychotic disorders (e.g., Phahladira et al., 2019). Briefly, the results indicated that cognitive capacity explains about 6% of the variation in symptom unawareness and that, as expected, lower levels of cognitive ability predict less illness awareness. Follow-up regression analyses indicated that the relation between cognitive capacity and symptom unawareness did not vary appreciably over patients with first-time versus multiple-episode psychosis.

SUMMARY

Specification is the most important step in SEM because the quality of the ideas that underlie the model affects all subsequent steps, including the analysis phase. Models should be evaluated for identification, or for whether model parameters can be expressed as functions of the elements in a data matrix, in the study planning stage. This is because adding variables to the model is one way to rectify underidentified causal effects, but doing so may be difficult if the data are already collected. The context of model generation is the most common in SEM studies, and respecification should be guided by the same principles as the specification of the original model. This means that changes to an initial model require substantive justification; otherwise, a risk is that respecification will generate a model that fits the data in a particular sample but does not replicate. The state of reporting in too many SEM studies is deficient such that readers are not given enough information about model specification, respecification, or fit to the data to fully comprehend the findings. For example, both global fit and local fit, or the residuals, should be described in written reports. The existence of reporting standards for SEM analyses should help authors to write better, more complete summaries of SEM analyses. The next chapter covers preparation of the data.

LEARN MORE

Recommendations for reporting SEM results as part of JARS-Quant reporting standards by the APA are described in Appelbaum et al. (2018), the classic work by MacCallum (1995) about model specification offers helpful advice, and


Schreiber (2008) describes the information that should be reported in SEM analyses in clear, accessible terms.

Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board Task Force report. American Psychologist, 73(1), 3–25.

MacCallum, R. C. (1995). Model specification: Procedures, strategies, and related issues. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 16–36). Sage.

Schreiber, J. B. (2008). Core reporting practices in structural equation modeling. Research in Social and Administrative Pharmacy, 4(2), 83–97.



4

Data Preparation

Besides good ideas (hypotheses), effective SEM practice requires careful data preparation and screening.
Just as in other types of multivariate analyses, data preparation is critical in SEM for three reasons: First,
it is easy to make a mistake entering data into computer files. Second, some estimation methods in SEM
make specific distributional assumptions about the data. These assumptions must be taken seriously because
violation of them could result in bias. Third, data-related problems can make SEM computer tools fail to
yield a logical solution. A researcher who has not carefully screened the data could mistakenly believe
that the model is at fault, and confusion ensues. How to deal with missing data is a problem in many, if not
most, empirical studies. Classical (i.e., obsolete) methods for handling incomplete data sets are contrasted
with modern techniques such as multiple imputation, for which a tutorial is offered for readers who are less
familiar with the method. Other data screening issues addressed in this chapter include outliers, extreme
collinearity, and distributional characteristics. We begin with a review of basic options for inputting data to
SEM computer tools.

FORMS OF THE INPUT DATA

Most primary researchers—those who conduct original studies—analyze raw data files, but sometimes the raw data themselves are not required. For example, when analyzing continuous outcomes with methods that assume normal distributions, a matrix of summary statistics can be input to an SEM computer tool instead of the raw data. In fact, you can replicate most of the analyses described in this book using the matrix summaries that accompany them. This is a great way to learn because you can make mistakes using someone's data before analyzing your own. Some journal articles about the results of SEM analyses contain enough information, such as correlations, standard deviations, and means, to create a matrix summary, which can then be submitted to a computer program for analysis. Thus, readers of these works can, with no access to the raw data, replicate the original analyses or estimate alternative models not considered in the original work (i.e., conduct a secondary analysis).

Most SEM computer tools accept either a raw data file or a matrix summary. If raw data are analyzed, the computer will create its own matrix, which is then analyzed. Consider the issues listed next when choosing between raw data and a matrix summary as program input:

1. Certain kinds of analyses require raw data files. One situation is when distributions for continuous outcomes are severely nonnormal and the data are analyzed with a method that assumes normality, but standard errors and model test statistics are computed that adjust for nonnormality. A second situation involves missing data. Both classical and modern techniques for missing data require the analysis of raw data files. The

third case is when outcome variables are categorical, either unordered (nominal) or ordered (ordinal). Such outcomes can be analyzed in SEM, but raw data files are generally needed.

2. Matrix input offers a potential economy over raw data files. Suppose that 1,000 cases are measured on 10 continuous variables. The data file may be 1,000 lines (or more) in length, but a matrix summary for the same data might be only 10 lines long.

3. Sometimes a researcher might "make up" a data matrix using theory or results from a meta-analysis, so there were never raw data, only a matrix summary. Submitting a made-up data matrix to an SEM computer tool is a way to diagnose certain kinds of technical problems that can occur when analyzing a complex model. This idea is elaborated later in the book.

If means are not analyzed, there are two basic summaries of raw data for continuous variables—covariance matrices and correlation matrices with standard deviations. For instance, listed in the top part of Table 4.1 are scores on three continuous variables for five cases. Presented in the lower part of the table are the two summary matrices just mentioned in lower diagonal form, where only unique values are reported in the lower left-hand side of the matrix. Both the covariance matrix and the correlation matrix with standard deviations in the table encapsulate all covariance information in the raw data. Computer tools for SEM generally accept for input lower diagonal matrices as alternatives to full ones, with redundant entries above and below the diagonal, and can "assemble" a covariance matrix given just the correlations and standard deviations. Four-decimal accuracy is recommended for matrix input to minimize rounding error. Exercise 1 asks you to reproduce the covariance matrix from the correlations and standard deviations in Table 4.1.

TABLE 4.1. Raw Data and Matrix Summaries in Lower Diagonal Form

                 X          W          Y
Raw scores
                 3         65         24
                 8         50         20
                10         40         22
                15         70         32
                19         75         27
Covariances
           38.5000
           42.5000   212.5000
           17.5000    51.2500    22.0000
Correlations, standard deviations
            1.0000
             .4699     1.0000
             .6013      .7496     1.0000
            6.2048    14.5774     4.6904

Note. MX = 11.0000, MW = 60.0000, MY = 25.0000.

It may be problematic to submit just a correlation matrix with no standard deviations for analysis, specify that all standard deviations equal 1.0 (which standardizes the variables), or convert raw scores to normal deviates (z scores) and then submit the data file of standardized scores. This is because most SEM estimation methods assume the analysis of unstandardized variables. If the variables are standardized, then the results can be incorrect, including wrong values for standard errors or model test statistics. Some SEM computer tools issue warning messages or terminate the run if the researcher submits just a correlation matrix for analysis. There are special methods for analyzing correlation matrices without standard deviations, but they are not available in all software. Thus, it is generally safer to analyze a covariance matrix (or a correlation matrix with standard deviations). The pitfalls of analyzing correlation versus covariance matrices are the reason why you must state in written reports the specific kind of data matrix analyzed and the estimation method used.

Matrix summaries of raw data must consist of covariances and means whenever means are analyzed. For example, either the correlation matrix with standard deviations or the covariance matrix in Table 4.1 would be submitted with an extra row for the means of all three variables, if means are analyzed. Even if your analyses do not concern means, you should nevertheless report the means of all continuous variables. You may not be interested in analyzing means, but someone else might be. Always report sufficient descriptive statistics (including means) so others can reproduce your results. If the analysis requires raw data, make those data available to readers, such as including the data file in the supplemental materials for a journal article. Some journals host data files for articles in databases that are accessible to all readers.
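
To illustrate how a covariance matrix can be "assembled" from correlations and standard deviations, the following lines of R syntax sketch one solution to Exercise 1 using only base R (no packages are assumed); they rebuild the covariance matrix in Table 4.1 from the lower part of that table:

R <- matrix(c(1.0000, .4699, .6013,
              .4699, 1.0000, .7496,
              .6013, .7496, 1.0000),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("X", "W", "Y"), c("X", "W", "Y")))
SD <- c(6.2048, 14.5774, 4.6904)
S <- diag(SD) %*% R %*% diag(SD)   # covariance matrix = D R D, where D = diag(SDs)
dimnames(S) <- dimnames(R)
round(S, 4)   # matches the covariances in Table 4.1 within rounding error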


POSITIVE DEFINITENESS

The data matrix that you submit—or the one calculated by the computer from your raw data—should be positive definite (PD), which is required for most estimation methods. A data matrix that lacks this characteristic is nonpositive definite (NPD); therefore, attempts to analyze such a matrix would probably fail. A PD data matrix has the properties summarized next:

1. The matrix is nonsingular, which means that it has an inverse. A matrix with no inverse is singular.

2. All eigenvalues for the matrix are positive (> 0), which says that the matrix determinant is also positive.

3. There are no out-of-bounds correlations or covariances.

In most kinds of multivariate analyses (SEM included), the computer needs to derive the inverse of the data matrix as part of its linear algebra operations. If the matrix is singular, these operations fail. If v equals the number of observed variables, the computer should also be able to generate v linear combinations of the variables that (1) are pairwise uncorrelated (orthogonal) and (2) reproduce all the covariance information in the original data matrix. The v weighted combinations are called eigenvectors, and the amount of variance explained in the original data matrix by each eigenvector is its corresponding eigenvalue.1 It is impossible to derive more eigenvectors than the number of observed variables because no information remains once all v eigenvectors are extracted from the data matrix.

If all v eigenvalues are > 0, the matrix determinant will be positive, too. The determinant is the serial product (the first times the second times the third, and so on) of the eigenvalues. If all eigenvalues are positive, the determinant is a kind of matrix variance, or the volume of the multivariate space "mapped" by the observed variables.2 If any eigenvalue equals zero, then (1) the data matrix has no inverse (it is singular), (2) the determinant equals zero, and (3) there is some pattern of perfect collinearity that involves two variables (e.g., rXY = 1.0) or ≥ 3 variables in a more complex pattern (e.g., RY•XW = 1.0). Perfect collinearity means that the denominators of some calculations will be zero, which results in "illegal" (undefined) fractions in computer analysis (estimation fails). Near-perfect collinearity, such as rXY = .97, manifested as near-zero eigenvalues or determinants, can cause the same problem.

Negative eigenvalues (< 0) may indicate a data matrix element—a correlation or covariance—that is out of bounds. It is mathematically impossible for such an element to occur if all elements were calculated from the same cases with no missing data. For example, the value of rXY, the Pearson correlation between two continuous variables, is limited by the correlations between these two variables and a third variable W. It must fall within the range defined next:

(rXW × rWY) ± √[(1 − r²XW)(1 − r²WY)]    (4.1)

For example, given a value of rXW = .60 and of rWY = .40, the value of rXY must fall within the range

.24 ± .73, or from –.49 to .97

Any other value for rXY would be out of bounds. Equation 4.1 specifies a kind of triangle inequality for values of correlations among three variables. In a geometric triangle, the length of a given side must be less than the sum of the lengths of the other two sides but greater than the difference between the lengths of the two sides.

In a PD data matrix, the maximum absolute value of covXY, the covariance between two continuous variables X and Y, must respect the limits defined next:

max | covXY | ≤ √(s²X × s²Y)    (4.2)

where s²X and s²Y are, respectively, the sample variances. In other words, the maximum absolute value for the covariance between two variables is less than or equal to the square root of the product of their variances. For example, given

covXY = 13.00, s²X = 12.00, and s²Y = 10.00

the covariance between X and Y is out of bounds because

13.00 > √(12.00 × 10.00) = 10.95

which violates Equation 4.2. The value of rXY is also out of bounds because it equals 1.19, given these variances and covariance. Exercise 2 asks you to verify this fact.
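
As a quick check on these worked examples, the following R lines (base R only) compute the bounds in Equations 4.1 and 4.2 and the implied out-of-bounds correlation:

r_XW <- .60; r_WY <- .40
half <- sqrt((1 - r_XW^2) * (1 - r_WY^2))
c(lower = r_XW * r_WY - half,
  upper = r_XW * r_WY + half)    # -.49 to .97, per Equation 4.1
cov_XY <- 13; var_X <- 12; var_Y <- 10
sqrt(var_X * var_Y)              # 10.95, the largest possible |covariance| (Equation 4.2)
cov_XY / sqrt(var_X * var_Y)     # implied rXY = 1.19, out of bounds (Exercise 2)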


Because the matrix determinant is the serial product of the eigenvalues, the determinant will be negative if some odd number of eigenvalues (1 or 3 or 5, etc.) is negative. A matrix with a negative determinant may have an inverse, but the whole matrix is nevertheless NPD, perhaps due to out-of-bounds correlations or covariances. See Topic Box 4.1 for more information about causes of NPD data matrices and possible solutions.

Before analyzing either raw data or a matrix summary, the original data file should be screened for the problems considered next. Some of these difficulties are causes of NPD data matrices, and most are also concerns in data screening for other kinds of multivariate statistical analyses—see Tabachnick and Fidell (2019). If scores from psychological tests are to be analyzed, the reliabilities of those scores should be estimated in the researcher's samples—see the Psychometrics Primer on this book's website.

1 In principal components analysis, a type of exploratory factor analysis where the factors are linear combinations of observed variables (principal components), eigenvectors correspond to principal components and eigenvalues are the variances of those components.
2 There is a good illustration at https://en.wikipedia.org/wiki/Determinant

MISSING DATA

The topic of how to deal with missing observations is complex, and entire books are devoted to it (Enders, 2010; Graham, 2012; Rubin & Little, 2020). There are also articles or chapters about dealing with incomplete data in SEM (Enders, 2013, 2023; Jia & Wu, 2019), which is fortunate because it is not possible here to give a comprehensive account. The goal instead is to describe obsolete versus modern options and explain their relevance in SEM.

Ideally, researchers would always work with complete data sets; otherwise, prevention is the best strategy. For example, questionnaire items that are clear and unambiguous may prevent missing responses, and completed forms should be reviewed for missing responses before participants submit a computer-administered survey. Little et al. (2012) offered suggestions for reducing missing data in clinical trials, including routine follow-up after treatment discontinuation, allowing for flexible treatment that accommodates side effects or differences in efficacy, offers of monetary value or other incentives for completeness of records, and targeting an underserved population, which provides an incentive to remain in the study.

In the real world, missing values occur in many, if not most, data sets, despite best efforts at prevention. A few missing values, such as < 5% in the total data set, may be of little concern. This is because selection among alternative methods to deal with missing data tends to make little difference when rates of missing data are low (Schafer, 1999). Higher rates of data loss present greater challenges, especially if the data loss mechanism is not random (or at least predictable). There is no critical rate of missing data that signals a severe problem, but as that rate exceeds about 10% or so, there is an increasing likelihood of bias. In this case, the choice of method can affect the results. Best practices involve the steps listed next (Johnson & Young, 2011; Little et al., 2012):

1. Report the extent of missing data with a tabular summary or diagram, such as a participant flow. Describe procedures used to prevent or mitigate data loss.

2. Diagnose missing data patterns, which are related to assumptions about data loss mechanisms. Note that different variables in the same data set can be affected by different data loss mechanisms.

3. Use a modern, principled statistical method for analyzing incomplete data, one that takes full advantage of the structure in the data and does not rely on implausible assumptions.

4. Acknowledge the reality that certain assumptions about data loss are untestable unless new data are collected.

5. Conduct a sensitivity analysis, where the extant incomplete data are reanalyzed using a different method. If the results differ appreciably from the original findings, they are not considered robust compared to alternative assumptions about missing data.

Data Loss Mechanisms and Missingness Graphs

Rubin (1976) described three basic data loss mechanisms. The least troublesome in terms of bias—and perhaps also the most unlikely in real data sets—is missing completely at random (MCAR). This means that missingness (i.e., missing yes or no) on a variable is unrelated to (1) all other variables in the data set and (2) the variable itself with no missing observations. The second point just stated means that the propensity


TOPIC BOX 4.1

Causes of Nonpositive Definite Data Matrices and Solutions


The causes of NPD data matrices described by Wothke (1993) are listed next. Some causes can be
detected through data screening:

1. Extreme bivariate or multivariate collinearity among observed variables.


2. The presence of outliers that force the values of correlations to be extremely high.
3. Sample correlations or covariances that barely fall within the limits defined by Equations 4.1 and
4.2 but nevertheless cause the analysis to fail with a warning or error message about an NPD
matrix.
4. Pairwise deletion of missing data.
5. Making a typing mistake when transcribing a data matrix from one source, such as a table in
a journal article, to another, such as a syntax file for computer analysis, can result in an NPD
data matrix. For example, if the value of a covariance in the original matrix is 15.00, then typing
150.00 in the transcribed matrix could generate an NPD matrix.
6. Plain old sampling error can generate NPD data matrices, especially in small or unrepresentative
samples.
7. Sometimes matrices of estimated Pearson correlations, such as polyserial or polychoric correla-
tions derived for observed variables that are not continuous, can be NPD.

Here are some tips for diagnosing whether a data matrix is PD before submitting it for analysis to an
SEM computer tool: Copy the full matrix (with redundant entries above and below the diagonal) into a text
(ASCII) editor, such as Microsoft Windows Notepad. Next, point your Internet browser to a free, online
matrix calculator (many are available) and then copy the data matrix into the proper window on the cal-
culating page. Finally, select options on the calculating page to derive the eigenvalues, eigenvectors, and
determinant. Look for results that indicate an NPD matrix, such as near-zero, zero, or negative eigenvalues.
A useful matrix and vector calculator is available at https://www.symbolab.com/solver/matrix-calculator.
An alternative is to analyze the matrix in R using its native matrix algebra functions (i.e., no special pack-
age is needed). An example follows.
Suppose that the covariances among variables X, W, and Y, respectively, are

1.00   .30   .65
 .30  2.00  1.15
 .65  1.15   .90

The R syntax that turns scientific notation off, defines and displays the covariance matrix just listed, and
generates the eigenvalues and eigenvectors, the determinant, and the inverse (if any) is given here. The last
command converts the covariance matrix to a correlation matrix:

options(scipen = 999)   # turn off scientific notation
a <- matrix(c(1.00, 0.30, 0.65,
              0.30, 2.00, 1.15,
              0.65, 1.15, 0.90),
            nrow = 3, byrow = TRUE)
a               # display the covariance matrix
eigen(a)        # eigenvalues and eigenvectors
det(a)          # determinant
try(solve(a))   # inverse, if any; solve() fails here because this matrix is singular
cov2cor(a)      # convert the covariance matrix to a correlation matrix

The eigenvalues generated using the R code just listed for the covariance matrix are

(2.918, .982, 0)

The third eigenvalue is zero, so the covariance matrix has no inverse and the determinant equals 0. Let us
inspect the weights for the third eigenvector, which for X, W, and Y, respectively, are

(–.408, –.408, .816)

Some online matrix calculators report the eigenvector weights as

(–1, –1, 2) or (–.5, –.5, 1)

but the values just listed are proportional to the weights computed in R. None of these weights equals zero,
so all three variables are involved in perfect collinearity. The pattern for these data is

RY•XW = RW•XY = RX•YW = 1.0

To verify this pattern, you should calculate the multiple correlations just listed from the bivariate correlations
for X, W, and Y, respectively, computed in R and presented next in lower diagonal form—see the Regres-
sion Primer on this book’s website for the equations:

1.0
.2121 1.0
.6852 .8572 1.0

The LISREL program offers an option for ridge adjustment, which multiplies the diagonal elements in a covariance matrix by a constant > 1.0 until negative eigenvalues disappear (the matrix becomes PD). These adjustments increase the variances until they are large enough to exceed any out-of-bounds covariance element in the off-diagonal part of the matrix. This technique "fixes up" a data matrix so that the necessary
algebraic operations can be performed (Wothke, 1993), but parameter estimates, standard errors, and fit
statistics are biased after ridge adjustment. A better solution is to try to solve the problem of nonpositive
definiteness through data screening. There are other contexts where you may encounter NPD matrices in
SEM, but these generally concern (1) matrices of parameter estimates for your model or (2) matrices of
correlations or covariances predicted from your model. A problem is indicated if any of these matrices is
NPD. We will deal with these contexts in later chapters.


for an observation to be missing does not depend on a case's true status on that variable. Thus, there is no systematic process anywhere that would make some data more likely to be missing than other data. If so, then the observed incomplete data are just a random sample of scores that the researcher would have analyzed if the data were complete (Enders, 2010). Results based on the complete cases only should not be biased, although power of significance tests may be reduced due to a smaller effective sample size.

Data are missing at random (MAR) when missingness is (1) correlated with other variables in the data set, but (2) does not depend on the level of the variable itself with no missing observations. This means that data are actually missing conditionally at random, or after controlling for values of other measured variables. Suppose that men are less likely to report on their health status in a particular area than women, but there is no real difference in health between men with no missing data versus men who opted not to disclose. After controlling for gender, the data loss pattern is random. The dependence of missingness solely on other observed variables explains why Mohan and Pearl (2021) used the term v-MAR, where "v" stands for "observed variables." Information lost due to an MAR process is potentially recoverable through imputation in which missing scores are replaced by values predicted from other variables in the data set. Options for imputation are described momentarily, but statistical methods suitable for the MAR data loss pattern can also be applied when the pattern is MCAR. The reverse is not true, though: Methods that assume MCAR are not generally appropriate when the missing data mechanism is MAR.

The third pattern is missing not at random (MNAR), also known as nonignorable missing data, where the probability of missingness is related to the true level of the variable itself even after controlling for other variables in the data set. That is, there is a structure to the missing data, and it cannot be treated as though it were random. Unlike the MAR pattern, which is a recoverable process, the MNAR pattern is not because it is latent (not directly measured) (Little, 2013). An example of an MNAR missing data pattern is when men who fail to answer questions about their health status are more likely to be ill than men with no missing responses on the same variable. Another is when respondents with either very low or very high incomes are less likely to answer questions about their income. Thus, the challenge presented by the MNAR pattern is that the missing values depend on information that is not directly available in the analysis. Thus, (1) results based on the complete cases only can be severely biased when the data loss pattern is MNAR, and (2) the choice of methods to deal with the missing data can make a big difference in the results. Some of this bias may be reduced if other measured variables happen to covary with unmeasured causes of data loss, but whether this is true in a particular sample is typically unknown.

Mohan and Pearl (2021) described missingness graphs or m-graphs that visually represent the data loss mechanisms just outlined. An m-graph is a special type of directed acyclic graph (DAG) in which unidirectional causal effects are represented with arrows that point from presumed causes to their outcomes. Presented in Figure 4.1(a) is a causal DAG for variables with no missing data, where Xo and Wo are specified as uncorrelated causes of Yo. Figures 4.1(b)–4.1(d) are m-graphs in which Ym is an incomplete variable with missing scores for some, but not all, cases. Variable R in the m-graphs is the missingness mechanism that determines whether a particular score is missing or not missing on the observed variable Y*, which is a proxy for Ym. The relation among Yo, R, and Y* can be summarized as follows:

Y* = Yo if R = 0, and Y* = m if R = 1    (4.3)

where m means "missing." That is, when R = 0, the true value for Y is observed, but when R = 1 that value is hidden (not observed).

In Figure 4.1(b), the data loss mechanism is MCAR because missingness occurs with no relation to any variable whether it is completely measured (Xo, Wo) or partially measured (Ym). The independencies just stated are represented in Figure 4.1(b) by the absence of arrows that point to R from all other variables. The MAR pattern is depicted in Figure 4.1(c), where missingness is specified as being caused by the fully observed variable Wo but not by the incomplete variable Ym. For example, if variable Wo in the figure is gender, then the whole m-graph says that although the likelihood of missing data varies by gender, there is no systematic difference between responders and nonresponders on the partially measured variable after controlling for gender. The m-graph in Figure 4.1(d) represents an MNAR pattern, where missingness is caused by the partially measured variable itself (i.e., Ym → R), and controlling for any fully measured variable, Xo or Wo, can neither break nor disrupt this association.

FIGURE 4.1. No missing data (a). Missingness graphs (m-graphs) under conditions of missing completely at random (MCAR) (b), missing at random (MAR) (c), and missing not at random (MNAR) (d). Variables Xo, Wo, and Yo are observed in all cases; Ym has missing data for some cases; Y* is the proxy variable that is actually measured; and R is the missingness mechanism. If R = 0, then Y* = Yo, but if R = 1, then Y* = missing. From "Graphical Models for Processing Missing Data," by K. Mohan and J. Pearl (2021), Journal of the American Statistical Association, 116(534), p. 1025. Copyright © 2021 by Taylor & Francis. Adapted with permission.

Diagnosing Missing Data

It is not easy in practice to determine whether the data loss mechanism is random or systematic, especially when each variable is measured once. One reason is that all three data loss mechanisms—MCAR, MAR, and MNAR—can be involved in causing case attrition or nonresponse, and their relative influence can change over variables in the same data set. Another reason is that while there is a way to determine whether the assumption of MCAR is reasonable, there is no specific test that provides direct evidence of either MAR or MNAR if the hypothesis of MCAR is rejected. The only way to distinguish MAR and MNAR data loss mechanisms is to measure the missing data. For example, if survey nonrespondents are later followed up by phone to get the missing information, then respondents and nonrespondents in the first round can now be compared. If these two groups differ appreciably on the recovered data, there is evidence for MNAR; otherwise, the hypothesis of MAR may be viable. Whenever the recovery of missing data is impractical, analysis of the original incomplete data is the only option.

Little (1988) described a multivariate significance test for the hypothesis of MCAR. It compares cases with observed versus missing observations on all other variables in the data set while controlling for their intercorrelations. In the bivariate case, where missing data are confined to a single variable and the other variable is continuous, the Little MCAR Test reduces to an independent-samples t test. Assuming normal distributions and homoscedasticity, the test statistic for the Little MCAR Test is distributed over large samples as a central chi-square statistic. A significant result (e.g., p < .05 when testing at the .05 level) means that the null hypothesis of MCAR is rejected; that is, the data loss pattern is either MAR or MNAR. A problem with the test is that its power can be low in small samples (i.e., the null hypothesis of MCAR is retained too often), and it can flag trivial departures from MCAR as significant in large samples.

Estimating effect sizes in comparisons of complete versus incomplete cases over other variables can help the researcher to interpret results from the Little MCAR Test. One tactic involves creating a binary (dummy) variable, where a score of "1" indicates a missing observation and a score of "0" means not missing (complete). Next, the means for the two groups just defined are compared on other variables that are continuous. Magnitudes of univariate group mean differences can be estimated using either standardized mean differences, d, or point-biserial correlations, rpb (Kline, 2013a, chap. 5). In large samples, a significant Little MCAR Test result when magnitudes of group differences on other variables are considered trivial would bolster the hypothesis of MCAR. The opposite pattern


in a small sample—a nonsignificant Little MCAR Test when magnitudes of group differences on other variables are meaningful—would provide evidence against the hypothesis of MCAR.
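
A minimal sketch of this tactic in R follows (base R only; the data frame dat and its variables X and Y are hypothetical, with missing values on Y but not on X). Implementations of the Little MCAR Test itself are available in some R packages (e.g., naniar), but the group comparisons below need nothing beyond base R:

miss <- ifelse(is.na(dat$Y), 1, 0)             # 1 = missing on Y, 0 = complete
t.test(dat$X ~ miss)                           # compare the two groups on X
diff(tapply(dat$X, miss, mean)) / sd(dat$X)    # rough standardized mean difference (d)
cor(dat$X, miss)                               # point-biserial correlation (rpb)
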
Observing differences between complete versus incomplete cases on other variables is helpful for a method that imputes scores for the incomplete cases when predicting from those other variables, assuming the data loss pattern is MAR (Figure 4.1(c)). A strategy that anticipates this pattern is to measure auxiliary variables that may not be of theoretical interest except they are expected to be potential causes or correlates of missingness on other variables. For example, family socioeconomic status (SES) could be a potential auxiliary variable in a longitudinal study, if it is expected that lower SES families are more likely to drop out of the study. Including family SES in the analysis may help to reduce bias due to differential attrition related to this variable. An auxiliary variable could also be one that simply covaries appreciably with variables that have missing observations or with their causes whether or not they are related to the missingness mechanism. Suppose that information on family SES is not available, but addresses for places of residence are measured. If place of residence, or neighborhood, is related to SES, then including residence in the analysis may recover some of the information about SES as a missing data mechanism, thus reducing bias related to SES.

Auxiliary variables require care in their selection. For example, including too many such variables in small samples can decrease precision and create downward bias in estimates of regression coefficients, especially if absolute correlations between auxiliary and other variables are < .10 (Hardt et al., 2012). Enders (2010) recommended that auxiliary variables should have absolute correlations of about .40 with incomplete variables, although that particular value is not a golden rule. Ideally, there should be no missing observations on auxiliary variables; otherwise, their potential role in recovering information due to data loss on other variables is diminished (Dong & Peng, 2013). Thoemmes and Rose (2014) described situations where auxiliary variables might actually increase bias. One case is when the auxiliary variable is an outcome of a partially measured variable and where controlling for the auxiliary variable induces a spurious association between the missing mechanism and the partially measured variable, although Rubin (2009) speculated that the pattern just described would be rare in real data sets.

CLASSICAL (OBSOLETE) METHODS FOR INCOMPLETE DATA

Classical techniques for handling missing data are simple to understand and easy to implement in statistical software, but they are increasingly seen as obsolete because such methods

1. Assume that the data loss mechanism is MCAR, and results with these methods can be seriously biased when this rather improbable assumption does not hold.

2. Take little or no advantage of structure in the data when investigating missing data patterns.

3. Basically ignore or minimize the problem by allowing an incomplete data file to be produced.

Classical methods are briefly described next so that readers can better understand their limitations, but I can't recommend their use. Now, to be fair, if the rate of missing data is very low (e.g., 1%), then it doesn't really matter whether classical or more modern techniques are used because the results will be similar. But as the rate of data loss increases, the trustworthiness of classical methods decreases, especially if the data loss pattern is not MCAR.

There are two broad categories of classical techniques: available case methods, which analyze available data through removal of incomplete cases from the analysis, and single-imputation methods, which replace each missing score with a single calculated (imputed) score. Available case methods include listwise deletion in which cases with missing scores on any variable are excluded from all analyses. The effective sample size with listwise deletion includes only cases with complete records, and this number can be much smaller than the original sample size if missing observations are scattered across many records. An advantage is that all analyses are conducted with the same cases. This is not so with pairwise deletion in which cases are excluded only if they have missing values on variables involved in a particular analysis. Suppose that N = 300 for an incomplete data set. If 250 cases have no missing scores on variables X and Y, then the effective sample size for covXY is this number. If fewer or more cases have valid scores on X and W, however, the effective sample size for covXW will not be 250. This property of pairwise deletion can give


rise to NPD data matrices, which is demonstrated in Exercise 3.
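
A toy demonstration of this property in R (base R only; the data are made up just for this illustration) is sketched next. Each correlation is computed from a different subset of four cases, and the resulting matrix has a negative eigenvalue, so it is NPD:

x <- c(1, 2, 3, 4,   1, 2, 3, 4,   NA, NA, NA, NA)
y <- c(1, 2, 3, 5,   NA, NA, NA, NA,   1, 2, 3, 5)
w <- c(NA, NA, NA, NA,   1, 2, 3, 5,   5, 3, 2, 1)
R <- cor(cbind(x, y, w), use = "pairwise.complete.obs")
R                  # each element is based on a different set of cases
eigen(R)$values    # one eigenvalue is negative, so the matrix is NPD
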
Mean substitution, where the overall sample (grand) mean replaces any missing score on that variable, is the simplest single-imputation method. A variation is group-mean substitution, where the missing score is replaced by a group mean on that variable, such as the mean for men when the record for a male participant is incomplete. Neither method takes account of the information about the individual case except group membership for group-mean substitution. Another problem is that mean substitution distorts the score distribution by reducing variability. This happens because the imputed scores equal the mean, so they contribute nothing to the sum of squares (numerator) of the variance while increasing the degrees of freedom (denominator); that is, variance decreases after imputation. Consequently, covariances are weakened and error variances can be underestimated (Liu, 2016).

Regression substitution is a bit more sophisticated because each missing score is replaced by a predicted score from regression analyses based on variables with no missing data. The method uses more information than mean substitution, but it assumes that incomplete variables can be predicted reasonably well from other variables in the data set. Error variance is still underestimated in this method because (1) the imputed score for cases having the same values on the predictors is a constant, and (2) sampling error affects predicted scores, too, and this uncertainty is not estimated in single-imputation methods. A variation is to add a randomly sampled error term from the normal distribution or other user-specified distribution in stochastic regression imputation.

Other single-imputation methods include

1. Last observation carried forward (LOCF), where in clinical trials the last observation is the most recent score for participants who drop out of the study.

2. Pattern matching, in which the computer replaces a missing observation from a case with the most similar profile on other variables.

3. Random hot-deck imputation (RHDI), which separates complete from incomplete cases; sorts both sets so that cases with similar profiles on background variables are grouped together; randomly interleaves the incomplete and complete cases; and replaces missing scores with those on the same variable from the nearest complete record.

All the methods just listed have limitations: Pattern matching and RHDI require large samples. The LOCF method assumes that patients in treatment generally improve and that the last measurement before dropout is a conservative estimate of eventual outcome. But if patients drop out because they are becoming more ill despite treatment or even die, the LOCF approach can grossly overestimate treatment efficacy (Liu, 2016).

MODERN METHODS FOR INCOMPLETE DATA

Modern techniques for handling missing data include multiple imputation (MI), which generates multiple predicted scores for each missing observation, and a special version of full information maximum likelihood (FIML) estimation for incomplete data files that neither imputes missing observations nor deletes cases. Both methods assume that the data loss mechanism is MAR, which is less strict than the assumption of MCAR by classical techniques. Depending on the SEM computer tool or procedure, one or both of the modern methods just mentioned will be available. How the special FIML option deals with missing data is described in Chapter 9.

The technique of MI uses variables of theoretical interest and optional auxiliary variables to generate for each missing observation a set of k ≥ 2 imputed scores from predictive distributions that model the data loss mechanism. The result is a total of k data sets, each with a unique imputed value for any missing observation. Next, all k imputed data sets are analyzed with standard statistical techniques, such as fitting the same structural equation model to each generated data set. Finally, the resulting k different estimates of each model parameter are pooled, and the corresponding standard errors reflect sampling error due both to case selection and imputation, and these standard errors are typically larger than those from single imputation. The three basic steps of MI just summarized are described in more detail in Appendix 4.1. Readers already very familiar with MI can skip this appendix; otherwise, the presentation there is somewhat technical, but it is worth the effort to learn more about this modern and statistically principled approach to dealing with incomplete data


files under the assumption of MAR data loss patterns. The technique has widespread application in SEM and in many other kinds of statistical analyses, too.
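
For readers who want to try the three MI steps before reading Appendix 4.1, here is a minimal sketch in R. It assumes that the mice package is installed and that mydata is a hypothetical incomplete data frame with variables X, W, and Y; pooling results over imputed data sets for lavaan models requires additional tooling (e.g., functions in the semTools package):

library(mice)
imp <- mice(mydata, m = 20, seed = 1234)   # step 1: generate k = 20 imputed data sets
fits <- with(imp, lm(Y ~ X + W))           # step 2: analyze each imputed data set
summary(pool(fits))                        # step 3: pool the estimates and standard errors
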
OTHER DATA SCREENING ISSUES

Considered next are ways to deal with extreme collinearity, outliers, violation of distributional assumptions, and very heteroscedastic data matrices.

Extreme Collinearity

Extreme collinearity—also called extreme multicollinearity—refers to high levels of interdependence among predictors of the same outcome. It can lead to inflation of standard errors, instability in the results such that small changes in sample covariance patterns can generate very different solutions, or analysis failure due to linear dependence. Extreme bivariate collinearity can occur because what appear to be separate variables really measure the same thing. Suppose that X measures accuracy and W measures speed for the same task. If rXW = .97, for example, then X and W are redundant notwithstanding their different labels (i.e., accuracy is speed, and vice versa). Either one could be included in the same regression equation, but not both. Although there is no gold standard, it seems to me that absolute bivariate correlations > .90 signal a potential problem, but others have suggested even lower thresholds, such as .80 (Abu-Bader, 2010).

Researchers can inadvertently cause extreme collinearity when total scores (composites) and their constituents are analyzed together. Suppose that a life quality questionnaire has five individual scales and a total score that is summed across all scales. Although the bivariate correlation between the total score and each of the individual scales may not be very high, the multiple correlation between the total score and the five scale scores must equal 1.0 when there is no missing data, which is multivariate collinearity in the extreme.

A straightforward method to detect extreme collinearity among three or more continuous variables is based on the variance inflation factor (VIF). It is often available in regression diagnostic procedures of computer programs for general statistical analyses. It is computed as

VIF = 1 / (1 − R²j)    (4.4)

where R²j is the proportion of variance in the jth predictor explained by all other predictors. Tolerance is 1 – R²j, or the proportion of variance that is unique, or not explained by other predictors, so VIF is the reciprocal of tolerance, and vice versa. Thompson et al. (2017) described how to calculate VIF in secondary analyses given just a correlation matrix with no raw data.

Values for the VIF range from 1.0, which indicates all predictor variance is unique, to increasingly positive values with no theoretical upper bound, which indicates higher and higher levels of collinearity. Again, there is no gold standard but some authors have suggested that VIF > 10.0 signals possible extreme collinearity (Chatterjee & Price, 1991). This threshold corresponds to R²j > .90 and tolerance < .10. Others have expressed skepticism about whether any single threshold for the VIF (10.0, 20.0, or even higher) that ignores other factors, such as sample size, is meaningful (O'Brien, 2007). Thompson et al. (2017) reminded us that any cutting point applied to the VIF should not be treated as a hard dichotomy in the sense that, for instance, one falsely believes that VIF = 9.9 versus VIF = 10.1 makes a practical difference.
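
Equation 4.4 is also easy to compute directly. The following R sketch (base R only; the data frame preds of predictor scores is hypothetical) regresses each predictor on all of the others and converts R²j to a VIF; many regression programs, such as the vif() function in the R package car, report the same quantity:

vif_by_hand <- function(preds) {
  sapply(names(preds), function(j) {
    others <- setdiff(names(preds), j)
    R2j <- summary(lm(reformulate(others, response = j), data = preds))$r.squared
    1 / (1 - R2j)   # Equation 4.4; tolerance is 1 - R2j
  })
}
vif_by_hand(preds)   # one VIF per predictor
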
Outliers

Outliers are scores that are very different from the rest. A univariate outlier is a score on a single variable that falls outside the expected population values (Mowbray et al., 2019). There is no single definition of a univariate outlier for continuous variables. For example, Tabachnick and Fidell (2019) suggested that scores that are more than 3.29 standard deviations above or below the mean are possible outliers. Expressed in terms of normal deviates, | z | > 3.29 describes this guideline. Elsewhere I suggested the more conservative heuristic of | z | > 3.0 (Kline, 2020a), but there is no magic cutoff point. A limitation is that outlier detection based on normal deviates is not robust against extreme scores—see Topic Box 4.2 for a description of an alternative method.

More important than any particular numerical threshold for detecting univariate outliers is that the researcher investigates extreme scores, which can arise due to

1. Mistakes in data entry or coding, such as typing "95" instead of "15" for a score or failing to specify that "999" means the observation is missing.

2. Intentional distortion or careless reporting, such as


when research participants respond randomly to questionnaire items as a covert way to be uncooperative or lie in response to questions about socially undesirable behaviors such as cheating or drug use.

3. Administration of measures in ways that violate standardization, such as giving examinees hints that are not part of task instructions.

4. Selection of samples that are unrepresentative of the target population or under faulty distributional assumptions.

5. Natural variation within a population, or an extreme score belongs to a case selected from a different population (Osborne, 2013).

The last point just mentioned assumes that an extreme score is correct (e.g., it is not invalid). If it can be determined that a case with univariate outliers is not from the same population, such as a graduate student who completes a survey while auditing an undergraduate class, then it is best to remove that case. It is more difficult when extreme scores come from the target population; that is, although infrequent, such scores arise naturally, so removing them could affect the generalizability of the results. The basic options are (1) do nothing; (2) remove the outlier from the analysis; (3) minimize its influence through substitution, such as converting extreme scores to a value that equals the closest scores not considered extreme (e.g., within 3.0 standard deviations from the mean); or (4) apply a monotonic transform that pulls extreme scores closer to the center of the distribution. Another option is a sensitivity analysis in which results based on different decisions about extreme scores are explicitly compared.
TOPIC BOX 4.2

Robust Univariate Outlier Detection


Suppose that scores for five cases are 19, 25, 28, 32, and 10,000. The last score (10,000) is obviously an outlier, but it so distorts the mean and standard deviation that even the more conservative | z | > 3.00 rule fails to detect it, a problem also called masking:

M = 2,020.80,  SD = 4,460.51,  and  z = (10,000 − 2,020.80) / 4,460.51 = 1.79
A more robust decision rule for detecting univariate outliers is

\frac{| X - \mathit{Mdn} |}{1.4826 \, (\mathrm{MAD})} > 2.24 \qquad (4.5)

where Mdn designates the median—which is more robust against outliers than the mean—and MAD is the median absolute deviation of all scores from the median. The product of MAD and the scaling factor 1.4826 is an unbiased estimator of σ in a normal distribution. The whole ratio is the distance between a score and the median expressed in robust standard deviation units. The constant 2.24 is the square root of the approximate 97.5th percentile in a central chi-square distribution with a single degree of freedom. A potential outlier thus has a value on Equation 4.5 that exceeds 2.24.
For the five scores in this example, Mdn = 28.00, and the absolute values of median deviations are, respectively, 9.00, 3.00, 0, 4.00, and 9,972.00. The median of the deviations just listed is MAD = 4.00, and so for X = 10,000 we calculate

\frac{9{,}972.00}{1.4826 \, (4.00)} = 1{,}681.51
which clearly exceeds 2.24 and thus detects the score of 10,000 as an outlier. See Rousseeuw and Hubert
(2018) for additional methods of robust outlier detection.
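The decision rule in Equation 4.5 is easy to apply in R because the base function mad() already multiplies the median absolute deviation by 1.4826. The short sketch below is my own illustration of the example above; it reproduces the robust distance of about 1,681.5 for the extreme score.

x <- c(19, 25, 28, 32, 10000)

# Equation 4.5: robust distance of each score from the median
robust_d <- abs(x - median(x)) / mad(x)   # mad() includes the 1.4826 scaling factor
round(robust_d, 2)

# Flag scores that exceed 2.24, the square root of the .975 quantile of chi-square(1)
which(robust_d > sqrt(qchisq(.975, df = 1)))   # flags the fifth score (10,000)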


A multivariate outlier has extreme scores on ≥ 2 variables or an atypical pattern of scores. For example, a case may have scores between 2–3 standard deviations above the mean on all variables. Although no individual score might be considered extreme, the case could be a multivariate outlier if this pattern is unusual. Here are some options for detecting multivariate outliers with no univariate outliers:

1. Some SEM computer tools, such as IBM SPSS Amos and EQS, list cases that contribute the most to multivariate nonnormality, and such cases may be multivariate outliers.

2. Calculate for each case its squared Mahalanobis distance, D², which indicates the distance in variance units between the profile of scores and the vector of sample means, or centroids, adjusting for correlations among the variables. In large samples, D² is distributed as a central chi-square with degrees of freedom equal to the number of variables. A relatively high value of D² and low p value may lead to the rejection of the null hypothesis that the case comes from the same population as the rest. A conservative level of statistical significance is usually recommended for this test, such as .001. Leys et al. (2018) described a robust version of the multivariate test just mentioned.

Visual methods to detect univariate outliers include the inspection of histograms or box plots (Mowbray et al., 2019). In both types of displays, extreme scores are represented as further away from the main body of scores, or the rest of the distribution. Tukey (1977) developed box plots, also called box-and-whisker plots, as a way to graphically display the spread of the data throughout their whole range. The outer parts of the "box" are defined by the hinges, which correspond approximately to the first quartile (Q1), or the 25th percentile, and the third quartile (Q3), or the 75th percentile. The second quartile (Q2), or the median (50th percentile), is represented by a line in the box that is parallel to the two hinges. The whiskers are lines that connect the hinges with the lowest and highest scores that do not exceed 1.5 times the positive difference between the hinges (i.e., approximately 1.5 times the interquartile range, or Q3 – Q1). Any scores that fall outside of the limits just stated—the lower and upper fences—are represented as outliers. Exercise 4 asks you to generate a boxplot for a small data set with an extreme score. Majewska (2015) described graphical displays for multivariate outliers based on robust D² statistics. Because interpretation of graphical displays can be rather subjective, they are not substitutes for numerical methods.
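A minimal R sketch of the D² screening described in option 2 follows; the data frame dat is a hypothetical stand-in for the researcher's raw data, and the .001 level reflects the conservative significance threshold mentioned above.

# Hypothetical complete data on four continuous variables
set.seed(2)
dat <- as.data.frame(MASS::mvrnorm(n = 500, mu = rep(0, 4), Sigma = diag(4)))

# Squared Mahalanobis distance of each case from the centroid (vector of means)
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# In large samples, D2 is approximately chi-square with df = number of variables
p <- pchisq(d2, df = ncol(dat), lower.tail = FALSE)
which(p < .001)   # cases flagged as possible multivariate outliers, if any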


Distributions

The default method in most SEM computer tools is a form of ML estimation for either complete raw data files or summary matrices (e.g., Table 4.1) that assumes multivariate normality, also called multinormality, for continuous outcome (endogenous) variables. This means that

1. All the univariate frequency distributions are normal.

2. All joint distributions of any pair of variables are bivariate normal; that is, each variable is normally distributed for each value of every other variable.

3. All bivariate scatterplots are linear with homoscedastic residuals.

Other variations on the definition of multivariate normality in SEM are described in Chapter 9 on global estimation methods. Because it is often impractical to examine all joint frequency distributions, it can be difficult to assess all aspects of multivariate normality. Fortunately, many instances of multivariate nonnormality can be detected through evaluation of univariate frequency distributions. This is because univariate normality is necessary but insufficient to guarantee multivariate normality (Pituch & Stevens, 2016).

Early SEM computer programs had few estimators other than default ML, but the situation today is very different: Most SEM computer programs now offer multiple estimators that accommodate different kinds of outcome variables, such as continuous, ordinal, binary, count, or censored variables, with various types of distributions, such as normal or nonnormal distributions for continuous outcomes and Poisson distributions for count variables, and so on.3 A robust option for ML estimation for continuous outcomes with nonnormal distributions is described in Chapter 9, and other options for noncontinuous outcomes are covered later in the book. Thus, normality assumptions are less critical in modern SEM. Instead, the challenge is for the researcher to select an appropriate estimator, given the distributional characteristics of their data.

3 A count variable is the number of times a discrete event happens over a period of time, such as the number of hospitalizations over the past 5 years. In a Poisson distribution, the mean and variance are equal. A censored variable is one for which values occur outside the range of measurement, such as a scale that registers the value of weight between 1 and 300 pounds only.

But there are still occasions in SEM when data should be screened for multivariate normality. One is when using the default ML method. Note that raw data are required to evaluate multivariate normality. If just a covariance matrix is submitted for analysis with default ML, it must be assumed that the original distributions for continuous outcomes are multivariate normal. Other occasions arise when using a method of MI that assumes multivariate normality or when applying the original Little MCAR Test. Quantitative measures of nonnormality are described next, with the assumption that readers already know how to inspect scatterplots, distributions of regression residuals, or other kinds of graphical displays or numerical summaries used with bivariate linearity and homoscedasticity; otherwise, see Cohen et al. (2003, chap. 4).

Significance tests intended to detect multivariate nonnormality, such as Mardia's (1970) test, or detect univariate nonnormality, such as the Kolmogorov–Smirnov (K–S) test, among others (Oppong et al., 2016), have limited usefulness. One reason is that slight departures from normality could be significant in large samples, and power in small samples may be low, so larger departures could be missed. An alternative for assessing univariate normality is to compute quantitative measures of skewness and excess kurtosis, the two ways a distribution can be nonnormal; they can occur either separately or together in the same variable. A normal distribution has zero skewness and excess kurtosis.

Skewness is the degree of asymmetry in a probability distribution. It is defined as the third standardized moment, which in the population is

g_1 = \frac{m_3}{\sigma^3} \qquad (4.6)

where σ is the population standard deviation and m_3 is the third central moment, a particular instance of the rth moment defined as

m_r = \frac{\sum (X - \mu)^r}{N} \qquad (4.7)

where μ is the population mean and r is an integer ≥ 1. In Equation 4.7, r = 3 for the third moment. In symmetrical distributions, the sum of deviations raised to the third power for scores above the mean will balance the sum of deviations raised to the same power for scores below the mean, so m_3 = g_1 = 0 in a normal curve. In unimodal distributions where most of the scores are below the mean such that the distribution has a longer right tail than the left tail, then m_3 > 0 and g_1 > 0, which indicates positive skew. But if most scores are above the mean such that the left tail is longer than the right tail, then m_3 < 0 and g_1 < 0, which indicates negative skew.

In unimodal distributions, kurtosis concerns the combined weight of both tails relative to the center of the distribution. Thus, kurtosis measures the relative presence of outliers in both tails of the distribution compared with a normal distribution. It is the fourth standardized moment, which in the population distribution is

g_2 = \frac{m_4}{\sigma^4} \qquad (4.8)

where m_4 is the fourth moment about the mean, defined by Equation 4.7 for r = 4. The expected value for g_2 in a normal distribution is 3.0. Excess kurtosis is defined as

g_2 - 3 \qquad (4.9)

so that its value equals 0 in a normal distribution. Thus, positive excess kurtosis, or

g_2 - 3 > 0

means that the tails of the distribution are heavier relative to a normal curve, and such distributions are called leptokurtic. Negative excess kurtosis, or

g_2 - 3 < 0

means just the opposite (lighter tails), and distributions with this characteristic are called platykurtic. The term mesokurtic refers to distributions with zero excess kurtosis, such as the normal curve. All references to "kurtosis" from this point concern excess kurtosis. Skewed distributions are generally leptokurtic.

Joanes and Gill (1998) described the three sample estimators for skew and kurtosis listed in Table 4.2 that could be printed by software for general statistical analyses. Statistics g1 for skew and g2 for kurtosis are based on sample moments and standard deviations computed as S with N in the denominator.


Other statistics in the table feature adjustments for small sample size in the computation of sample standard deviations as s with N – 1 in the denominator (e.g., b1, b2) or in calculations for central moments (e.g., G1, G2). In normal population distributions, all three skewness statistics in Table 4.2 are unbiased, but only G2 for kurtosis is unbiased in such distributions. Also shown in the table are the relative values of error variances for each statistic in normal distributions. Estimators G1 for skew and G2 for kurtosis tend to have the greatest expected variation over random samples, but differences among their values and error variances narrow for sample sizes > 100. In very small samples, though, differences among these estimators can be striking, including results for kurtosis that are both positive and negative in the same variable—see Cain et al. (2017) for more information.

Just as there are no golden rules for detecting outliers, there are also no universal absolute values for skewness or kurtosis statistics that indicate severe nonnormality. Finney and DiStefano (2013) noted that absolute univariate skewness and kurtosis values greater than, respectively, 2.0 and 7.0, have been described as indicating severe nonnormality in some computer simulation studies, but exceptions are easy to find. For example, Lei and Lomax (2005) treated absolute skewness and kurtosis values > 2.30 as indicating severe nonnormality. The point is that there is no magic demarcation between trivial and appreciable nonnormality that will fit all models and data sets, but the assumption of normality becomes increasingly less plausible as there is more and more skewness or kurtosis. Exercise 5 addresses a common misinterpretation in this area. Significance testing where skewness or kurtosis statistics are divided by their standard errors is another method, but it is problematic for reasons already stated (e.g., low power in small samples).

Normalizing transformations (normalization), or monotonic arithmetic operations that compress some parts of a distribution more than others while preserving rank order so that the transformed scores are more normally distributed, are an option, but you should think about the variables of interest.

TABLE 4.2. Estimators of Skewness and Kurtosis and Relative Error Variances for Normal Samples

Statistic    Equation                                         Unbiased    Error variance rank

Skewness
  g1         m3 / S^3                                         Yes         2
  G1         g1 × sqrt[N(N − 1)] / (N − 2)                    Yes         1
  b1         m3 / s^3                                         Yes         3

Kurtosis
  g2         m4 / S^4 − 3                                     No          3
  G2         [(N − 1) / ((N − 2)(N − 3))] × [(N + 1) g2 + 6]  Yes         1
  b2         m4 / s^4 − 3                                     No          2

Note. Error variances are ranked from highest to lowest; m3 = Σ(X – M)^3/N; m4 = Σ(X – M)^4/N; and S and s are the sample standard deviations computed with, respectively, N or N – 1 in the denominator.
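To show exactly how the Table 4.2 estimators are related, here is a short R sketch that computes all six statistics from their defining formulas; the scores in x are arbitrary values chosen only for illustration.

x <- c(19, 25, 28, 32, 41)
N  <- length(x)
m2 <- mean((x - mean(x))^2)   # second central moment
m3 <- mean((x - mean(x))^3)   # third central moment
m4 <- mean((x - mean(x))^4)   # fourth central moment
S  <- sqrt(m2)                # standard deviation with N in the denominator
s  <- sd(x)                   # standard deviation with N - 1 in the denominator

g1 <- m3 / S^3
G1 <- g1 * sqrt(N * (N - 1)) / (N - 2)
b1 <- m3 / s^3
g2 <- m4 / S^4 - 3
G2 <- ((N - 1) / ((N - 2) * (N - 3))) * ((N + 1) * g2 + 6)
b2 <- m4 / s^4 - 3
round(c(g1 = g1, G1 = G1, b1 = b1, g2 = g2, G2 = G2, b2 = b2), 3)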


Some variables, including reaction times, reports of alcohol or drug use, and health care costs, are expected to have nonnormal distributions (Bono et al., 2017). Normalizing an inherently nonnormal variable could mean that the target variable is not actually studied. Another consideration is whether the metric of the original variable is meaningful, such as postoperative survival time in years. Transformation means that the original meaningful metric is lost, which could be a sacrifice. Described in Topic Box 4.3 are types of normalizing transformations that might work—there is no guarantee—with practical suggestions for using them. Exercise 6 asks you to find a normalizing transformation for a small data set.

TOPIC BOX 4.3

Normalizing Transformations
Three kinds of normalizing transformations are described next with suggestions for their use:

1. Positive skewness. Before applying these transformations, add a constant to the scores so that the lowest modified score is 1.0. A basic operation is the square root transformation, or X^1/2, which works by compressing differences between scores in the upper end of the distribution more than the differences between lower scores. Logarithmic transformations are another option. A logarithm is a power (exponent) to which a base number must be raised to get the original number, such as 10^2 = 100, so the logarithm of 100 in base 10 is 2. Distributions with extremely high scores may require a transformation with a higher base, such as log_10 X, but a lower base may suffice for less extreme cases, such as the natural base e (approximately 2.7183) for the transformation log_e X = ln X. The inverse function 1/X is an option for even more severe skewness. Because inverting scores reverses their order, (1) reflect (reverse) the original scores (multiply them by –1.0) and (2) add a constant to the reflected scores so that the maximum score is at least 1.0 before taking the inverse.

2. Negative skewness. All the transformations just mentioned also work for negative skewness when they are applied as follows: First, reflect the scores, and then add a constant so that the lowest score equals 1.0. Next, apply the transformation, and reflect the scores again to restore their original order.

3. Other types of nonnormality. Odd-root functions (e.g., X^1/3) and sine functions tend to bring in outliers from both tails of the distribution toward the mean. Odd-powered polynomial functions, such as X^3, may help for negative kurtosis. If the scores are proportions, the arcsine square root transformation, or arcsin X^1/2, may help to normalize the distribution.

There are other types of normalizing functions, and this is one of their problems: It can be difficult to
find a transformation that works with a particular distribution. The Box–Cox transformations (Box &
Cox, 1964) may require less trial and error. The most common form is defined next for positive scores only:

X^{(\lambda)} = \begin{cases} \dfrac{X^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log X, & \text{if } \lambda = 0 \end{cases} \qquad (4.10)

where the exponent λ is a constant that normalizes the scores. Computer software for Box–Cox transformations attempts to find the optimal value of λ for a particular distribution. There are other variations of the Box–Cox transformation, some of which can be applied in regression analyses to deal with heteroscedasticity (Osborne, 2013).
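Here is a brief R sketch of these ideas applied to a hypothetical positively skewed variable; the boxcox() function in the MASS package is used to search for λ in Equation 4.10 over a grid of candidate values. The variable and object names are mine and only illustrative.

library(MASS)

set.seed(3)
x <- rexp(300, rate = .5) + 1            # hypothetical positively skewed scores, minimum > 1

x_sqrt <- sqrt(x)                        # square root transformation
x_log  <- log10(x)                       # base-10 logarithmic transformation

# Box-Cox: profile the log-likelihood over a grid of lambda values
bc <- boxcox(lm(x ~ 1), lambda = seq(-2, 2, by = .1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]          # lambda with the highest log-likelihood
x_bc <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda   # Equation 4.10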


Relative Variances

In an ill-scaled covariance matrix, the ratio of the largest to the smallest variance is greater than, say, 100.0. Most estimation methods in SEM are iterative, which means that initial estimates are derived by the computer and then modified through subsequent cycles of calculation. The goal is to derive better estimates at each stage, estimates that progressively improve the fit between model and data. When improvements from step to step become sufficiently small (i.e., they fall below the convergence criterion), iteration stops because the solution is stable. But if the estimates do not settle down to stable values, the process may fail. One cause is variances that are very different: When the computer adjusts the estimates from one step to the next in an iterative method, the sizes of these changes may be huge for variables with small variances but trivial for variables with large variances. The whole process may head toward worse rather than better fit.

To prevent the problem just described, variables with extremely low or high variances can be rescaled by multiplying their scores by a constant, which changes the variance by a factor that equals the squared constant. For example,

s_X^2 = 12.0 \quad \text{and} \quad s_Y^2 = .12

so their variances differ by a factor of 100.0. Using the constant .10, we can rescale X as follows:

s_{.10X}^2 = .10^2 \times 12.0 = .12

so now variables .10X and Y have the same variance, or .12. Next, we rescale Y so that it has the same variance as X, or 12.0, by applying the constant 10.0, or

s_{10.0Y}^2 = 10.0^2 \times .12 = 12.0

Multiplying a variable by a constant is a linear transformation that changes its average and variance but not its correlation with other variables. This is because linear transformations do not alter relative distances between scores. An example with real data follows.

Roth et al. (1989) administered measures of exercise, hardiness (resiliency, tough mindedness; referred to as "hardy" from this point), fitness, stress, and level of illness in a sample of 373 university students. Table 4.3 provides a summary matrix of these data. The largest and smallest variances in this matrix (see the table) differ by a factor of > 27,000, so the covariance matrix is ill-scaled. I have seen older SEM computer tools fail to analyze this matrix due to this characteristic. To prevent this problem, I multiplied the original variables by the constants listed in the table (including 1.0; i.e., no change) in order to make their variances more uniform. Among the rescaled variances, the largest variance is only about 13 times greater than the smallest variance. The rescaled matrix is not ill-scaled.

SUMMARY

The 80/20 rule of data analysis is a variation on the Pareto principle, named after the Italian economist Vilfredo Pareto, that 80% of a nation's wealth is owned by 20% of the people. In the context of data analysis, it means that researchers should invest more time (i.e., at least four times as much) screening and preparing their data than actually conducting the substantive analyses. Data screening in SEM includes the evaluation of missing data patterns and extent; detection of outliers, extreme collinearity, and whether data matrices are ill-scaled; assessment of whether distributional assumptions for particular estimators are consistent with the data; and complete reporting in written summaries of the results on how particular problems were dealt with in the analysis. Making the data file available to other researchers is a strong statement of transparency.

LEARN MORE

Dong and Peng (2013) offer clear and accessible descriptions of techniques for missing data, Manly and Wells (2015) describe best practices for reporting about the use of MI, and van Ginkel et al. (2020) review common misunderstandings about MI.

Dong, Y., & Peng, C.-Y. J. (2013). Principled missing data methods for researchers. SpringerPlus, 2(1), Article 222.

Manly, C. A., & Wells, R. S. (2015). Reporting the use of multiple imputation for missing data in higher education research. Research in Higher Education, 56(4), 397–409.

van Ginkel, J. R., Linting, M., Rippe, R. C. A., & van der Voort, A. (2020). Rebutting existing misconceptions about multiple imputation as a method for handling missing data. Journal of Personality Assessment, 102(3), 297–308.


TABLE 4.3. Example of an Ill-Scaled Data Matrix

Variable        1         2         3        4         5
1. Exercise     —
2. Hardy        –.03      —
3. Fitness       .39      .07      —
4. Stress       –.05     –.23     –.13      —
5. Illness      –.08     –.16     –.29      .34       —

Original M      40.90     0.00    67.10     4.80     716.70
Original s2     4,422.25  14.44   338.56    44.89    390,375.04
Constant        1.00      10.00   1.00      5.00     .10
Rescaled M      40.90     0.00    67.10     24.00    71.67
Rescaled s2     4,422.25  1,444.00  338.56  1,122.25  3,903.75
Rescaled SD     66.50     38.00   18.40     33.50    62.48

Note. These data (correlations, means, and variances) are from Roth et al. (1989); N = 373. Note that low scores on the hardy measure used by these authors indicate greater hardiness. To avoid confusion due to negative correlations, the signs of the correlations that involve the hardy measure were reversed before they were recorded in this table.
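The following R sketch (mine, not from the book's files) carries out the rescaling in Table 4.3 at the level of the covariance matrix—multiplying a variable's scores by a constant multiplies its row and column of the covariance matrix by the same constant—and confirms that the correlations are unchanged.

v  <- c("exercise", "hardy", "fitness", "stress", "illness")
R  <- matrix(c( 1.00, -.03,  .39, -.05, -.08,
               -.03, 1.00,  .07, -.23, -.16,
                .39,  .07, 1.00, -.13, -.29,
               -.05, -.23, -.13, 1.00,  .34,
               -.08, -.16, -.29,  .34, 1.00), nrow = 5, dimnames = list(v, v))
sd_orig  <- sqrt(c(4422.25, 14.44, 338.56, 44.89, 390375.04))
constant <- c(1, 10, 1, 5, .10)

S_orig <- diag(sd_orig) %*% R %*% diag(sd_orig)          # original covariance matrix (ill-scaled)
S_new  <- diag(constant) %*% S_orig %*% diag(constant)   # covariance matrix after rescaling
max(diag(S_new)) / min(diag(S_new))                      # ratio of variances is now about 13
all.equal(cov2cor(S_new), R, check.attributes = FALSE)   # correlations are unchanged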

EXERCISES

1. Reproduce the covariance matrix in the middle of Table 4.1 from the correlations and standard deviations at the bottom of the table.

2. Given cov_XY = 13.00, s_X^2 = 12.00, and s_Y^2 = 10.00, show that the corresponding correlation is out of bounds.

3. Listed next as (X, W, Y) with missing observations indicated by "—" are scores for 7 cases:

(42, 13, 8), (34, 12, 10), (22, —, 12), (—, 8, 14), (24, 7, 16), (16, 10, —), (30, 10, —)

Compute the covariance matrix for these data using pairwise deletion. Show that the resulting covariance matrix is NPD and also that the corresponding correlation matrix has an out-of-bounds value.

4. Generate a box plot for the N = 63 scores listed next. Locate in your figure values for the extreme of the lower whisker, lower hinge (H1), median, upper hinge (H2), extreme for the upper whisker, and any outlier beyond 1.5 × (H2 – H1) from its respective hinge:

Score   Frequency      Score   Frequency
10      5              15      5
11      15             16      4
12      14             17      1
13      13             27      1
14      5

5. A researcher finds for a continuous variable that skewness is 1.97 and kurtosis is 6.90, and concludes that the distribution is normal. Comment.

6. Find a normalizing transformation for the data in Exercise 4. Do not remove the outlier.

Appendix 4.A

Steps of Multiple Imputation

The three basic steps in MI—the imputation step, the analysis (estimation) step, and the pooling (combination) step—and corresponding decisions required at each point are summarized next. It is not possible in this overview to give detailed descriptions about statistical options, especially in the imputation step, but see Allison (2012), Dong and Peng (2013), van Buuren (2018), or van Ginkel et al. (2020) for more information.

STEP 1. IMPUTATION

The imputation step involves the analysis of the variables that make up the imputation model and also the method by which random variation is modeled and incorporated in generating imputed scores. That method should match the level of measurement for incomplete variables (e.g., continuous vs. categorical), whether data loss is univariate or multivariate (occurs on a single vs. multiple variables), and whether the pattern is monotone or general. Monotone missing data means that variables can be ordered and data loss at a particular point means that all subsequent observations are missing. Dropout in a longitudinal study is an example of monotone data loss; any other pattern is general (nonmonotone) missing data.

An option for univariate data loss is the regression method, which is based on linear regression and assumes multivariate normal distributions for continuous variables. It works by regressing an incomplete variable on other variables with complete data. Next, imputed regression coefficients are selected from the sampling distributions for the coefficients in the first analysis, and imputed scores are generated from the values of the imputed coefficients and a random error term. These steps are repeated k times, the number of imputed data sets. A variation is the predictive mean matching method, which does not require normality and imputes values "borrowed" from other records called donor cases. This method generates predicted scores for all cases, including those with complete data, and randomly selects an imputed score from among cases with observed scores similar to the predicted score for an incomplete case. The number of complete cases in the donor set can be specified by the researcher with about 5–10 cases as a typical range. In small samples, though, 10 cases in the donor set may include too many dissimilar records, especially for few predictors (Allison, 2015).

A method for multivariate data loss is fully conditional specification, also called multivariate imputation by chained equations (MICE) and sequential regression multivariate imputation. It generates a series of conditional models, one for each incomplete variable in the data set, and the method requires specification of the imputation model for each incomplete variable. Because the method is based on the separate distributions of each incomplete variable, it can be applied to continuous variables with nonnormal distributions or to variables that are not continuous, such as ordinal data. Initial imputed values are randomly selected from the observed data, and these values are subsequently improved for each incomplete variable through iterative selection from conditional distributions estimated by the observed and imputed scores. The whole process then cycles through all incomplete variables. The method is flexible but can converge to incompatible conditional models depending on the order of the univariate imputation steps (van Buuren, 2018).

Another multivariate method for arbitrary data loss patterns is based on the Markov Chain Monte Carlo (MCMC) approach that randomly samples from theoretical probability distributions, in this case from predictive distributions for the missing data, and these draws become the imputed scores. A Markov Chain is a probability model in which the likelihood of an event depends only on the state of the previous event. That is, the probability of a future event can be estimated just as well from the current state of the event as when knowing the full history of events. A "chain" thus concerns multiple simulated random draws from the same distribution.

In the application of the MCMC method in MI, it is assumed that the underlying complete data follow a multivariate normal distribution, and the computer simulates draws from such distributions over an iterative sequence of paired steps called the I-step (imputation, but not meaning the first step in MI) and the P-step (posterior). At the I-step, imputed scores are drawn for each incomplete case from the predictive distribution in the current iteration.


Next, in the P-step, the parameters of the predictive distribution are updated by draws from the posterior distribution. The "chain" consists of I-step/P-step pairs over iterations. Some implementations allow for a burn-in period, or default number of iterations (e.g., 200) before the first set of imputed values is drawn. One rationale for this option is to dissipate the effects of the distribution from the prior iteration that may differ appreciably from those of the target distribution, but burn-in is not an inherent feature of the MCMC method, and there are other methods to find good starting points for drawing imputed scores (Geyer, 2011).

The MCMC method is generally robust against multivariate nonnormality in large samples (Demirtas et al., 2008), but its use can be problematic with categorical variables. For example, the practice of imputing values based on normal distributions and then rounding to the nearest integer or to the nearest plausible value can yield very biased results for categorical variables (Allison, 2012). The chained equations method (MICE algorithm) described earlier is an alternative. Another possibility is to use a method for the imputation step based on logistic, loglinear, or other statistical models for categorical variables—see Audigier et al. (2017) for more information and examples.

Yet another multivariate method for arbitrary missing data patterns is the expectation–maximization (EM) algorithm, which is a general purpose iterative procedure to find ML estimates for parameters that involve latent variables. In the context of MI, the method alternates between the E-step (expectation), in which missing scores are imputed from predicted distributions for the missing data, and the M-step (maximization), where parameters for the distribution of both the missing and observed data, such as means and covariances, are estimated using ML before the cycle repeats over the E- and M-steps until there are k imputed scores for each missing observation. Because the EM algorithm is a general method, it can be used outside of MI to directly estimate parameters for latent variables without imputing scores for individual cases (Dempster et al., 1977). Estimation of standard errors may be more accurate in the FIML method for incomplete data sets that is described in Chapter 9 (Dong & Peng, 2013).

Besides the algorithm, the researcher must also specify the number of scores to be imputed for each missing observation, or k. Suggestions in older works were roughly k = 3–10 based on the relative efficiency of imputing these numbers of times against a theoretical infinite number of imputations (Patrician, 2002). Greater numbers of imputations, such as k = 100–200, are recommended in more recent works (Graham et al., 2007; Little, 2013) based on statistical power and the fraction of missing information (FMI), which is not the rate of missing data. The FMI is instead the proportion of variation in parameter estimates due to nonresponse. That is, it quantifies the amount of parameter information lost to nonresponse (Lang & Little, 2018), so in this way the FMI is analogous to an R² statistic for missing data (Enders, 2010). For example, if FMI = .03 for a particular parameter, then the loss of efficiency due to missing data is 3%; that is, the estimate based on incomplete data is 97% as efficient compared to what it would have been with no missing data (Savalei & Rhemtulla, 2012). Wagner (2010) suggested that researchers should consistently report the value of the FMI when analyzing incomplete data sets.

The FMI is a function of k and the ratio of the between-imputation variance over the total variance, or the sum of between-imputation variance and the within-imputation variance. The within-imputation variance is the average error variance associated with parameter estimates within all imputed data sets. Conceptually, it estimates what the error variance would be if there were no missing scores. The between-imputation variance is the variation in estimates over the k imputations. As more information is lost due to missing data, the between variance will increase, but when the data set includes good predictors of the variables with missing data, both the between variance and the FMI are expected to decrease (Wagner, 2010). In general, the greater the covariances among the observed variables, the lower the value of the FMI (Little, 2013). Dong and Peng (2013, p. 5) described equations for computing the FMI, and some computer procedures for MI print the value of this statistic in the output.

STEP 2. ANALYSIS

After k complete data sets are generated in the imputation step, next comes the analysis step, which concerns the analysis model. Parameters are estimated for effects of substantive interest in each of the k imputed data sets. Ideally, the variables in the imputation model would be the same as those in the analysis model except for auxiliary variables (if any).


But as the imputation and analysis models are based on increasingly smaller sets of overlapping variables, then results from the imputation step may not be very meaningful in the analysis step.

The imputation model should also be general enough in terms of assumptions about distributions or functional relations to reflect the data in the analysis model. Suppose that normality is assumed for a variable in the imputation model, but the actual distribution is severely nonnormal. Results in the analysis model based on means and covariances, such as regression coefficients, may not be grossly inaccurate, but estimates of p values or bounds for confidence intervals could be severely distorted (Schafer, 1999). Enders (2010) offers suggestions for representing interactive effects of continuous or categorical variables in both models.

STEP 3. COMBINATION

Results from the analysis step from each of the k imputed data sets are synthesized in the combination (pooling) step. The final parameter estimate is the average over the k results for that parameter, and its standard error is estimated based on both the within- and between-imputation variances. You should also be aware of the issues listed next:

1. There is no unique set of results in MI: This is because imputed scores are generated through simulated random sampling. Thus, results for the same model and incomplete data file can and will change each time the procedure is repeated unless a random seed that determines the starting point for generating random numbers is specified.

2. Indeterminacy also applies to the data: The researcher's model is analyzed in k imputed data sets, so there is no definitive summary data matrix. Thus, the results could not be reproduced in a secondary analysis conducted with a single data matrix (all k matrices would be required). But the researcher can still make available the original raw data file.

3. There is little doubt that MI generates better estimates than classical techniques in reasonably large samples when the data loss pattern is MAR instead of MCAR (Schafer & Graham, 2002): It might also reduce bias compared with classical techniques when data loss patterns are both MAR and MNAR, if variables can be added that measure the missing data mechanism, but there is no guarantee (van Ginkel et al., 2020).

There are special statistical techniques for analyzing MNAR data (Rubin & Little, 2020), but they are not yet widely used and are more difficult to apply than methods that assume MAR. Methods for MNAR data loss mechanisms are becoming available in SEM computer tools such as Mplus. Galimard et al. (2016) described adaptations of MI for MNAR mechanisms.
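To tie the three steps together, here is a minimal R sketch using the mice package; the incomplete data frame dat, the analysis model (a single regression equation), and the choice of k = 50 imputations are hypothetical placeholders for illustration only.

library(mice)

# Step 1 (imputation): k = 50 imputed data sets via predictive mean matching;
# setting a seed addresses the reproducibility issue noted in point 1 above
imp <- mice(dat, m = 50, method = "pmm", seed = 2023, printFlag = FALSE)

# Step 2 (analysis): estimate the analysis model in each imputed data set
fits <- with(imp, lm(y ~ x1 + x2))

# Step 3 (pooling): combine the k sets of estimates with Rubin's rules
pooled <- pool(fits)
summary(pooled)      # pooled estimates, standard errors, and significance tests
pooled$pooled$fmi    # fraction of missing information for each parameter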



5

Computer Tools

Two categories of computer tools for traditional SEM are described in this chapter: freely available software
and commercial software. Free software includes packages for conducting SEM analyses in the R comput-
ing environment (e.g., lavaan, semTools) and stand-alone computer tools with a graphical user interface
(GUI) that require no larger software environment (e.g., JASP, Ωnyx). Free computer tools have become
increasingly capable to the point where they can replace commercial tools for basically all but the very most
advanced kinds of analyses. Modern SEM computer programs—both free and commercial—are generally
easier to use than their predecessors. But greater user-friendliness of contemporary SEM computer tools
should not lull the researcher into thinking that SEM is easy or requires minimal conceptual understanding.
Features of computer programs can change quickly with new versions, so check the sources listed next for
the most up-to-date descriptions. Computer tools for nonparametric SEM are described in Chapter 6, and
Chapter 16 covers software options for composite SEM. In a work describing the ideas of the Canadian
communication theorist Herbert Marshall McLuhan, Culkin (1967, p. 70) wrote, “We shape our tools and
thereafter they shape us.” I hope that computer use sharpens, rather than dulls, your ability to think critically
about SEM.

EASE OF USE, NOT SUSPENSION OF JUDGMENT

The first widely available SEM computer tool was LISREL III (Jöreskog & Sörbom, 1976). At the time, LISREL and related applications were not easy to use because they required the generation of rather arcane code, and were available only on mainframe computers with stark command-line user interfaces. The abundance of relatively inexpensive, yet capable, personal computers greatly changed things. Statistical software with a GUI is generally easier to use than its character-based counterparts. Indeed, user-friendliness in modern SEM computer tools is a near-revolution compared with older programs.

Most SEM computer tools still permit the user to write code in that application's native syntax. Some programs offer the alternative of specifying the model by drawing it onscreen using geometric symbols such as boxes, circles, and lines with arrowheads on one or both ends. Next, the program automatically translates the model graphic into lines of code, which are then executed. Thus, (1) the user need not know very much (if anything) about how to write syntax in order to run an SEM analysis, and (2) the role for technical programming skills is reduced. For researchers who understand the basic concepts of SEM, this development can only be a plus—anything that reduces the drudgery and gets one to the results quicker is a benefit.

But there are potential drawbacks to push-button modeling. For example, no- or low-effort programming could encourage the use of SEM in uninformed or careless ways. This is why it is more important than ever to be familiar with the conceptual and statistical bases of SEM. Computer programs, however easy to use, should only be the tools of your knowledge and not its master.


Steiger (2001) noted that emphasis on ease of use of statistical computer tools can give beginners the impression that SEM is easy, but the reality is that things can and do go wrong in SEM. Beginners often quickly realize that analyses fail because of technical problems, including a terminated program run with cryptic error messages or uninterpretable output. These things happen because (1) actual research problems can be quite technical, and the availability of user-friendly software does not change this fact. Also, (2) computer tools are not perfect and, thus, are incapable of detecting or preventing all failure conditions. That is the reason why this book places so much emphasis on conceptual knowledge instead of teaching you how to use a particular computer tool: In order to deal with problems in the analysis when—not if—they occur, you must understand what went wrong and why.

HUMAN–COMPUTER INTERACTION

There are three basic ways to interact with SEM computer tools:

1. Batch processing, where the user writes syntax that specifies the model, data, analysis, and output. Next, the syntax is executed through some form of a "run" command.

2. Drawing editor, where the user draws the model on screen. When the diagram is finished, the analysis is run in the GUI.

3. Templates or menus, where the model and analysis are specified as the user clicks with the mouse on interface elements such as text fields, pull-down menus, or radio buttons.

Batch mode is for users who know the syntax for a particular SEM computer tool. In contrast, knowledge of syntax is generally unnecessary when using a drawing editor or templates. But even here some knowledge of syntax can help. In LISREL, for example, both its drawing editor and template-based mode of interaction automatically write syntax in a window that must be run by the user. That syntax can be edited, and sometimes a problem in model specification is apparent in the syntax, which the user can correct, if they understand the syntax. In contrast, the drawing editor in IBM SPSS Amos analyzes the model and data without generating editable (or even viewable) syntax (Arbuckle, 2021), although Amos does offer a separate syntax editor. Although drawing editors are popular with beginners, there are potential drawbacks—see Topic Box 5.1.

TIPS FOR SEM PROGRAMMING

Listed next are suggestions for using SEM computer tools; see also Little (2013, pp. 25–27):

1. Learn from the examples of others. Annotated syntax, data, and output files for all detailed analyses can be downloaded from the book's website. Readers can open these files on their own computers without installing any special software. There are additional online resources with syntax examples for lavaan,1 Mplus,2 LISREL,3 and other SEM computer tools.

2. Annotate your syntax files. Comments are usually designated by special symbols, such as *, #, or !, that are ignored by the computer. Use comments to describe the specified model, data, and requested output. Explain the creation of any new variable. Such information is useful for colleagues or students who did not conduct the analysis, but who need to understand it. Annotation also helps researchers to know just what they did in a particular analysis days, weeks, months, or even years ago. Without sufficient comments, one quickly forgets.

3. Keep it simple. Sometimes beginners try to analyze models that are too complicated, and such analyses are more likely to fail. With more syntax or screen space in a drawing editor for a complex model, there are also more opportunities for making mistakes. It can also be hard to tell whether a very complex model is identified. If the researcher does not know that their model is not really identified, the failure of the analysis may be wrongly attributed to a syntax error or problem with the data.

4. Build it up. Start instead with a simpler model that you know is identified. Try to get the analysis of the initial model to successfully run. Then build up

1 https://www.lavaan.ugent.be/tutorial/index.html
2 https://www.statmodel.com/ugexcerpts.shtml
3 https://ssicentral.com/index.php/products/lisrel/lisrel-examples/
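Before continuing with the remaining tips, here is a brief illustration of tips 1 and 2: a hypothetical, annotated lavaan model syntax. The variable names echo the Roth et al. (1989) measures in Table 4.3, but the model and code are only a sketch of how comments can document an analysis, not an example from the book's files.

library(lavaan)

# Path model for the (hypothetical) raw data frame roth:
# text after the # symbol is a comment and is ignored by the computer
model <- '
  # structural part: fitness and stress are specified as direct causes of illness
  illness ~ fitness + stress
  # exercise is specified as a direct cause of fitness
  fitness ~ exercise
'
# fit <- sem(model, data = roth)        # roth is a placeholder for the data file
# summary(fit, fit.measures = TRUE)     # request parameter estimates and fit statistics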


TOPIC BOX 5.1

Graphical Isn’t Always Better


The potential drawbacks of graphical editors in SEM computer tools are outlined next. They explain why
some researchers switch from a drawing editor when first learning about SEM to working in batch mode
as they gain experience:

1. It can be tedious to draw onscreen a complex model with many variables, such as numerous
repeated measures variables in a longitudinal design or dozens of items on a questionnaire in a
factor analysis. This is because the screen quickly fills up with graphical elements. The resulting
visual clutter can make it difficult to keep track of what you are doing.
2. Specifying analyses where models are simultaneously fitted to data from two or more samples can
be difficult. This is because it may be necessary to look through different screens or windows in
order to find information about data or model specification for each sample.
3. It is easier to annotate the analysis by putting comments in a syntax file compared with working
in a graphical editor, which may not support user-­supplied comments. It is so easy to lose track of
what you have done in an analysis without detailed comments. Thus, using a drawing editor that
does not allow annotations can engender carelessness in record keeping (Little, 2013).
4. Debugging a syntax file after a failed analysis is generally simpler than doing so in a graphical
editor, which may require clicking on multiple graphical elements to inspect specifications in sepa-
rate dialogs boxes or windows. In contrast, everything that specifies the analysis (i.e., commands)
can be viewed in a single place when editing a syntax file.
5. The format of files generated in graphical editors is typically proprietary, which means that they
generally cannot be opened or edited with a different computer tool. In contrast, syntax files are
usually text (ASCII) files that can be opened or edited in any basic text editor, including word
processors. Thus, sharing an analysis with other researchers is generally easier with syntax files.
Including syntax files in supplemental materials for journal articles supports both transparency in
reporting and accessibility for readers.
6. Certain kinds of advanced analyses or options may be available in syntax only. One reason is
that although model diagrams for more basic SEM analyses are more or less standard, this is
less true for more advanced applications such as multilevel analyses, for which there is no single
graphical standard (Curran & Bauer, 2007).
7. There is evidence that graphical user interfaces can help novices who are not trained in informa-
tion technology or in computer science to learn and perform basic tasks faster and in fewer steps.
But graphical interfaces can also impede or slow down expert users compared with text-based
interfaces, and just because an interface is graphical does not guarantee ease of use or reduced
cognitive load (Chen & Zhang, 2007).
8. It seems that it would be easy to generate a publication-­quality diagram in an SEM drawing edi-
tor, but this is not exactly true. Drawing editors may use a fixed symbol set that does not include
special symbols that you want to appear in the diagram. There may be limited options for adjust-
ing the appearance of this diagram (e.g., changing font or line widths). Graphs generated by
drawing editors may be rendered in relatively low resolution that is fine for the computer monitor
but not for display in a printed (paper or virtual) document. There are R packages that can generate high-quality model diagrams, including semPlot (Epskamp, 2022), but they can be frustrating
to use: The diagram is specified in syntax, but the diagram generated by executing that syntax
cannot be directly edited, so if something is wrong, then it is back to the syntax (repeat).

Another option is a professional diagramming and graphics computer tool, such as Microsoft Visio,
that features templates and tools for drawing predefined shapes, such as circles, squares, or lines with
optional arrowheads on one or both ends. A drawback is that professional graphics software can be
relatively expensive, although there are free alternatives like LibreOffice Draw with generally more modest
capabilities. But using comprehensive drawing software to create diagrams of structural equation models
can seem like overkill. This is because only a small fraction of the program functionalities is used to create
model diagrams, which are composed of just a few kinds of graphical elements.
Here is a trade secret I’ll share with you: All model diagrams in this book were created using nothing
other than Microsoft Word Shapes (rectangles, ovals, text boxes, etc.) that are grouped together. Maybe
I am biased, but I think these diagrams are not too shoddy. Sometimes you can do a lot with a simple but
flexible tool. Yes, it takes a lot of time to make a publication-­quality model diagram in Word—or in any
graphical computer tool—but once you make a few examples, you can reuse graphical elements, such as
those for common factors and indicators, in future diagrams. Mai et al. (2022) described semdiag, a free,
open source web application for drawing model diagrams in SEM. An online version of semdiag is avail-
able at https://semdiag.psychstat.org/ There is also a free, open-­source diagramming application that can
be used online through a web browser or as a downloaded application for Windows, Apple (macOS),
Linux, or Google Chrome OS platform computers available at https://www.diagrams.net/

the model by adding parameters that reflect your hypotheses until the target model is eventually specified. If the analysis fails after adding a particular parameter, the reason may be identification.

5. Comment out instead of delete. In analyses where models are progressively simplified (trimmed) instead of built, the researcher can comment out part of the syntax, or deactivate it, by designating those commands as comments in the next analysis. This method preserves the original code as a record of the changes.

6. Lend the computer a hand. Sometimes iterative estimation fails because the computer needs better starting values, or initial estimates of model parameters, to find a converged solution. Most SEM computer tools generate their own starting values, but computer-derived values do not always lead to converged solutions. Fortunately, SEM computer tools generally allow users to specify starting values that override computer defaults. Suggestions for specifying starting values for different types of models are offered later in the book.

COMMERCIAL VERSUS FREE COMPUTER TOOLS

I am often asked, what is the best software package for SEM? My answer has two parts: (1) There is no single "best" package because there are now many capable options available. (2) What is available to you at the lowest cost? For example, many universities have site licenses that permit researchers and students to use commercial SEM computer tools free of charge. Advantages of commercial SEM programs include regular updates, user support including a single point of contact for problems, and complete program manuals and documentation, sometimes with numerous analysis examples and data sets.

Commercial applications are less advantageous for users without institutional or grant funds for software. As with other specialized software for multivariate statistical analyses, costs to purchase or license SEM computer tools can be rather expensive and can range from several hundred to over $1,000 a year or more, although discounts may be offered for academic users.


There are also researchers who simply prefer using free or open-source software. Open source means that software development is decentralized by making the source code available so that individual users can modify the software and redistribute or publish their version back to the community. Most, but not all, open-source software is free, and just because a computer tool is free does not mean that it is also open source.

Among researchers in the behavioral sciences, the best-known computing environment for data management, statistical analysis, and graphics that is both free and open source is R (R Core Team, 2022). A basic installation of R has many of the same capabilities for statistical analysis as commercial products such as SPSS or SAS/STAT, and thousands of free packages extend its range even further. This includes several packages for SEM, some of which can analyze a wide range of structural equation models with capabilities similar to those of commercial software. This is especially true for lavaan, described momentarily. Its extensive capabilities are why I used lavaan in most detailed analysis examples described in this book.

There are potential obstacles to using the noncommercial software programs just mentioned. One is documentation. Although there are articles, chapters, or even entire books written for applied researchers who use lavaan for SEM, standard R package documentation is written in a technical way that assumes knowledge of object-oriented programming. Specifically, the typical "manual" for an R package consists of an alphabetical listing of functions with terse references to objects, methods, and data types. Such documents can be pretty cryptic for researchers without strong programming skills. It is also true that numerical output in R is not always "pretty," that is, the output is formatted in ways that are not easy to read at first glance. An example is scientific notation in which very small or very large numbers are represented in a simpler form (e.g., 1.0E-4 for the number .0001). For researchers unwilling to deal with these challenges, a commercial tool may be a better option, if cost is no issue.

R PACKAGES FOR SEM

Listed in Table 5.1 are major R packages for SEM analyses with citations and descriptions.

TABLE 5.1. Major R Packages for SEM

Name            Citation                      Description

General model fitting
  sem           Fox et al. (2022)             Basic but broad capabilities for analyzing structural equation models
  lavaan        Rosseel et al. (2023)         LAtent VAriable ANalysis, estimates a wide range of models with capabilities that rival commercial tools
  OpenMx        Boker et al. (2022)           Matrix processor and numerical optimizer that can be used with multicore computers or networked computers
  piecewiseSEM  Lefcheck (2020)               Single-equation estimation of path models with a global fit test based on predicted conditional independencies

Utilities and tools
  semTools      Jorgensen et al. (2022)       Extends lavaan capabilities, includes utilities for power analysis and other special types of analyses
  systemfit     Henningsen & Hamann (2022)    Estimates simultaneous linear equations with ordinary least squares and instrumental variable methods
  bmem          Zhang & Wang (2022)           Estimators and bootstrapped confidence intervals for indirect effects with incomplete data


All packages in the table except OpenMx were used in analyses for this book. The sem package was probably the very first for analyzing a wide range of structural equation models, but there are no plans for major development beyond its current form.4 It is still quite usable, though, and it offers a range of estimators, including instrumental variable methods. The sem package can also be integrated with external R packages for multiple imputation (MI), bootstrapping, and estimating polychoric correlations, which further extend its capabilities (Fox et al., 2022).

The lavaan package for SEM (see Table 5.1) has extensive analysis capabilities and has been updated several times since it was first released in 2011. The lavaan package has estimators for continuous, binary, and ordinal data, and there are capabilities for item response theory (IRT), latent class, mixture modeling, and multilevel analyses. Modern options for handling missing data include full information maximum likelihood (FIML) and MI, but use of an external R package, such as semTools, is required for the latter. Resources for using lavaan include books (Beaujean, 2014; Gana & Broc, 2019), journal articles (Andersen, 2022; Svetina et al., 2020), a website,5 and a discussion group.6

The OpenMx package is a powerful matrix processor and numerical optimization tool with capabilities for analyzing structural equation models, multilevel models, mixture models, and models of genetic relatedness—see Table 5.1. Models can be interactively built one step at a time, which allows code debugging in smaller steps. The package supports multicore computers, where multiple central processing units in the same computer operate in parallel on an analysis, and distributed processing over separate computers working in parallel clusters. These capabilities support the analysis of extremely large data sets. Missing data are handled through the FIML method. There is an online user guide,7 and the OpenMx community wiki offers tutorials, example models, and forums.8 The syntax is relatively complex but highly flexible once mastered—see Neale et al. (2016) for examples.

The eponymously named package piecewiseSEM (Table 5.1) supports the method of piecewise SEM, described in Chapter 8. Briefly, the method features (1) single-equation (local) estimation of observed variable path models and (2) a global test of model–data correspondence (i.e., goodness-of-fit) based on Pearl's (2009) approach to nonparametric structural equation modeling covered in the next chapter. Local estimation means that the equation for each outcome variable is separately analyzed instead of the computer attempting to simultaneously estimate all model parameters, or global estimation. Piecewise SEM is best known in biology and ecology, but it offers potential benefits for researchers in the behavioral sciences, too.

Listed in the lower part of Table 5.1 are "toolbox" R packages for special kinds of analyses in SEM and other statistical techniques. The semTools package extends the capabilities of lavaan to exploratory factor analysis (EFA) or estimation of interactive effects of latent variables. The same package can also estimate power, conduct Monte Carlo simulations, correct significance test results for small sample sizes, estimate the reliability of factor measurement, and combine results from MI. It has a special function that uses the Kaiser and Dickman (1962) method to generate a raw data file of a specified size from an input covariance matrix, where all descriptive statistics of generated raw scores exactly match those of the specified covariance matrix (i.e., the scores are generated with no added sampling error). This capability is handy when the researcher has a summary covariance matrix but no raw scores, such as from a journal article, and the researcher wants to conduct a secondary analysis with a computer tool that requires raw scores. A GitHub wiki for semTools is available.9

The R package systemfit (Table 5.1) estimates systems of linear equations for observed variables. It has capabilities for OLS and instrumental variable methods. Versions of the latter method can be used for an analysis called seemingly unrelated regressions (SUR), which is actually a misnomer because the regressions for separate outcome variables are correlated due to overlapping (correlated) error terms. In contrast, error terms for multiple outcomes in standard OLS regression are assumed to be independent. There are also functions in systemfit that evaluate the overall fit of regression models to the data and conduct diagnostic tests for instrumental variable methods.

4 J. Fox (personal communication, November 20, 2020).
5 https://www.lavaan.ugent.be/
6 https://groups.google.com/g/lavaan
7 https://vipbg.vcu.edu/vipbg/OpenMx2/docs//OpenMx/latest/
8 https://openmx.ssri.psu.edu/wiki/main-page
9 https://github.com/simsem/semTools/wiki
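To give a flavor of what lavaan model syntax looks like, here is a minimal sketch (not an analysis from this book) that fits a small latent variable model to the PoliticalDemocracy data set distributed with the package; the particular factors and indicators are chosen only for illustration:

library(lavaan)

# Illustrative model: two common factors plus one structural regression
model <- '
  ind60 =~ x1 + x2 + x3        # measurement model for the first factor
  dem60 =~ y1 + y2 + y3 + y4   # measurement model for the second factor
  dem60 ~ ind60                # structural path between the factors
'
# missing = "fiml" is included only to flag the option; these data are complete
fit <- sem(model, data = PoliticalDemocracy, missing = "fiml")
summary(fit, fit.measures = TRUE, standardized = TRUE)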
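In the same spirit, a minimal piecewiseSEM sketch of local estimation, assuming a hypothetical data frame dat with variables x, y1, and y2 (both the model and the data are invented for illustration):

library(piecewiseSEM)

# Path model x -> y1 -> y2; each equation is fit separately with lm()
mod <- psem(
  lm(y1 ~ x,  data = dat),
  lm(y2 ~ y1, data = dat)
)
# summary() reports the equation-level estimates plus a global test
# based on the conditional independencies the model implies
summary(mod)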
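The data-generation capability of semTools just described can be sketched roughly as follows; if memory serves, kd() is the function that implements the Kaiser–Dickman method, and the 2 × 2 covariance matrix below is invented, so check the package documentation before relying on this exact call:

library(semTools)

# Input covariance matrix (hypothetical values)
S <- matrix(c(1.0, 0.5,
              0.5, 1.0), nrow = 2,
            dimnames = list(c("x", "y"), c("x", "y")))

# Generate n = 300 raw scores whose sample covariances reproduce S exactly
dat <- kd(S, n = 300, type = "exact")
round(cov(dat), 3)   # should match S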
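Finally, a minimal systemfit sketch of seemingly unrelated regressions, again with a hypothetical data frame dat containing outcomes y1 and y2 and predictors x1, x2, and x3:

library(systemfit)

# Two regression equations estimated jointly so that the correlation
# between their error terms is taken into account (SUR)
eqs <- list(eq1 = y1 ~ x1 + x2,
            eq2 = y2 ~ x1 + x3)
fit <- systemfit(eqs, method = "SUR", data = dat)
summary(fit)   # per-equation coefficients plus the residual correlation matrix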

The bmem package supports the generation of confidence intervals for estimates of indirect causal effects in mediation analyses when there are missing data (see the table). It also has capabilities for estimating statistical power for tests of indirect effects in a variety of mediational analyses, including ones where the intervening variable is modeled as a latent variable.


FREE SEM SOFTWARE WITH GRAPHICAL USER INTERFACES

The Ωnyx (pronounced "onyx") program for SEM runs under the Java Runtime Environment (version 1.6 or later) on Windows, Apple (macOS), or Unix/Linux platform computers (von Oertzen et al., 2015). It is a graphical computing environment for creating and analyzing structural equation models that can be freely downloaded.10 There is no native programming language in Ωnyx. Instead, the user draws the model onscreen. After associating a data file with the diagram, estimation of model parameters automatically begins. Missing data are handled by the FIML estimator. There is a post-analysis option to synthesize raw data files where cases are selected with sampling error from a hypothetical population where model parameters equal the sample estimates. The program can also automatically generate syntax for Mplus, sem, lavaan, or OpenMx that specifies the model represented in the diagram.

The JASP program is an open-source, integrated application for both traditional (frequentist) and Bayesian statistical analyses (JASP Team, 2022).11 There are versions for Windows, Macintosh, and Linux platform computers.12 It features a GUI that is intuitive even for relatively inexperienced users of statistical computer tools. Statistical capabilities include univariate and multivariate analysis of variance (ANOVA, MANOVA), regression analysis (linear, linear mixed, logistic) and EFA, among others. There is a special module for SEM, for which users enter lavaan syntax to specify the model in a special text window, but output options are controlled in the JASP user interface. The SEM module also offers separate options for analyzing mediation models and latent growth models.


COMMERCIAL SEM COMPUTER TOOLS

Listed in the top part of Table 5.2 are stand-alone commercial programs for SEM that do not require a larger computing environment. All four of these computer tools—EQS, IBM SPSS Amos (hereafter just "Amos"), LISREL, and Mplus—allow the user to work in batch mode (syntax) or specify the model through a drawing editor or templates.

10 http://onyx.brandmaier.de/
11 The acronym stands for Jeffreys's Amazing Statistics Program, named after the British mathematician Harold Jeffreys for his work on Bayesian probability theory.
12 https://jasp-stats.org/

TABLE 5.2. Major Commercial Software for SEM

Software             Environment needed    Batch (syntax)   Drawing editor   Template or menu

Stand-alone programs
EQS                  —                     ✓                ✓                ✓
IBM SPSS Amos        —                     ✓                ✓                ✓
LISREL               —                     ✓                ✓                ✓
Mplus                —                     ✓                ✓                ✓

Procedures or commands in larger environments
Builder, sem, gsem   Stata                 ✓                ✓                ✓
CALIS                SAS/STAT              ✓
SEPATH               STATISTICA            ✓                                 ✓
CFA                  SYSTAT                                                  ✓
RAMONA               SYSTAT                ✓

The drawing editors in EQS, LISREL, and Mplus automatically write program syntax in a separate window that can be edited, run, or saved. The EQS and LISREL programs can be used in all stages of the analysis from data entry and screening to exploratory analyses to SEM. The Amos and Mplus programs have somewhat more limited capabilities for manipulating raw data files, so users of these applications may elect to prepare their data using other computer tools for general statistical analyses.

There are versions of EQS (Equations) for Windows, Apple (Mac/Unix), and Linux platform computers (Bentler & Wu, 2020).13 Its syntax is straightforward and based on the Bentler–Weeks representational system, in which model parameters are regression coefficients for effects on dependent variables and the variances and covariances of independent variables when means are not analyzed. All types of models are thus set up in a consistent way. Special features of EQS include the availability of estimators for elliptical distributions with varying degrees of kurtosis, but not skew. Options for missing data include FIML for either normal or nonnormal data or a method based on the expectation–maximization (EM) algorithm. Other features include bootstrapping, the ability to correctly analyze a correlation matrix with no standard deviations, EFA capabilities with parallel analysis and bifactor rotation, and special syntax for multilevel analyses. Books by Blunch (2016) and Dunn et al. (2020) are for EQS users.

The Amos (Analysis of Moment Structures) program is for Windows platform computers and does not require the IBM SPSS environment to run (Arbuckle, 2021).14 It has two main parts, Amos Graphics, in which users control the analysis by drawing the model on the screen, and Amos Basic, its syntax editor that works in batch mode. Amos Basic is also a language interpreter and debugger for Microsoft Visual Studio VB.NET or C# ("C-sharp"). Users with programming experience can write VB.NET or C# scripts that modify the functionality of Amos Graphics. Other utilities include a file manager, a random seed manager for bootstrapping, and viewers for data and output files. The FIML method is used for incomplete data files. Amos has additional capabilities for Bayesian estimation, including the generation of graphical posterior distributions. Amos can also analyze mixture models with latent categorical variables either with training data, where some cases are already classified but not the rest, or without training data. Books by Blunch (2013), Byrne (2016), and Collier (2020) support Amos users.

The LISREL (Linear Structural Relations) program is the forerunner to all SEM computer programs.15 Available for Windows platform computers, LISREL is actually a suite of applications that includes PRELIS ("pre LISREL"), which generates and prepares raw data files for analysis. It also has capabilities for diagnosing missing data patterns, MI, bootstrapping, and Monte Carlo simulation (Jöreskog & Sörbom, 2021). The FIML method for missing data is also available in LISREL. There are additional bundled applications for analyzing multilevel models and for fitting generalized linear models to data from complex survey designs. There are two LISREL programming languages, its original syntax based on matrix algebra and SIMPLIS ("simple LISREL"), which is not based on matrix algebra nor does it require familiarity with the classic syntax except to specify certain output options. The classic syntax is not easy to use until one has memorized the whole system, but it is efficient: One can specify a complex model with relatively few lines of code. Some advanced capabilities, such as nonlinear parameter constraints, are unavailable when using the SIMPLIS language. Recent books about LISREL include Jöreskog et al. (2016) and Viera (2011).

The Mplus program for Windows, Apple (macOS), and Linux platform computers is divided into a base program and three optional add-on modules (Muthén & Muthén, 1998–2017).16 The Base Program for SEM can analyze models with outcomes that are any combination of dichotomous, nominal, ordinal, censored, or count variables. It can also analyze discrete- and continuous-time survival models. There are capabilities for conducting exploratory factor analysis, bootstrapping, Monte Carlo simulation, Bayesian estimation, and IRT analyses. Special syntax supports the specification of sampling weights—also called survey weights—that correct for systematic differences in probability sampling between target and sample proportions of cases with specific demographic or other characteristics. There is also special syntax for specifying latent growth models and the analysis of indirect causal effects.

13 https://mvsoft.com/
14 https://www.ibm.com/products/structural-equation-modeling-sem
15 https://ssicentral.com/
16 https://www.statmodel.com/


Both MI and FIML methods are available for handling missing data under the assumption of MAR. Special methods in Mplus for data missing not at random (MNAR) are described in Chapter 9.

The Multilevel Add-On estimates multilevel versions of the kinds of models analyzed in the Base Program, and the Mixture Model Add-On estimates mixture model versions, where the data are assumed to be sampled from a mix of subpopulations that correspond to levels of a latent categorical variable. The Combination Add-On contains all the features of the multilevel and mixture model analyses just mentioned. Of all SEM computer tools, Mplus can analyze perhaps the widest range of statistical models, and some of the very newest analysis capabilities can show up first in Mplus. Recent books about Mplus include Finch and Bolin (2017), Geiser (2021), Heck and Thomas (2015), and Wickrama et al. (2022). See Geiser (2023) for examples of SEM analyses conducted with both Mplus and lavaan.

Listed in the bottom part of Table 5.2 are SEM procedures or functions within larger software environments. Builder is the drawing editor for SEM in Stata, and the commands sem and gsem are for specifying models in syntax (StataCorp, 1985–2021).17 The sem command analyzes models with continuous outcomes, and the gsem (generalized SEM) command analyzes outcomes that are continuous, dichotomous, categorical (ordered or unordered), count, or censored variables. The gsem command also has capabilities for multilevel modeling in an SEM framework and for analyzing models based on IRT. Stata automatically generates the corresponding syntax for Builder diagrams, which can be edited and saved as a text file. Special symbols in Builder designate the underlying distribution (Gaussian, Poisson, etc.) for observed variables and the corresponding link function (logit, probit, etc.). Missing data are handled by FIML. The book by Acock (2013) covers SEM in Stata.

The procedure CALIS (Covariance Analysis of Linear Structural Equations) in SAS/STAT for Windows and Unix platform computers (SAS Institute, 2022) is for SEM.18 It works in batch mode, and users can specify their models using one of seven different programming languages, including LISMOD, a matrix-based syntax that corresponds to LISREL's original language, and LINEQS, an equation-based syntax related to EQS's original language, among others. The missing data method in CALIS is FIML, but MI is also available through the larger SAS/STAT environment. The diagram for the analyzed model can be drawn onscreen, but the researcher must specify the diagram in syntax. O'Rourke and Hatcher (2013) describe examples of SEM analyses in SAS/STAT.

J. Steiger's SEPATH (Structural Equation modeling and Path Analysis) is the SEM module in Statistica, an integrated environment for data visualization, simulation, and statistical analysis.19 There is a desktop version for Windows platform computers (TIBCO Statistica, 2022) and an enterprise version with companywide server support. Models are specified using the PATH1 programming language based on a representational system for SEM by McArdle and McDonald (1984) introduced in Chapter 7. There are also template-based options that are preprogrammed sequences of graphical dialog boxes for specifying common types of structural equation models—see Table 5.2. Special features include the capabilities to correctly analyze a correlation matrix with no standard deviations, generate simulated random samples for Monte Carlo studies, and precisely control parameter estimation. A separate power analysis module (also by J. Steiger) estimates the power of various significance tests in SEM.

There are two SEM procedures in SYSTAT for Windows platform computers (Systat Software Inc., 2018).20 The user interacts with RAMONA (Reticular Action Model or Near Approximation) by submitting batch files in the general SYSTAT environment. Syntax for RAMONA is relatively straightforward and involves only two parameter matrices, one for covariances between independent variables and the other for direct effects on dependent variables. A second procedure is CFA, where the user specifies measurement models and analysis options through graphical dialogs or templates (Table 5.2). A special feature of both procedures just described is the ability to correctly fit a model to a correlation matrix only. There is a "Restart" command that automatically takes parameter estimates from a prior analysis as starting values in a new analysis. This capability is convenient when evaluating whether a complex model is identified.

17 https://www.stata.com/
18 https://www.sas.com/
19 https://www.statistica.com/
20 https://systatsoftware.com/


There are relatively few other advanced features for SEM in SYSTAT including the capability to simultaneously fit a model to data from multiple groups.


SEM RESOURCES FOR OTHER COMPUTING ENVIRONMENTS

Resources for conducting SEM analyses in more specialized computing environments are listed next:

1. The freely available semopy (Structural Equation Models Optimization in Python) package for the object-oriented programming language Python relies on lavaan-like syntax, features relatively fast processing times, and has capabilities for imputation of missing values, analysis of ordinal outcomes, and estimation of random coefficients models (Igolkina & Meshcheryakov, 2020).21 A second edition of semopy has recently become available (Meshcheryakov et al., 2021).

2. Mathematica is a software system for technical and symbolic computation with capabilities for interfacing with programs written in other languages (Wolfram Research, 2022).22 Oldenburg (2020) describes Mathematica code for estimating models in both the traditional way through covariance matrix-based computations and through methods based on least squares optimization that can also be applied to nonlinear models.

3. The Matlab program is a computing environment and programming language for data analysis, visualization, and simulation (Mathworks, 2022).23 Williams (2021) describes the Toolbox for SEM, a set of functions for estimating structural equation models with continuous outcomes.

21 https://semopy.com/
22 https://www.wolfram.com/
23 https://www.mathworks.com/


SUMMARY

Modern SEM computer tools are generally no more difficult to use than other computer programs for multivariate statistical analyses. The capability to specify a model by drawing it onscreen helps beginners to be productive right away, but with experience they may find that specifying models in syntax is actually more straightforward. Problems can be expected in the analysis of complex models, and no amount of user-friendliness in the interface of a computer tool can negate this fact. When things in the analysis go wrong, you need, first, to have a good understanding of the problem and, second, basic computer skills to correct the problem. You should not let ease of computer tool use lead you to carry out unnecessary analyses or select analytical methods or options you do not understand. The concepts and tools covered in Part I of this book set the stage for considering the specification and analysis of basic kinds of structural equation models in Part II.



Part II

Specification, Estimation, and Testing

6

Nonparametric Causal Models

The principal concepts in Pearl’s (2009) nonparametric approach to SEM (i.e., the structural causal model)
are introduced in this chapter. A causal model as described next corresponds to a directed acyclic graph
(DAG), which depicts hypotheses of unidirectional causation or temporal ordering among variables in graph-
ical form. A DAG does not allow for causal loops, where ≥ 2 variables are specified to have direct or indirect
effects on each other. In contrast, a directed cyclic graph (DCG) includes at least one causal loop, but we
will deal with such models later in the book. A directed graph as a nonparametric causal model implies two
things: (1) The researcher makes no commitment to distributional assumptions for any variable. (2) A direct
causal effect represents all forms of the functional relation between a putative cause and effect. If variables
X and Y are both continuous, for example, the specification X → Y in a nonparametric model represents the
linear and all curvilinear trends of the causal effect of X. But in parametric causal models—introduced in the
next chapter—the specification X → Y represents just the linear trend for continuous variables.
Another difference is that there are methods and computer programs for analyzing directed acyclic
graphs before any data are collected. Such analyses can alert the researcher to the presence of confounding,
including whether it would be necessary to measure additional variables in order to estimate any specific
causal effect with less bias. It is easier to address the problem of omitted variables when the study is being
planned than after the data are collected. Even if no new variables are required, analysis of the graph can
indicate options for estimating a target causal effect, including which covariates or instruments to select, in
the presence of confounding. A tutorial on instruments (i.e., instrumental variables) is offered in this chapter.
Analysis of a DAG can also help researchers to find testable implications of their causal hypotheses. No
special software is needed to analyze the data. This is because testable implications for linear models with
continuous variables can be evaluated using partial correlations, which can be estimated with standard com-
puter tools for statistical analysis. Thus, researchers who understand nonparametric causal models are better
prepared for all stages of SEM, from model specification through data collection to analysis and reporting
of the results. So if the ideas considered next seem at first unfamiliar, trust me, it is worth persevering. The
payoff will be apparent in later chapters when we apply the ideas outlined next.
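For readers who want a concrete preview of what such pre-data analysis looks like, here is a hedged sketch using the free R package dagitty; the graph and the variable names (a cause X, an outcome Y, a measured confounder Z, and a candidate instrument W) are invented purely for illustration:

library(dagitty)

# A hypothetical DAG: X causes Y, Z confounds X and Y, W affects Y only through X
g <- dagitty("dag {
  W -> X
  Z -> X
  Z -> Y
  X -> Y
}")

impliedConditionalIndependencies(g)                      # testable implications of the graph
adjustmentSets(g, exposure = "X", outcome = "Y")         # covariate sets that remove confounding
instrumentalVariables(g, exposure = "X", outcome = "Y")  # candidate instruments for X -> Y

Analyses like these can be run before a single case is collected, which is exactly the point made in the preceding paragraphs.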

GRAPH VOCABULARY AND SYMBOLISM

Variables in directed graphs are also called nodes or vertices. Some variables are connected by arcs, also known as edges or links, that designate presumed functional or statistical dependencies in the graph. A pair of variables is adjacent if they are connected by an edge; otherwise, that pair is nonadjacent. An arrow or directed edge represents a presumed direct causal effect between two variables, such as

X → Y    (6.1)


where X is a presumed cause of Y. In contrast, a bidirectional edge, often symbolized with an arc rendered as a dashed line instead of a solid line with arrowheads at each end, such as

X ↔ Y    (6.2)

designates a spurious (noncausal) association between X and Y due to ≥ 1 unmeasured (latent) common causes. An alternative symbolism in a causal DAG is

X ← UC → Y    (6.3)

where UC explicitly represents all unmeasured common causes of X and Y, and the arrows are rendered as dashed lines because the corresponding causal effects on both X and Y are not directly observed.

By convention, error terms for outcome variables are not included in a causal DAG. This is in part because it is assumed that all variables, causal or outcome, will have idiosyncratic error due to unobserved factors that can vary over time or units (cases, settings, regions, etc.). Also, error terms for outcome variables are implied by arrows in the graph that point to them from other variables, such as in Equation 6.1, where Y is the outcome. The hypothesis of overlapping or correlated error is represented by Equation 6.2, where a bidirectional edge connects two variables X and Y (Equation 6.3 is an alternative specification). Correlated errors are always explicitly represented in a DAG; otherwise, error terms for outcomes are assumed to be independent.

The direct causes of a variable in a DAG are its parents, and all direct or indirect causes of a variable are its ancestors. All variables directly caused by a given variable are its children, and its descendants include all variables directly or indirectly caused by that same variable. All parents in a DAG are ancestors just as all children are descendants. A variable with no parents is exogenous, and a variable with at least one parent is endogenous, just as in parametric structural equation models.

A path is a sequence of adjacent edges that connect ≥ 2 variables regardless of the directions of those edges (i.e., unidirectional or bidirectional). It passes through any variable along the path just once. In a directed path, all edges are unidirectional arrows that point away from a cause toward an outcome at the end of the path through possibly ≥ 1 intervening variables (i.e., indirect effects). Thus, directed paths transmit causal effects from variables at the beginning of the path to "downstream" variables at the end of the path. A directed path is also called a front-door path because it starts with an arrow pointing away from the cause, and all subsequent arrows in the path are oriented in the same direction. An undirected path is a path where the arrows do not all point in the same direction. Undirected paths might convey statistical association, but not causation between variables at either end of the path. The goal of specifying a causal DAG is the same as for any parametric model in SEM: The graph represents all hypothesized connections, causal or noncausal, between any pair of variables in the model.


CONTRACTED CHAINS AND CONFOUNDING

Presented in Figure 6.1(a) is the smallest causal structure, the contracted chain X → Y (i.e., Equation 6.1). The two variables in a contracted chain are unconditionally dependent because there are no intervening variables that could disrupt or block the causal coordination between them. Represented in Figure 6.1(a) is the total causal effect—hereafter called just the total effect—of X on Y. Besides the hypothesis about directionality, the figure also assumes there are no unmeasured confounders, or latent common causes of both X and Y. A variation on this assumption is that all omitted causes of Y are uncorrelated with X. For this reason, variable X can be described as an exogenous regressor in the prediction of Y.

If the assumptions for Figure 6.1(a) just stated are correct, the coefficient from the regression of Y on X would estimate the total effect without bias. Because the model is nonparametric, though, no particular regression technique can be specified. For instance, if both X and Y were continuous and their relation strictly linear, the method would be bivariate ordinary least squares (OLS) regression. But if Y were dichotomous, the regression method could be logistic regression or probit regression and, if time-to-event data were also collected for varying risk periods, a proportional hazards model, such as a Cox model, could be specified, among other possibilities for binary outcomes (Lee et al., 2009). The point is that the choice of regression technique depends on assumptions about the functional form of the causal effect and on the level of measurement for both X and Y, but nonparametric models are not concerned with such details.
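To make that last point concrete, here is a hedged R sketch (with an invented data frame dat) of how the same nonparametric specification X → Y maps onto different estimators once assumptions about measurement and functional form are added:

# Continuous Y with a strictly linear relation: bivariate OLS regression
fit_ols <- lm(y ~ x, data = dat)

# Dichotomous Y: logistic regression (probit is family = binomial(link = "probit"))
fit_logit <- glm(y ~ x, family = binomial, data = dat)

# Time-to-event Y observed over varying risk periods: Cox proportional hazards model
library(survival)
fit_cox <- coxph(Surv(time, status) ~ x, data = dat)

In every case the target is the same total effect of X on Y; only the statistical machinery used to estimate it changes.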
