FIFTH EDITION

PRINCIPLES and PRACTICE of
STRUCTURAL EQUATION MODELING

Rex B. Kline


Methodology in the Social Sciences
David A. Kenny, Founding Editor
Todd D. Little, Series Editor
www.guilford.com/MSS

This series provides applied researchers and students with analysis and research design books that
emphasize the use of methods to answer research questions. Rather than emphasizing statistical
theory, each volume in the series illustrates when a technique should (and should not) be used and
how the output from available software programs should (and should not) be interpreted. Common
pitfalls as well as areas of further development are clearly articulated.

RECENT VOLUMES

MEASUREMENT THEORY AND APPLICATIONS FOR THE SOCIAL SCIENCES


Deborah L. Bandalos

CONDUCTING PERSONAL NETWORK RESEARCH: A PRACTICAL GUIDE


Christopher McCarty, Miranda J. Lubbers, Raffaele Vacca, and José Luis Molina

QUASI-EXPERIMENTATION: A GUIDE TO DESIGN AND ANALYSIS


Charles S. Reichardt

THEORY CONSTRUCTION AND MODEL-BUILDING SKILLS:


A PRACTICAL GUIDE FOR SOCIAL SCIENTISTS, SECOND EDITION
James Jaccard and Jacob Jacoby

LONGITUDINAL STRUCTURAL EQUATION MODELING WITH Mplus:


A LATENT STATE–TRAIT PERSPECTIVE
Christian Geiser

COMPOSITE-BASED STRUCTURAL EQUATION MODELING:


ANALYZING LATENT AND EMERGENT VARIABLES
Jörg Henseler

BAYESIAN STRUCTURAL EQUATION MODELING


Sarah Depaoli

INTRODUCTION TO MEDIATION, MODERATION, AND CONDITIONAL


PROCESS ANALYSIS: A REGRESSION-BASED APPROACH, THIRD EDITION
Andrew F. Hayes

THE THEORY AND PRACTICE OF ITEM RESPONSE THEORY, SECOND EDITION


R. J. De Ayala

APPLIED MISSING DATA ANALYSIS, SECOND EDITION


Craig K. Enders

PRINCIPLES AND PRACTICE OF STRUCTURAL EQUATION MODELING, FIFTH EDITION


Rex B. Kline



Principles and Practice of
Structural Equation Modeling
FIFTH EDITION

Rex B. Kline

Series Editor’s Note by Todd D. Little

THE GUILFORD PRESS


New York  London



Copyright © 2023 The Guilford Press
A Division of Guilford Publications, Inc.
370 Seventh Avenue, Suite 1200, New York, NY 10001
www.guilford.com

All rights reserved

No part of this book may be reproduced, translated, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the publisher.

Printed in the United States of America

This book is printed on acid-free paper.

Last digit is print number: 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data is available from the publisher.

ISBN 978-1-4625-5191-0 (paperback) — ISBN 978-1-4625-5200-9 (hardcover)



For my family—
Joanna, Julia Anne, and Luke Christopher



For me, it is far better to grasp the Universe as it really is
than to persist in delusion, however satisfying and reassuring.
—Carl Sagan (1996)



Series Editor’s Note

The new version of Rex Kline’s best-selling (90,000+) and well-cited book is here! As with earlier editions, this fifth edition retains all the wonderful features of his influential earlier editions and he continues to seamlessly integrate recent advances in structural equation modeling (SEM). Rex is a true scholar of all things SEM and has an acclaimed gift for communicating complex statistical concepts in language that all readers can grasp. The accessible style of writing and the many pedagogical features of his book (e.g., annotated reading lists, exercises with answers, topic boxes, the comprehensive companion website) make it a “must have” for any user or consumer of SEM. It is a resource that keeps improving and expanding with each new edition. It is a resource that I recommend first when I am asked what book do I recommend. This recommendation holds for beginners and experienced users alike.

As a scholar of modern statistical practice and techniques, Rex has studied all the developments and advances in the world of SEM generally, as well as “hot” topics like Pearl’s structural causal modeling. His coverage of Pearl’s graph theoretic approach to causal reasoning, as many of the reviewers also note, is both easy to understand and comprehensive. It’s so good, he ought to get a prize for best in presentation! In this new edition, he brings in a third perspective on SEM, namely, the partial least squares point of view (i.e., composite SEM). Other additions to this masterpiece of pedagogy include a new chapter on fundamental regression concepts and psychometric principles that includes a self-test and primers on these topics plus significance testing. This chapter alone will become a core piece of my own SEM courses to ensure students are ready to learn SEM. Other additions include new chapters on SEM analyses in small samples and on mediation analysis, American Psychological Association reporting standards, standards for testing measurement invariance, principles of model comparison (both nested and non-nested), extended confirmatory factor analysis models (e.g., multitrait–multimethod models), and exploratory bifactor models.

Although Rex suggests in his introduction that no single book can cover all of SEM, his book is about as thorough as they come. His didactic approach is refreshing and engaging and the breadth and depth of material covered is simply impressive. As he notes and you will feel, Rex is a researcher talking to you as a fellow researcher, carefully explaining in conceptually driven terms the logic and principles that underlie the world of SEM. The wealth of examples provides an entry for researchers across a broad array of disciplines. This book will speak to you regardless of your field, discipline, or specific area of expertise.

As always, the support materials that Rex provides are thorough: he provides all the necessary files (data, syntax, and output) for each of the book’s detailed examples in simple text files that can be opened with any basic text editor so that you can reproduce the analyses in six different software packages! The syntax files are annotated with extensive comments, and links to related webpages are given. The Appendix material is a treasure trove of useful building blocks, from the elements of LISREL notation, to practical advice, to didactic understanding of complex ideas and procedures. Rex has assembled real-world examples of troublesome data, to demonstrate how to handle the analysis problems that inevitably pop up. These features were a mainstay of earlier editions, but they have now been expanded to cover even more topics. Rex bookends all this material with an introductory chapter that truly sets the stage for the journey through the land of SEM and a concluding chapter that covers very practical best-practice advice for every step along the way.

Enjoy Rex Kline’s elegant work for a fifth time! Everything about his classic work is improved and expanded. It is a new edition that will become well used by me and my students as well as you and yours.

Todd D. Little
Lubbock, Texas


Acknowledgments

It is a myth that a single person writes a book. The truth is that an author is blessed to work with many talented people who all together bring the project to a successful conclusion. Listed next are the reviewers of earlier chapter drafts, whose names were revealed to me only after the writing was complete, and their original comments were never associated with their names. That is, I do not know who wrote what, but what I do know is that their input was invaluable in revising the fifth edition. Thank you very much to all of you for your time, insight, and effort to communicate your impressions about various chapters:

• Stephanie Castro, Department of Management, College of Business, Florida Atlantic University
• Kyle Cox, Department of Educational Leadership, Cato College of Education, University of North Carolina at Charlotte
• Ting Dai, Department of Educational Psychology, College of Education, University of Illinois Chicago
• Naomi Ekas, Department of Psychology, College of Science and Engineering, Texas Christian University
• Jam Khojasteh, School of Educational Foundations, Leadership, and Aviation, College of Education and Human Sciences, Oklahoma State University
• D. Betsy McCoach, Department of Educational Psychology, Neag School of Education, University of Connecticut
• Mijke Rhemtulla, Department of Psychology, College of Letters and Science, University of California, Davis
• Felix Thoemmes, Department of Human Development, College of Human Ecology, Cornell University
• Yanyun Yang, Department of Educational Psychology and Learning Systems, College of Education, Florida State University

Special thanks go to Edward Rigdon, of the Department of Marketing in the Robinson College of Business at Georgia State University, and also to Jörg Henseler, of the Department of Design, Production, and Management in the Faculty of Engineering Technology at the University of Twente, Enschede, The Netherlands, for their reviews of draft chapters about, respectively, multiple indicators (Chapter 13) and composite SEM (Chapter 16). I am a long-time admirer of Ed’s thought-provoking and accessible work on structural equation modeling and multiple-indicator methods, among other issues in multivariate data analysis. I became acquainted with Jörg’s work more recently, and he is among a new generation of scholars who are modernizing and expanding structural equation modeling techniques for analyzing composites as proxies for theoretical concepts for which classical measurement theory is ill suited. Input from both of these scholars was extremely helpful.

It was a great privilege to work once again with the Methodology and Statistics publisher and the Developmental Psychology and Geography senior editor at The Guilford Press, C. Deborah Laughton, who is brilliant, focused, organized, and supportive all at the same time. She continues to know exactly the kind of feedback an author needs at a particular moment—sometimes before the author themself knows exactly how to move forward. I am not the first author—nor will I be the last—to commend her ability to keep everything on track and in good perspective. It was also a pleasure to work again with Katherine Sommer, Development Editor, who screened the original word processing files, and with William Meyer, Production Editor, who together with copyeditor Gerry Fahey improved the presentation and format of the book prior to its printing. The cover was designed by Robin Lenowitz, Assistant Art Director, who created an elegant and striking design for this new edition. And it is again an honor to be part of the Methodology in the Social Sciences series edited by Todd D. Little of the Educational Psychology and Leadership Department in the College of Education at Texas Tech University.

Last, but always first in my heart, once again deep gratitude to my family—wife, Joanna, and our children, Julia and Luke, now beginning their own lives as adults (they grew up along with previous editions of this book)—for all their love, support, and patience. Clearly, with all this support, any limitations in this new edition are mine alone.


Contents

Introduction 1
What’s New / 1
Book Website / 2
Pedagogical Approach / 3
Principles > Software / 3
Symbols and Notation / 3
Enjoy the Ride / 3
Plan of the Book / 3

PART I. CONCEPTS, STANDARDS, AND TOOLS


1 • Promise and Problems 7
Preparing to Learn SEM / 7
Definition of SEM / 9
Basic Data Analyzed in SEM / 9
Family Matters / 10
Pedagogy and SEM Families / 14
Sample Size Requirements / 15
Big Numbers, Low Quality / 16
Limits of This Book / 18
Summary / 18
Learn More / 18


2 • Background Concepts and Self‑Test 19


Uneven Background Preparation / 19
Potential Obstacles to Learning about SEM / 20
Significance Testing / 23
Measurement and Psychometrics / 24
Regression Analysis / 25
Summary / 26
Self‑Test / 27
Scoring Criteria / 28

3 • Steps and Reporting 32


Basic Steps / 32
Optional Steps / 38
Reporting Standards / 39
Reporting Example / 41
Summary / 44
Learn More / 44

4 • Data Preparation 46
Forms of the Input Data / 46
Positive Definiteness / 48
Missing Data / 49
Classical (Obsolete) Methods for Incomplete Data / 54
Modern Methods for Incomplete Data / 55
Other Data Screening Issues / 56
Summary / 62
Learn More / 62
Exercises / 63
APPENDIX 4.A. Steps of Multiple Imputation 64

5 • Computer Tools 67
Ease of Use, Not Suspension of Judgment / 67
Human–Computer Interaction / 68
Tips for SEM Programming / 68
Commercial versus Free Computer Tools / 70
R Packages for SEM / 71
Free SEM Software with Graphical User Interfaces / 73
Commercial SEM Computer Tools / 73
SEM Resources for Other Computing Environments / 76
Summary / 76


PART II. SPECIFICATION, ESTIMATION, AND TESTING


6 • Nonparametric Causal Models 79
Graph Vocabulary and Symbolism / 79
Contracted Chains and Confounding / 80
Covariate Selection / 81
Instrumental Variables / 82
Conditional Independencies and Other Types of Bias / 84
Principles for Covariate Selection / 88
d‑Separation and Basis Sets / 89
Graphical Identification Criteria / 92
Detailed Example / 96
Summary / 98
Learn More / 99
Exercises / 99

7 • Parametric Causal Models 100


Model Diagram Symbolism / 100
Diagrams for Contracted Chains and Assumptions / 103
Confounding in Parametric Models / 105
Models with Correlated Causes or Indirect Effects / 106
Recursive, Nonrecursive, and Partially Recursive Models / 109
Detailed Example / 111
Summary / 111
Learn More / 112
Exercises / 112
APPENDIX 7.A. Advanced Topics in Parametric Models 113

8 • Local Estimation and Piecewise SEM 117


Rationale of Local Estimation / 117
Piecewise SEM / 118
Detailed Example / 120
Summary / 130
Learn More / 130
Exercises / 130

9 • Global Estimation and Mean Structures 131


Simultaneous Methods and Error Propagation / 131
Maximum Likelihood Estimation / 132
Default ML / 135
Analyzing Nonnormal Data / 137
Robust ML / 138
FIML for Incomplete Data versus Multiple Imputation / 138
Alternative Estimators for Continuous Outcomes / 140
Fitting Models to Correlation Matrices / 141
Healthy Perspective on Estimators and Global Estimation / 142


Detailed Example / 142


Introduction to Mean Structures / 147
Précis of Global Estimation / 150
Summary / 151
Learn More / 151
Exercises / 151
APPENDIX 9.A. Types of Information Matrices and Computer Options 153
APPENDIX 9.B. Casewise ML Methods for Data Missing Not at Random 155

10 • Model Testing and Indexing 156


Model Testing / 156
Model Chi‑Square / 156
Scaled Chi‑Squares and Robust Standard Errors for Nonnormal Distributions / 161
Model Fit Indexing / 163
RMSEA / 166
CFI / 168
SRMR / 169
Thresholds for Approximate Fit Indexes / 170
Recommended Approach to Fit Evaluation / 172
Global Fit Statistics for the Detailed Example / 173
Power and Precision / 174
Summary / 177
Learn More / 179
Exercises / 179
APPENDIX 10.A. Significance Testing Based on the RMSEA 180

11 • Comparing Models 182


Nested Models / 182
Building and Trimming / 183
Empirical versus Theoretical Respecification / 184
Chi‑Square Difference Test / 184
Modification Indexes and Related Statistics / 187
Intelligent Automated Search Strategies / 188
Model Building for the Detailed Example / 188
Comparing Nonnested Models / 190
Equivalent Models / 194
Coping with Equivalent or Nearly Equivalent Models / 196
Summary / 198
Learn More / 199
Exercises / 199
APPENDIX 11.A. Other Types of Model Relations and Tests 200

12 • Comparing Groups 203


Issues in Multiple‑Group SEM / 204
Detailed Example for a Path Model of Achievement and Delinquency / 205
Tests for Conditional Indirect Effects Over Groups / 211


Summary / 212
Learn More / 213
Exercises / 213

PART III. MULTIPLE‑INDICATOR APPROXIMATION OF CONCEPTS

13 • Multiple-Indicator Measurement 217


Concepts, Indicators, and Proxies / 218
Reflective Measurement and Effect Indicators / 219
Causal–Formative Measurement and Causal Indicators / 220
Composite Measurement and Composite Indicators / 221
Mixed‑Model Measurement / 222
Considerations in Selecting a Measurement Model / 223
Cautions on Formative Measurement / 224
Alternative Measurement Models and Approaches / 225
Summary / 227
Learn More / 228

14 • Confirmatory Factor Analysis 229


EFA versus CFA / 229
Suggestions for Selecting Indicators / 231
Basic CFA Models / 232
Other Methods for Scaling Factors / 234
Detailed Example for a Basic CFA Model of Cognitive Abilities / 236
Respecification of CFA Models / 243
Estimation Problems / 246
Equivalent CFA Models / 249
Special Tests with Equality Constraints / 251
Models for Multitrait–Multimethod Data / 252
Second‑Order and Bifactor Models with General Factors / 255
Summary / 258
Learn More / 259
Exercises / 259
APPENDIX 14.A. Identification Rules for Correlated Errors or Multiple Loadings 260

15 • Structural Regression Models 263


Full SR Models / 263
Two‑Step Modeling / 265
Other Modeling Strategies / 268
Detailed Example of Two‑Step Modeling in a High‑Risk Sample / 269
Partial SR Models with Single Indicators / 274
Example for a Partial SR Model / 278
Summary / 281
Learn More / 283
Exercises / 283


16 • Composite Models 284


Modern Composite Analysis in SEM / 285
Disambiguation of Terms / 285
Special Computer Tools / 288
Motivating Example / 289
Alternative Composite Model / 294
Partial Least Squares Path Modeling Algorithm / 297
PLS‑PM Analysis of the Composite Model / 300
Henseler–Ogasawara Specification and ML Analysis / 301
Summary / 304
Learn More / 305
Exercises / 305

PART IV. ADVANCED TECHNIQUES


17 • Analyses in Small Samples 309
Suggestions for Analyzing Common Factor Models / 309
Analysis of a Common Factor Model in a Small Sample / 311
Controlling Measurement Error in Manifest‑Variable Path Models / 315
Adjusted Test Statistics for Small Samples / 316
Bayesian Methods and Regularized SEM / 317
Summary / 318
Learn More / 318
Exercises / 318

18 • Categorical Confirmatory Factor Analysis 319


Basic Estimation Options for Categorical Data / 319
Overview of Continuous/Categorical Variable Methodology / 320
Latent Response Variables and Thresholds / 321
Polychoric Correlations / 321
Measurement Model and Diagram / 323
Methods to Scale Latent Response Variables / 323
Estimators, Adjusted Test Statistics, and Robust Standard Errors / 324
Models with Continuous and Ordinal Indicators / 325
Detailed Example for Items about Self‑Rated Depression / 325
Other Estimation Options for Categorical CFA / 327
Item Response Theory and CFA / 329
Summary / 329
Learn More / 330
Exercises / 330

19 • Nonrecursive Models with Causal Loops 331


Causal Loops / 331
Assumptions of Causal Loops / 333


Identification Requirements / 333


Respecification of Nonrecursive Models That Are Not Identified / 336
Order Condition and Rank Condition / 337
Detailed Example for a Nonrecursive Partial SR Model / 338
Blocked‑Error R2 for Nonrecursive Models / 344
Summary / 345
Learn More / 345
Exercises / 346
APPENDIX 19.A. Evaluation of the Rank Condition 347

20 • Enhanced Mediation Analysis 349


Mediation Analysis in Cross‑Sectional Designs / 350
Effect Sizes for Indirect Effects / 353
Cross‑Lag Panel Designs for Mediation / 356
Conditional Process Analysis / 358
Causal Mediation Analysis Based on Nonparametric Models and Counterfactuals / 360
Reporting Standards for Mediation Studies / 368
Summary / 371
Learn More / 371
Exercises / 371

21 • Latent Growth Curve Models 372


Basic Latent Growth Models / 372
Data Set for Analyzing Basic Growth Models with No Covariates / 374
Example Analyses of Basic Growth Models / 379
Example for a Growth Predictor Model with Time‑Invariant Covariates / 382
Practical Suggestions for Latent Growth Modeling / 385
Extensions of Latent Growth Models / 385
Summary / 389
Learn More / 390
Exercises / 390
APPENDIX 21.A. Unequal Measurement Intervals and Options for Defining the Intercept 391

22 • Measurement Invariance 393


Levels of Invariance / 395
Analysis Decisions / 398
Partial Measurement Invariance / 401
Detailed Example for a Two‑Factor Model of Divergent Thinking / 402
Practical Suggestions for Measurement Invariance Testing / 408
Measurement Invariance Testing in Categorical CFA / 409
Other Statistical Approaches to Estimating Measurement Invariance / 410
Summary / 412
Learn More / 412
Exercises / 413


23 • Best Practices in SEM 414


Resources / 414
Bottom Lines and Statistical Beauty / 414
Mightily Distinguish Your Work (Be a Hero) / 415
Family Relations / 416
Specification / 417
Identification / 418
Measures / 419
Sample and Data / 419
Estimation / 420
Respecification / 422
Tabulation / 422
Interpretation / 423
Summary / 424
Learn More / 424

 
Suggested Answers to Exercises 425

  References 441

  Author Index 471

  Subject Index 479

  About the Author 494

The companion website (www.guilford.com/kline-materials) supplies data, annotated syntax, and output for the book’s examples, in files that can be opened with any basic text editor, as well as primers on significance testing, regression, and psychometrics.

Introduction

It is both a pleasure and honor to introduce the fifth edition of this book. Like the previous editions, structural equation modeling (SEM) is presented in an accessible way for readers without strong quantitative backgrounds. Included in this edition are many new examples of SEM applications in disciplines that include health, political science, international studies, cognitive neuroscience, developmental psychology, sport and exercise, and psychology, among others. Some examples were selected due to technical problems in the analysis, but such examples provide a context for discussing how to deal with challenges that can and do occur in SEM, especially in samples that are not large. So not all applications of SEM described in this book are trouble free, but neither are actual research problems.

WHAT’S NEW

The many changes in this edition are intended to enhance the pedagogical presentation and cover recent developments. The biggest changes are summarized next:

1. The fourth edition of this book was one of the first introductory works to incorporate Judea Pearl’s nonparametric approach to SEM, also called the structural causal model (SCM), into the larger SEM family that dates to the development of path analysis by Sewall Wright in the 1920s–1930s and to the publication of LISREL III in 1976 as the first widely available computer program for covariance structure analysis, also called covariance-based SEM. In the same tradition, this fifth edition includes composite SEM, also referred to as partial least squares path modeling or variance-based SEM, as the third full member of the SEM family. Composite SEM has developed from a set of methods seen in the 1980s–1990s as more suitable for exploratory research that emphasized prediction over explanation to a suite of full-fledged modeling techniques for exploratory or confirmatory analyses, including theory testing. Both the SCM and composite SEM offer unique perspectives on causal modeling that can benefit researchers more familiar with traditional, covariance-based SEM. This means that researchers acquainted with all three members of the SEM family can test a wider range of hypotheses about measurement and causation. I try to make good on this promise throughout the fifth edition.

2. Traditional SEM and composite SEM are described within Edward Rigdon’s concept proxy framework that links data with theoretical concepts through proxies, which approximate concepts based on
correspondence rules—also called auxiliary theory—about presumed causal directionality between concepts and data. This point refers to the distinction between reflective measurement, where proxies for latent variables are common factors, and formative measurement, where proxies for emergent variables are composites of observed variables. The choice between the two measurement models just mentioned should be based on theory, not by default due to the researcher’s lack of awareness about SEM techniques for analyzing composites.

3. There are additional new chapters on SEM analyses in small samples and recent developments in mediation analysis. Surveyed works about mediation analysis concern research designs and definitions of mediated effects, including natural direct and indirect effects and interventional direct and indirect effects estimated in clinical trials, among other topics. There is also coverage of new reporting standards for SEM studies by the American Psychological Association (APA) and the technique of piecewise SEM, which is based on concepts from Pearl’s SCM. There are also extended tutorials on modern techniques for dealing with missing data, including multiple imputation and full information maximum likelihood (FIML), and also about instrumental variable methods as a way to deal with the confounding of target causal effects.

4. The topics of specification and identification versus analysis were described in separate chapters in the fourth edition. They are now combined into individual chapters for each technique described in the fifth edition. I believe this more closely integrated presentation helps readers to more quickly and easily develop a sense of mastery for a particular kind of SEM technique.

5. There is greater emphasis on freely available software for SEM analyses in this new edition. For example, the R package lavaan was used in most analyses described in this book. It is a full-featured computer program for both basic and advanced SEM analyses. It has the capability to analyze both common factors and composites as proxies for theoretical concepts. The syntax in lavaan is both straightforward and used in some other R packages, including cSEM for composite SEM, to specify structural equation models, so it has application beyond lavaan. Other R packages used for detailed examples in the fifth edition include semTools, piecewiseSEM, MBESS, MIIVsem, psych, WebPower, systemfit, sem, bmem, CauseAndCorrelation, dagitty, and ggm. Together with the lavaan package, a wide variety of analyses for nonparametric, parametric, and composite models in SEM is demonstrated, all with no-cost software. Commercial software for SEM is still described, including Mplus, which can feature state-of-the-art analyses before they appear in other computer tools, but free SEM software is now nearly as capable as commercial products. Also, I would guess that free software could be used in the large majority of published SEM studies.

6. Extended presentations on regression fundamentals, significance testing, and measurement and psychometrics beloved by readers of the fourth edition are freely available in updated form as primers on the book’s website. This change was necessary to include the new material in the fifth edition. The topics just mentioned are still covered in the new edition but in a more concise way. New to the fifth edition in the main text is a self-test of knowledge about background concepts in statistics and measurement. There is a scoring key, too, so readers can check their understanding of fundamentals. Readers with higher scores could directly proceed to substantive chapters on SEM analyses, and readers with lower scores can consult any of the primers on the website for more information and exercises.

BOOK WEBSITE

The address for the book’s website is https://www.guilford.com/kline-materials. From the site, you can freely access the computer files—data, syntax, and output files—for all detailed examples in this book. The website promotes a learning-by-doing approach. The availability of both syntax and data files means that readers can reproduce the analyses in this book by using the corresponding R packages. Even without doing so, readers can still open the output file on their own computers for a particular analysis and view the results. This is because all computer files are simple text files that can be opened with any basic text editor, such as Notepad (Windows), Emacs (Linux/UNIX), or TextEdit (macOS), among others. Syntax files are annotated with extensive comments. Even if readers use a different computer tool, such as LISREL, it is still worthwhile to review the files on the website generated in the R environment. This is because it can be helpful to view the same analysis from somewhat different perspectives. Some of the exercises for this book involve extensions of the analyses for these examples, so there are plenty of opportunities for practice with real data sets.
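
To give a flavor of the lavaan model syntax mentioned above, here is a minimal sketch; the variable names (X, M, Y) and the data frame are invented for illustration and are not one of the book’s detailed examples:

```r
# A minimal lavaan sketch (hypothetical variables, not a book example).
# Assumes a data frame d with numeric columns X, M, and Y.
library(lavaan)

model <- "
  # regressions: X is a presumed cause of M, and X and M of Y
  M ~ a*X
  Y ~ b*M + c*X
  # user-defined parameter: the indirect effect of X on Y through M
  ab := a*b
"

fit <- sem(model, data = d)
summary(fit, fit.measures = TRUE)
```

Because packages such as cSEM accept this same model syntax, learning it pays off beyond lavaan itself.
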
PEDAGOGICAL APPROACH

Something that has not changed in the fifth edition is pedagogical style: I still speak to readers (through my author’s voice) as one researcher to another, not as statistician to the quantitatively naïve. For example, the instructional language of statisticians is matrix algebra, which conveys a lot of information in a short amount of space, but readers must already be versed in linear algebra to understand the message. There are other, more advanced works about SEM that emphasize matrix presentations (Bollen, 1989; Kaplan, 2009; Mulaik, 2009b), and these works can be consulted when you are ready. Instead, fundamental concepts about SEM are presented here in the language of applied researchers: words, tabular summaries, and data graphics, not matrix equations. I will not shelter you from some of the more technical aspects of SEM, but I aim to cover fundamental concepts in accessible ways that promote continued learning.

PRINCIPLES > SOFTWARE

You may be relieved to know that you are not at a disadvantage at present if you have no experience using an SEM computer tool. This is because the coverage of topics in this book is not based on the symbolism, syntax, or user interface associated with a particular software package. In contrast, there are many books linked to specific SEM computer programs. They can be invaluable for users of a particular program, but perhaps less so for others. Instead, key principles of SEM that users of any computer tool must understand are emphasized here. In this way, this book is more like a guide to writing style than a handbook about how to use a particular word processor. Besides, becoming proficient with a particular software package is just a matter of practice. But without strong conceptual knowledge, the output from a computer tool for statistical analyses—including SEM—may be meaningless or, even worse, misleading.

SYMBOLS AND NOTATION

Advanced works on SEM often rely on the symbols and notation associated with the original matrix-based syntax for LISREL, which features a profusion of doubly subscripted lowercase Greek letters for individual model parameters, uppercase Greek letters for matrices of parameters for the whole model, and two-letter acronyms in syntax for matrices. For example, the symbols

$\lambda^{(x)}_{12}$, $\Lambda_x$, and LX

refer in LISREL notation to, respectively, a specific loading on an exogenous (explanatory) factor, the parameter matrix of loadings for all such factors, and LISREL syntax that designates the matrix (Lambda-X). Although I use here and there some symbols from LISREL notation, I do not oblige readers to memorize LISREL notation to get something out of the book. This is appropriate because LISREL symbolism can be confusing unless one has learned the whole system by rote.

ENJOY THE RIDE

Learning a new set of statistical techniques is not everyone’s idea of fun. (If doing so is fun for you, that’s okay, I understand and agree.) But I hope the combination of accessible language that respects your intelligence, examples of SEM analyses in various disciplines, free access to background tutorials (i.e., the primers) and computer files for detailed examples, and the occasional bit of sage advice offered in this book will help to make the experience a little easier, perhaps even enjoyable. It might also help to think of this book as a kind of travel guide about language and customs, what to know and pitfalls to avoid, and what lies just over the horizon in SEM land.

PLAN OF THE BOOK

Part I introduces fundamental concepts, reporting standards, preparation of the data, and computer tools. Chapter 1 lays out both the promise of SEM and widespread problems in its application. Concepts in regression, significance testing, and psychometrics that are especially relevant for SEM are reviewed in Chapter 2, which also includes the self-test in these areas. Basic
steps in SEM and reporting standards are introduced in Chapter 3 along with an example from a recent empirical study. How to prepare the data for analysis in SEM and options for dealing with common problems, including missing data, are covered in Chapter 4, and computer tools for SEM, both commercial and free, are described in Chapter 5.

Part II deals with the fundamentals of hypothesis testing in SEM for classical path models, which in the analysis phase feature a single observed measure for each theoretical variable, also called single-indicator measurement. It begins in Chapter 6, which introduces nonparametric SEM as described by Judea Pearl (i.e., the SCM). The SCM is graphical in nature; specifically, causal hypotheses are represented as directed graphs where theoretical variables are depicted with no commitment to any distributional assumptions or specific operational definitions for any variable. Graphs in nonparametric SEM can be analyzed by special computer tools without data. This capability allows researchers to test their ideas before collecting the data. For example, the analysis of a directed graph may indicate that a particular causal effect cannot be estimated unless additional variables are measured. After the data are collected, it is a parametric model that is typically analyzed, and such models and their assumptions are described in Chapter 7. The technique of piecewise SEM, which connects the two perspectives, nonparametric and parametric, through novel techniques for analyzing path models, is covered in Chapter 8.

Chapters 9–12 are perhaps the most important ones in the book. They concern how to test hypotheses and evaluate models in complete and transparent ways that respect both reporting standards for SEM and best practices. These presentations are intended as counterexamples to widespread dubious practices that plague many, if not most, published SEM studies. That is, the state of SEM practice is generally poor, and one of my goals is to help readers distinguish their work above this din of mediocrity. Accordingly, Chapter 9 outlines methods for simultaneous estimation of parameters in structural equation models and explains how to analyze means along with covariances. Chapter 10 deals with the critical issue of how to properly assess model fit after estimates of its parameters are in hand. A critical point is that model fit should be routinely adjudged from at least two perspectives: global or overall fit, and local fit at the level of residuals, which in SEM concerns differences between sample and predicted associations for each pair of measured variables. Chapters 11–12 extend these ideas to, respectively, the comparison of alternative models all fit to the same data and the simultaneous analysis of a model over data from multiple groups, also called multiple-group SEM.

Part III deals with the analysis of models where at least some theoretical concepts are approximated with multiple observed variables, or multiple-indicator measurement. Such models are often referred to as “latent variable models,” but for reasons explained in Chapter 13, our models include only proxies for latent variables, not latent variables themselves. These proxies are of two general types: common factors based on reflective measurement models and composites based on formative measurement models. The analysis of pure reflective measurement models in the technique of confirmatory factor analysis (CFA) is described in Chapter 14, and Chapter 15 deals with the analysis of structural regression (SR) models—also called latent variable path models—where causal effects between observed variables or common factors are estimated. Chapter 16 is about composite SEM, which analyzes causal models with multiple-indicator measurement based on formative, not reflective, measurement and where proxies for conceptual variables are composites, not common factors. Application of the technique of confirmatory composite analysis (CCA), the composite analog to CFA, is demonstrated.

Part IV is about advanced techniques. How to deal with SEM analyses in small samples is addressed in Chapter 17, and Chapter 18 concerns the analysis of categorical data in CFA. Chapter 19 explains how to analyze nonrecursive models with causal loops that involve two or more endogenous (outcome) variables assumed to influence each other, and Chapter 20 surveys recent developments that enhance, improve, and extend ways to assess hypotheses of causal mediation, or indirect causal effects that involve at least one intervening variable. The state of mediation analysis in the literature is problematic, but some of the newer approaches and methods described in this chapter seem promising. The analysis of latent growth models for longitudinal data is the subject of Chapter 21, and the application of multiple-group CFA to test hypotheses of measurement invariance is dealt with in Chapter 22. The capstone of the book is the summary of best practices in SEM in Chapter 23. Also mentioned in this chapter are common mistakes with the aim of helping you to avoid them.



Part I

Concepts, Standards, and Tools

1

Promise and Problems

This book is your guide to the principles, practices, strengths, limitations, and applications of structural equa-
tion modeling (SEM) for researchers and students without extensive quantitative backgrounds. Accordingly,
the presentation is conceptually rather than mathematically oriented, the use of formulas and symbols is kept
to a minimum, and many examples are offered of the application of SEM to research problems in disciplines
that include psychology, education, health sciences, cognitive assessment, and political science, among oth-
ers. When you finish reading this book, I hope that you will have acquired the skills to begin to use SEM in
your own research in an informed, principled way. Here is my four-point plan to get you there:

1. Review fundamental concepts in regression, hypothesis testing, and measurement beginning in the
next chapter. A self-test of knowledge in each area just mentioned is provided with a scoring key.
Additional resources for review of background concepts are freely available on this book’s website.
2. Convey what should be communicated in complete, transparent, and verifiable reporting of the
results in SEM studies. Formal reporting standards for SEM by the American Psychological Associa-
tion (APA) are both explained and modeled by example.
3. Describe all three major members of the SEM family. Each offers unique perspectives that can help
you in every stage of an SEM study, from planning through operationalization to data collection and
then finally to the analysis of a model that faithfully represents your specific hypotheses.
4. Emphasize best analysis practices and provide warnings about questionable practices that are seen
in too many published SEM studies.

To summarize, the main goal is to help you produce research that is distinguished by its accuracy, complete-
ness, and quality by following best practices, that is, SEM done right.

PREPARING TO LEARN SEM

Listed next are suggestions for the best ways to get ready to learn about SEM. I offer these suggestions in the spirit of giving you a healthy perspective at the beginning of our task, one that empowers your sense of being a researcher.

Know Your Area

Strong familiarity with the theoretical and empirical literature in your research area is the single most important thing you could bring to SEM. This is because everything—from the specification of your initial model to modification of that model in subsequent reanalyses to interpretation of the results—must be guided by your domain knowledge. So, you need, first and foremost, to be a researcher, not a statistician or a computer nerd. This is true for most kinds of statistical analysis in that the value of the product (numerical results) depends on the quality of the ideas (your hypotheses) on which the analysis is based.

Know Your Measures

Kühnel (2001) reminded us that learning about SEM has the by-product that researchers must address fundamental issues of measurement. Specifically, analyzing measures with strong psychometric characteristics, such as good score reliability and positive evidence for validity, is essential in SEM. For example, it is impossible to approximate hypothetical constructs without thinking about how to operationalize and measure those constructs. When you have just a single measure of a construct, it is critical for this single indicator to have good psychometric properties. Similarly, the analysis of measures with deficient psychometrics could bias the results.

Eyes on the Prize

The point of SEM is to test a theory by specifying a model that represents predictions of that theory among plausible constructs measured with appropriate observed variables (Hayduk et al., 2007). If such a model does not ultimately fit the data, this outcome is interesting because there is value in reporting models that challenge or discredit theories. Beginners sometimes mistakenly believe that the point of SEM is to find a model that fits the data, but this, by itself, is not impressive. This is because any model, even one that is grossly wrong (misspecified), can be made to fit the data by making it more complicated (adding parameters). In fact, if a structural equation model is specified to be as complex as possible, it will perfectly represent the data, not only in a particular sample, but also in any other sample using the same variables.

There is also real scientific value in learning about just how and why things went wrong with a model faithfully based on a particular theory. Schreiber (2017) put it like this:

One component about SEM analyses that is rarely talked about is when no model is retained. This is an overlooked area because researchers fear a lack of being able to publish it. A well written and argued “no model retained” manuscript is a joy to read. I have reviewed many in my career and have always argued for publication. If the models and data are not working well, that is worthy of announcing it to the world as students and I have done. (p. 642)

This quote is from a dissertation defense where some committee members were critical of an SEM analysis where no model was retained. Asked by the committee chair for a response, Schreiber (personal communication, December 1, 2020) replied that retaining no model is a beautiful thing when it is apparent that predictions based on theory did not work out in practice. In contrast, whether or not a scientifically trivial model fits the data is irrelevant (Millsap, 2007).

Use the Best Research Computer in the World . . .

Which is the human brain; specifically—yours. At the end of the analysis in SEM—or any other type of statistical analysis—it is you as the researcher who must evaluate the degree of support for the hypotheses, explain any unexpected findings, relate the results to those from prior studies, and consider the implications of the findings for future research. Without your content expertise, a statistician or computer nerd could help you to select data analysis tools or write program syntax, but could not help with the other things just mentioned. As aptly noted by Pedhazur and Schmelkin (1991), “no amount of proficiency will do you any good, if you do not think” (p. 2).

Get a Computer Tool for SEM

Obviously, you need a computer program to conduct the analysis. In SEM, there are now many choices of computer tools, many for no cost. Examples of free computer software include various SEM packages such as lavaan, semTools, cSEM, and OpenMx for R, which is an open-source language and environment for statistical computing and graphics. Other options are Ωnyx, a graphical environment for creating and testing structural equation models, and JASP, an open-source computer program with a graphical user interface (GUI) and capabilities for both frequentist and Bayesian statistical procedures. It also has modules for SEM, including mediation analysis and latent growth modeling. Commercial options for SEM include Amos, Adanco, EQS, LISREL, Mplus, and SmartPLS, among others.
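
Because the free options named above are distributed as R packages on CRAN, getting started takes one command; this minimal sketch assumes only a working R installation:

```r
# One-time setup: install the free SEM packages named above from CRAN
install.packages(c("lavaan", "semTools", "cSEM", "OpenMx"))
library(lavaan)  # then load whichever package you need per session
```
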


In the open science spirit of free access to research tools and resources, greater emphasis in this book is placed on free computer programs for SEM than on commercial options. Specifically, the lavaan package for R is used in most of the detailed analysis examples. It has both basic and advanced options for a wide range of SEM analyses, and its capabilities rival those of commercial software. Several other R packages that supplement or extend lavaan analyses, such as semTools, are also used in examples. Syntax, data, and output files can be freely downloaded from the website for this book—see the Introduction—which is also an exemplar of open access.

Join the Community

An electronic mail network called SEMNET is dedicated to SEM.1 It serves as an open forum for discussion of the whole range of issues associated with SEM. It also provides a place to ask questions about analyses or more general issues, including philosophical ones (e.g., the nature of causality or causal inference). Subscribers to SEMNET come from various disciplines, and they range from novices to seasoned veterans, including authors of many works cited in this book. Sometimes the discussion gets lively (sparks can fly), but so it goes in scientific discourse. An archive of prior discussions on SEMNET can be searched for particular topics. A special interest group for SEM is available for members of the American Educational Research Association (AERA).2 There is even a theme song for SEM, the hilarious Ballad of the Casual Modeler, by David Rogosa (1988).3 You can blame me if the song gets stuck in your head.

1 https://listserv.ua.edu/cgi-bin/wa?A0=semnet
2 https://www.aera.net/SIG118/Structural-Equation-Modeling-SIG-118
3 https://web.stanford.edu/class/ed260/ballad.mp3

DEFINITION OF SEM

The term structural equation modeling (SEM) refers to a set of statistical techniques for estimating the magnitudes and directions of presumed causal effects in quantitative studies based on cross-sectional, longitudinal, experimental, or other kinds of research designs. Its application can range from more exploratory and data-driven research, where preliminary causal models are generated, to more confirmatory research, where one or more extant models based on a priori hypotheses are tested or compared. While the data in SEM come from measured variables, they can be treated in the analysis as approximating hypothetical constructs. Thus, by analyzing manifest variables as indicators for target constructs, it is also possible to estimate causal relations among those constructs.

Pearl (2012, 2023) defined SEM as a causal inference method that takes three inputs (I) and generates three outputs (O). The inputs are

I-1. A set of causal hypotheses based on theory or results of empirical studies that are represented in the structural equation model. The hypotheses are typically based on assumptions, only some of which can actually be tested in the data.

I-2. A set of queries or questions about causal relations among variables of interest such as, what is the magnitude of the direct causal effect of X on Y (represented as X → Y), controlling for other presumed causes of Y? All queries follow from model specification.

I-3. Most applications of SEM are in observational studies, or nonexperimental designs, but data from experimental or quasi-experimental studies can be analyzed, too—see Breitsohl (2019) for examples.

The outputs of SEM are

O-1. Quantitative estimates of model parameters for hypothesized effects including, for example, X → Y, given the data.

O-2. A set of logical implications of the model that may not directly correspond to a specific parameter but can still be tested in the data. For example, a model may imply that variables W and Y are unrelated after controlling for certain other variables in the model (see the sketch after this list).

O-3. The degree to which the testable implications of the model are supported by the data.
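
To make inputs I-2 and outputs O-2 concrete, here is a small sketch using the R package dagitty, one of the free packages used for detailed examples in this book; the graphs and variable names below are hypothetical, not examples from the book:

```r
# Sketch: querying causal graphs with no data (hypothetical graphs).
library(dagitty)

# O-2: a chain W -> X -> Y implies that W and Y are unrelated after
# controlling for X (printed as W _||_ Y | X), a testable implication.
chain <- dagitty("dag { W -> X  X -> Y }")
impliedConditionalIndependencies(chain)

# I-2: for the query "what is the direct effect of X on Y?" when U is
# a common cause of both, the covariate set to control is { U }.
conf <- dagitty("dag { U -> X  U -> Y  X -> Y }")
adjustmentSets(conf, exposure = "X", outcome = "Y")
```
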
BASIC DATA ANALYZED IN SEM

When unstandardized variables are analyzed in SEM, the basic datum for continuous variables is the covariance, which is defined for observed variables X and Y as follows:

covXY = rXY SDX SDY     (1.1)

where rXY is the Pearson correlation and SDX and SDY are their standard deviations.4 A covariance estimates the strength of the linear relation between X and Y in their original (raw score) units, albeit with a single number. Because the covariance is an unstandardized statistic, its value has no fixed lower or upper bound. For example, covariances of, say, –1,025.45 or 19.77 are possible, given the scale of the original scores. The statistic covXY conveys more information than rXY, which says something about association but in a standardized metric only. There are times in SEM when it is appropriate to analyze standardized rather than unstandardized variables. If so, then rXY is the basic datum for variables X and Y. Reasons to analyze unstandardized versus standardized variables in SEM are described at various points in the book.

4 The covariance of a variable with itself is just its variance, such as covXX = s²X.
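
As a quick numerical check on Equation 1.1, the following R lines (with simulated scores, not data from the book) confirm that the sample covariance equals the correlation times the two standard deviations:

```r
# Equation 1.1 with simulated scores: covXY = rXY * SDX * SDY
set.seed(1)
X <- rnorm(100, mean = 50, sd = 10)
Y <- 0.5 * X + rnorm(100, sd = 8)

cov(X, Y)                  # sample covariance, in the raw-score metric
cor(X, Y) * sd(X) * sd(Y)  # identical value via Equation 1.1
```
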
Some researchers, especially those who use ANOVA (analysis of variance) as their main analytical tool, have the impression that SEM is concerned only with covariances or correlations. This view is too narrow because means can be analyzed in SEM, too. But what really distinguishes SEM is that means of latent variables can also be estimated. In contrast, ANOVA is concerned with means of observed variables only. It is also possible to estimate effects in SEM traditionally associated with ANOVA, including between-group and within-group (e.g., repeated measures or longitudinal data) mean contrasts. For example, in SEM one can estimate the magnitude of group mean differences on latent variables, which is not feasible in standard ANOVA. Means are not analyzed in probably most published SEM studies, but the option to do so provides additional flexibility. Several examples of the analysis of means are described later in the book.

Just as in regression analysis, it is possible in SEM to estimate curvilinear relations for continuous variables or analyze noncontinuous variables, including nominal or ordered-categorical (ordinal) variables, among other variations, as either presumed causes or outcomes. Interactive effects can also be analyzed in SEM. An outdated view is that only linear effects of continuous variables that do not interact can be estimated (Bollen & Pearl, 2013), but the reality is that SEM is extraordinarily flexible in terms of effects and types of variables that can be analyzed as presumed causes or outcomes. There are other differences between SEM and regression techniques: Associations for observed variables only are estimated in standard regression analysis, but relations for latent variables can also be estimated in SEM. In regression, the roles of predictor and criterion are theoretically interchangeable. For instance, there is no special problem in bivariate regression with specifying X as a predictor of Y in one analysis (Y is regressed on X) and then in a second analysis with regressing X on Y. There is no such ambiguity in SEM, where the specification that X affects Y is a causal link that reflects theory and also depends on other assumptions in the analysis. Thus, both the semantics and interpretation of the results in regression versus SEM are distinct (Bollen & Pearl, 2013). Yes, there are situations where standard regression techniques can be used to estimate presumed causal effects, but the context for SEM is causal modeling, not mere prediction.

FAMILY MATTERS

The method of SEM consists of three distinct families of techniques or approaches to causal inference. All originated in the pioneering work by the geneticist Sewall Wright. His method of path coefficients (Wright, 1934)—or path analysis as it is now called—featured the estimation of causal effects based on hypotheses represented in a statistical model. Wright’s path models included both observed and latent variables, and in graphical form they closely resemble today’s model diagrams in SEM (e.g., Pearl, 2009, p. 415). Presumed causal effects were estimated in sample data, and predictions based on the model were compared with patterns of observed associations in samples. In hindsight, Wright’s innovations were remarkable, and his work continues to have influence to this day.

Although all three SEM families date to key innovations in the 1970s–1980s, each was further developed in relatively distinct sets of research disciplines or areas. Consequently, other than a handful of scholars knowledgeable of the histories and intricacies in at least two different SEM families, many researchers trained in a particular SEM family were only partially aware of other possibilities or approaches to causal modeling. But things are changing, in part due to the increasing influence of multidisciplinary research that includes several disciplines in an integrated way under the same
Promise and Problems 11

subject or area of study. The promise is that different viewpoints or methods over disciplines will complement each other and provide a more comprehensive understanding of the problem or novel approaches to dealing with it (Choudhary, 2015).

The three SEM families are listed next and described afterward:

1. Researchers in psychology and related disciplines are probably most familiar with the SEM family called covariance structure analysis, covariance structure modeling, or covariance-based SEM. All techniques of this type estimate parameters of causal models made up of observed variables or proxies for latent variables by minimizing the difference between the sample covariance matrix and the predicted covariance for the same measured variables, given the model, which represents hypotheses about how and why model variables should be related (covary); a brief software sketch of this fitting logic follows the list. Latent variables are approximated with common factors of the type analyzed in classical factor analysis techniques that date to the beginning of the 1900s (e.g., Spearman, 1904). The technique of confirmatory factor analysis (CFA) is a member of this family. Although this choice reflects my own background in psychology, I will use the simpler term "traditional SEM" to refer to the covariance-based SEM family.5

2. The second member of the SEM family is better known among researchers in disciplines such as marketing, organizational research, business research, or information systems, among others. It is called variance-based SEM, composite SEM, or partial least squares path modeling (PLS-PM). The term "PLS-PM" also refers to an estimation algorithm based on regression techniques that underlies many, but not all, applications of composite SEM. It analyzes composites, or weighted combinations of observed variables, to approximate hypothetical constructs. The term "variance-based" signals that the goal is not strictly to explain the sample covariance matrix, although this goal can also be pursued in a composite analysis. Instead, these techniques analyze total variation among observed variables when estimating causal effects, given the model. The classical general linear model (GLM) of multivariate statistics that includes multiple regression, MANOVA (i.e., multivariate ANOVA), and canonical variate analysis (canonical correlation), among others, analyzes composites, too, but the focus of composite SEM is aimed more at causal modeling. The technique of confirmatory composite analysis (CCA), a member of this family, is the composite-based analogue to CFA. The name "composite SEM" is used from this point forward.

3. The third family member, more familiar to researchers in epidemiology, computer science, and medicine, is the structural causal model (SCM) or nonparametric SEM, which originated in Judea Pearl's work in the 1970s–1980s on Bayesian probability networks and was later extended to the more general problem of causal inference (Pearl, 2009). In the SCM, causal hypotheses are represented in a directed acyclic graph (DAG) when unidirectional causation is assumed or in a directed cyclic graph (DCG) when certain variables are hypothesized to affect each other, or reciprocal causation. The method is nonparametric because the specification of a causal graph requires no commitment to any particular operational definition, distributional assumption, or specific functional form of statistical association, such as linear versus curvilinear, for any pair of variables. Unlike model diagrams in the other two SEM families, which are basically static entities that require data to be analyzed, there are special methods and computer programs for analyzing a causal graph with no data. This capability permits the researcher to analyze alternative causal models in the planning stage of a study. The results of such analyses can inform the researcher about how to select covariates that control possible confounding of target causal effects, among other possible ways to deal with the problem. This approach has also motivated the development of novel approaches to mediation analysis that are described later in the book. From now onward, this family is called "nonparametric SEM."

5 Yes, what is considered as tradition by a person with a particular background may be seen as novelty to another person from a different background. That's life and multidisciplinary research.
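To make the first family's fitting logic concrete, here is a minimal sketch in R using the lavaan package (mentioned again below), fitting a one-factor model to lavaan's bundled Holzinger–Swineford data set. The estimator seeks parameter values that make the model-implied covariance matrix as close as possible to the sample covariance matrix:

# Covariance-based SEM in miniature: one common factor, three indicators.
library(lavaan)
model <- 'visual =~ x1 + x2 + x3'
fit <- cfa(model, data = HolzingerSwineford1939)  # data set bundled with lavaan
lavInspect(fit, "sampstat")$cov  # sample covariance matrix
fitted(fit)$cov                  # model-implied covariance matrix, given the estimates

The discrepancy between those two matrices is what estimation in this family tries to minimize.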
Traditional SEM

This SEM family is a synthesis of two frameworks: the path analytic method by Wright, based on regression techniques to estimate causal effects; and the factor analytic approach to estimate latent variables from observed or manifest variables. It dates to (1) the introduction of path analysis to the social sciences in the 1950s–1960s by Blalock (1961) and others and (2) the subsequent integration of regression techniques and factor-analytic methods into a unified framework
in the 1970s–1980s called the JWK model (Bentler, 1980). The acronym refers to the work of three authors: K. G. Jöreskog, J. W. Keesling, and D. Wiley. The first widely available computer program to fit causal models to sample covariance matrices was LISREL III by Jöreskog and Sörbom (1976), which was the progenitor for later versions of LISREL and other computer programs that include Amos, EQS, lavaan, and Mplus, among others.

This family of traditional SEM techniques is probably the most widely used. It offers the potential benefits summarized next (Bagozzi & Yi, 2012): By integrating regression techniques with factor analytic methods, it is possible to estimate causal effects for any combination of observed or latent variables specified as presumed causes or outcomes. The explicit distinction between observed and latent variables can take direct account of measurement error, or score unreliability. This characteristic lends a more realistic sense to the analysis in that researchers in the behavioral sciences often analyze scores that are subject to measurement error. As mentioned, traditional SEM supports a range of applications from exploratory to more confirmatory, depending on the researcher's hypotheses and aims. There is a rich variety of models that can be analyzed for longitudinal data, more so than in composite SEM.

There are also limitations of traditional SEM: It requires large samples, which makes it challenging to apply the method in research areas where it is difficult to collect large numbers of cases, such as in studies of rare disorders. Exactly what is meant by "large samples" is addressed in a later section of this chapter. There are certain types of hypotheses that are difficult to test in this SEM family. An example is when hypothetical constructs are defined in ways that contradict the estimation of latent variables with common factors as one would do in the technique of CFA. In such cases, the composite SEM family—which uses composites to approximate concepts, not common factors—may serve as a better alternative for reasons explained in Chapters 3 and 16. Other limitations are due to the misuse of traditional SEM, which is unfortunately widespread. These problems are not due to the method per se, but I would guess that many, if not most, published SEM studies have one flaw so severe that the results may have little or no interpretative significance. This critical issue is also elaborated later in the chapter. See Bollen, Fisher, Lilly, et al. (2022), who described the 50-year history of traditional SEM since 1972, reviewing strengths and vulnerabilities, and outlining future directions.

Traditional SEM itself is part of an extended family of methods for analyzing latent variable models that is briefly outlined next; see the sources cited for more information. For example, latent variables in traditional SEM are assumed to be continuous. There are other techniques for analyzing models with categorical latent variables. The levels of a categorical latent variable are called classes, and they represent a mixture of subpopulations where membership is not known but is inferred from the data. Thus, a goal of the analysis is to estimate the nature and number of latent classes. The technique of latent class analysis can be viewed as a kind of factor analysis but one where classes are approximated from observed variables that could be either categorical or continuous. Lanza and Rhoades (2013) described applications of latent class analysis to identify subgroups in treatment outcome studies.

Muthén (2001) described the analysis of mixture models—also called mixture modeling—with latent variables that may be continuous or categorical. When both are present in the same model, the analysis is basically traditional SEM but conducted across inferred subpopulations. The work just cited is part of the larger ongoing effort to express all latent variable models within a common mathematical framework (Bartholomew, 2002). The Mplus computer program is especially adept at analyzing a variety of latent variable models. This is because it can analyze all basic kinds of traditional SEM models and mixture models, too. Both kinds of analyses just mentioned can also be combined in Mplus with multilevel modeling—also called linear mixed modeling or hierarchical linear modeling—for analyzing data with repeated measurements or that are organized in hierarchical levels, such as children within families, where scores within the same level are probably not independent (Nezlek, 2008). Computer programs like Mplus blur the distinction between traditional SEM and techniques such as latent class analysis, mixture modeling, and multilevel modeling.

Composite SEM

Composite SEM was developed in the 1970s–1980s by Herman O. A. Wold (1982). It analyzes composites as proxies for hypothetical variables, not common factors. Statistical methods for composites are generally simpler than methods for common factors, and there are fewer distributional assumptions and other requirements for composite methods. This approach was once described as a "soft modeling" alternative to
"heavyweight" traditional SEM with its complex estimation algorithms, potential for technical problems to scuttle the analysis, and the need for large samples. In its early years, the emphasis in composite SEM was on prediction of target constructs by other variables, including covariates or other composites for different theoretical variables. Because estimators were based on standard regression techniques, these methods generally maximized R², or the proportion of explained variance, for outcome variables. In contrast, traditional SEM techniques do not necessarily maximize R² for individual outcomes. Thus, composite SEM at the time was generally thought of as a prediction method versus traditional SEM, which was seen as emphasizing explanation through maximizing the similarity of sample and model-implied data matrices.

In addition to the distinction of prediction versus explanation, authors of works about composite SEM published roughly in the 1990s–2010s (e.g., Hair et al., 2012) highlighted these relative advantages: For the same number of observed variables in the model, composite SEM generally required smaller sample sizes than traditional SEM, and the power of significance tests was described as generally higher in composite SEM for the same sample size. Its relative lack of distributional assumptions was rightly touted as an advantage, as was its more exploratory nature compared with traditional SEM. Drawbacks included its lack of a direct way to control for measurement error or test the overall fit of the model to the data, which are standard practices in traditional SEM. The inability to assess overall fit in composite SEM limited its role in testing theories; specifically, it offered no direct way to determine how well the model as a whole explained the data. At the time, composite SEM could be described as "SEM-lite" for testing hypotheses about latent variables compared with its older brother, traditional SEM.

Oh how times change. Technical and conceptual advances in composite SEM since about 2010 have been a turning point, and this is no hype. (I am not big on hype; keep it real, please.) Just a few key developments are summarized next; see Henseler (2021) and Chapter 16 in this book for more detailed accounts: It is now possible to apply composite SEM across the whole range of studies from more exploratory to more confirmatory. One reason is that new methods allow the researcher to test the fit of the whole model to the data, just as in traditional SEM. The same basic stages of the analysis, from specification of the model through its analysis and possible modification based on the data, can now be followed in both composite SEM and traditional SEM. Very recent developments (e.g., Schuberth, 2021) extend the capabilities of standard SEM computer programs like lavaan, Mplus, LISREL, and others to analyze composite models, which brings to the analysis all the advantages formerly associated with traditional SEM. Composite SEM can also test kinds of hypotheses about measurement that are difficult to evaluate in traditional SEM, and in such cases composite SEM is the preferred technique. Relatively new estimators can directly control for measurement error in composites. All these developments explain why I think it is important for you to know about this member of the SEM family, too.

Nonparametric SEM

Pearl (2009) described his graph-theoretic approach, the SCM, as unifying two frameworks for causal inference, traditional SEM and the potential outcomes model (POM), also called the Neyman–Rubin model after Jerzy Neyman and Donald Rubin (Rubin, 2005). Briefly, the POM elaborates on the role of counterfactuals in causal inference. A counterfactual is a hypothetical or conditional statement that expresses not what has happened but what could or might happen under different circumstances (e.g., "I would not have been late, if I had correctly set the alarm"). In treatment outcome studies, for example, there are two basic counterfactuals: (1) what would the outcome of control cases be, if they were treated; and (2) what would the outcome of treated cases be, if they were not treated? If each case was either treated or not treated, these potential outcomes would not be observed. This means that the observed data—outcomes for treatment versus control—are a subset of all possible combinations.
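The counterfactual logic just described has a compact algebraic statement in potential outcomes notation; the following display is added here as a summary (see Rubin, 2005, for the full framework). For case i with treatment indicator T_i equal to 1 for treated cases and 0 for controls,

\[
\tau_i = Y_i(1) - Y_i(0), \qquad Y_i^{\mathrm{obs}} = T_i \, Y_i(1) + (1 - T_i) \, Y_i(0)
\]

where Y_i(1) and Y_i(0) are the potential outcomes under treatment and control. Because only one term of the individual effect τ_i is ever observed for a given case, individual causal effects cannot be computed directly, which is exactly the problem of unobserved potential outcomes just noted.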
The POM is concerned with conditions under which causal effects may be estimated from data that do not include all potential outcomes. In experimental designs, random assignment helps with the replicability of studies in terms of the equivalence of treated and control groups. Thus, any observed difference in outcome can be attributed to the treatment, provided there is sufficient replication. But things are more complicated in quasi-experimental or nonexperimental designs where the assignment mechanism is both nonrandom and unknown. In this case, the average observed difference may be a confounded estimator of treatment effects versus selection factors. In such designs, the POM distinguishes between equations for the observed data
versus those for causal parameters. This feature helps to clarify how estimators based on the data may differ from estimators based on the causal model (MacKinnon, 2008). The POM has been applied many times in randomized clinical trials and in mediation analysis, a topic covered later in this book.

Some authors describe the POM as a more disciplined method for causal inference than traditional SEM (Rubin, 2009), but such claims are problematic for two reasons (Bollen & Pearl, 2013). First, it is possible to express counterfactuals in SEM as predicted values for outcome variables, once we fix values of its causal variables to constants that represent the conditions in counterfactual statements (Kenny, 2021). Second, the POM and SEM are logically equivalent in that a theorem in one framework can be expressed as a theorem in the other. That the two systems encode causal hypotheses in different ways—in SEM as functional relations among observed or latent variables and in the POM as statistical relations among counterfactual (latent) variables—is just a superficial difference (Pearl, 2009). Thus, the SCM as outlined by Pearl is a framework for causal inference that extends the capabilities of both traditional SEM and the POM. This is why Hayduk et al. (2003) described the SCM as the future of SEM and also why I introduce it to readers both in this edition and in the previous (4th) edition of this book. Grace et al. (2012) described the SCM as the third generation of SEM. The first generation of SEM dates to Sewall Wright, who invented path analysis and corresponding path diagrams as the graphical expression of causal hypotheses. The second generation of SEM is the synthesis of path analysis and factor analysis in traditional SEM.6 Earlier I described the possibility in this approach of analyzing the graphical model before the data are collected as a way to help plan the study. These methods can also locate testable implications implied by the graph. No special software is needed to analyze the data. This is because testable implications for continuous variables can be estimated with standard computer tools for statistical analysis, such as IBM SPSS, with no need for special SEM software. This particular approach is called piecewise SEM, which is described in Chapter 8.

6 Muthén (2001) described second-generation SEM as the capability in traditional SEM to analyze continuous or categorical outcomes in models with fixed or random effects, such as latent growth models.
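To illustrate the idea with hypothetical variables (my sketch, not an example from the book), suppose a causal graph implies that x and z are independent once y is controlled, say, because the graph is x → y → z. That single testable implication can be checked with ordinary regression software:

# Checking one testable implication of the graph x -> y -> z
# with base R only; d holds simulated, hypothetical data.
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- 0.5 * d$x + rnorm(200)
d$z <- 0.7 * d$y + rnorm(200)
fit <- lm(z ~ y + x, data = d)
summary(fit)$coefficients["x", ]  # the coefficient for x should be near zero, consistent with the graph

Testing each such implication separately, equation by equation, is the essence of the piecewise approach.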
PEDAGOGY AND SEM FAMILIES

Next, I explain my approach to teaching you about SEM, given the descriptions of the three SEM families just considered. First, beginning in Chapter 6, I describe the logic of causal inference for nonparametric structural equation models where only theoretical variables are represented in a causal graph. These variables are not yet operationalized. Of course, theoretical variables must eventually be measured. There could be a variety of ways to operationalize a construct. For example, there is single-indicator measurement, where a single observed variable is the proxy for the corresponding construct. Scores on that indicator could be treated as ordinal data that measure relative standing without assuming equal intervals or treated as continuous data if intervals are assumed to be more or less equal, among other possibilities. An alternative is multiple-indicator measurement, where a set of ≥ 2 observed variables is used to approximate the same theoretical variable. That is, instead of placing all of one's measurement eggs in the basket of a single observed variable, scores from sets of multiple variables are combined when approximating theoretical variables.

But exactly how target constructs should be measured is a decision that should come after working out your basic causal hypotheses by specifying and analyzing a causal graph in nonparametric SEM. Concepts about ways to control confounding and determine whether it is possible to estimate specific causal effects in the graph are reviewed. If analysis of the causal graph indicates that a particular effect cannot be estimated, then something needs to be done, such as adding covariates to address confounding. If a causal effect can be estimated, analysis of the graph may indicate that more than one estimate of that effect could be generated in the data. Such knowledge helps the researcher to plan the analysis in ways that prevent unpleasant surprises, such as discovering after collecting the data that certain hypothesized effects cannot be estimated without adding variables to the model (for which it might be too late).
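One freely available tool for this kind of planning analysis is the R package dagitty; the following sketch, with a hypothetical four-variable graph, shows how covariates can be selected before any data exist:

# Analyzing a causal graph with no data (hypothetical example).
# install.packages("dagitty") if the package is not yet installed.
library(dagitty)
g <- dagitty("dag { z -> x; z -> y; x -> m; m -> y }")
adjustmentSets(g, exposure = "x", outcome = "y")  # returns { z }: control z to estimate the effect of x on y
impliedConditionalIndependencies(g)               # testable implications located from the graph alone

Here the analysis says that the effect of x on y is confounded unless z is measured and controlled, which is exactly the kind of unpleasant surprise it is better to discover in the planning stage.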
Next, I describe in Chapter 7 the specification of parametric causal models that correspond to actual measured variables. To keep things simpler, the measurement approach is single indicator and the corresponding parametric models are manifest variable path models analyzed in the classical technique of path analysis, the most senior member of the traditional SEM family. I outline in Chapters 8–12 the fundamental
principles of estimating model parameters, evaluating model fit to the data, testing hypotheses about the model, respecifying the model if its account of the data is rejected, comparing alternative models fit to the same data, and analyzing the same model over multiple groups or conditions. These topics are the core of traditional SEM. They all generalize to latent variable models based on multiple-indicator measurement of the kind analyzed in the technique of CFA, but I want you to understand these fundamentals before we add the complexities of multiple-indicator measurement to the mix. That is, even if you are most interested in latent variable modeling, a strong knowledge of path analysis with observed variables will help you get there.

Coverage of latent variable models begins in Chapter 13 with a conceptual treatment of multiple-indicator measurement. I should caution you that this chapter has a different structure compared with most other chapters in the book. Specifically, there are no data analysis examples nor equations. Instead, we deal with concepts about measurement with multiple indicators that the researcher should understand before applying specific analysis techniques. The goal is to help you specify models where sets of multiple indicators approximate hypothetical constructs in ways consistent with your hypotheses.

Described in Chapters 14–15 are, respectively, the traditional SEM technique of CFA for analyzing measurement models and the analysis of so-called latent variable path models where single or multiple indicators are used to approximate theoretical variables. These types of models represent the apex in traditional SEM that can be extended in many ways to test additional, more advanced hypotheses that are covered in Chapters 17–22. Chapter 16 is devoted to composite SEM with an emphasis on explaining the kinds of hypotheses about measurement that are difficult to test in traditional SEM but can be evaluated with relative ease in composite SEM. I do not describe traditional SEM and composite SEM as competitors; instead, I emphasize their unique roles and capabilities as complementary approaches.

SAMPLE SIZE REQUIREMENTS

Attempts to adapt SEM techniques to work in smaller samples are described in Chapter 17, but it is still generally true that SEM (more so in traditional than in composite SEM) is a large-sample technique. Implications of this characteristic are considered throughout the book, but I can say now that certain types of estimates in SEM, such as standard errors for effects of latent variables, may be inaccurate when the sample size is not large. The risk for technical problems in the analysis is greater, too.

Because sample size is such an important issue, let us now consider the bottom-line question: What is a "large enough" sample size in SEM? It is impossible to give a single answer because the factors that are summarized next can affect sample size requirements:

1. More complex models, or those with more parameters, require bigger sample sizes than simpler models with fewer parameters. This is because models with more parameters require more estimates, and larger samples are necessary in order for the computer to estimate the additional parameters with reasonable precision.

2. Analyses in which all outcome variables are continuous and normally distributed, all effects are linear, and there are no interactive effects require smaller sample sizes. This is in comparison to analyses in which some outcomes are not continuous or have severely nonnormal distributions or in which there are curvilinear or interactive effects. Sample size limitations also curtail the use of estimation methods available in SEM, some of which need very large samples because of the assumptions they make—or do not make—about the data.

3. Larger sample sizes are needed if score reliability is relatively low; that is, less precise data require larger samples in order to offset the potential distorting effects of measurement error. Latent variable models can control measurement error better than observed variable models, so fewer cases may be needed when there are multiple indicators for constructs of interest. The amount of missing data also affects sample size requirements. As expected, higher levels of missing data require larger sample sizes in order to compensate for loss of information.

4. There are also special sample size considerations for particular kinds of structural equation models. In factor analysis, for example, larger samples may be needed if there are relatively few indicators per factor, the factors explain unequal proportions of the variance across the indicators, some indicators covary
appreciably with multiple factors, the number of factors is increased, or covariances between factors are relatively low.

Given all of these influences, there is no simple rule of thumb about sample size that works across all studies. Also, sample size requirements in SEM can be considered from at least two different perspectives: (1) the number of cases required in order for the results to have adequate statistical precision versus (2) minimum sample sizes needed in order for significance tests in SEM to have reasonable power. Recall that power is the probability of rejecting the null hypothesis in significance testing when the alternative hypothesis is true in the population. Depending on the model and analysis, sample size requirements needed for power to equal or exceed, say, .95, can be much greater than those needed for statistical precision.

Results of a computer simulation (Monte Carlo) study by Wolf et al. (2013) illustrate the difficulty with "one-size-fits-all" heuristics about sample size requirements in SEM. These authors studied a relatively small range of structural equation models, including factor analysis models, manifest-variable versus latent-variable models of mediation, and single-indicator versus multiple-indicator measurement models. Minimum sample sizes for both precision and power varied widely across the different models and extent of missing data. For example, minimum sample sizes for factor analysis models ranged from 30 to 460 cases, depending on the number of factors (1–3), the number of indicators per factor (3–8), the average correlation between indicators and factors (.50–.80), the magnitude of factor correlations (.30–.50), and the extent of missing data (2–20% per indicator).

I describe in Chapter 10 various methods to estimate target sample sizes in traditional SEM, but here I can suggest at least a few rough guidelines about sample size requirements for statistical precision: For latent variable models where all outcomes are continuous and normally distributed and where the estimation method is maximum likelihood—the default method in most SEM computer tools—Jackson (2003) described the N:q rule. In this heuristic, Jackson (2003) suggested that researchers think about minimum sample sizes in terms of the ratio of the number of cases (N) to the number of model parameters that require statistical estimates (q). A recommended sample-size-to-parameters ratio would be 20:1. For example, if a total of q = 10 parameters requires estimates, then a minimum sample size would be 20q, or N = 200. Less ideal would be an N:q ratio of 10:1, which for the example just given for q = 10 would be a minimum sample size of 10q, or N = 100. As the N:q ratio falls below 10:1 (e.g., N = 50 for q = 10, a 5:1 ratio), so does the trustworthiness of the results. The risk for technical problems in the analysis is also greater.
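The N:q arithmetic is simple enough to script. A two-line sketch in R, where the value of q is hypothetical:

# Minimum sample size under the N:q heuristic (Jackson, 2003)
min_N <- function(q, ratio = 20) ratio * q
min_N(10)              # 20:1 rule for q = 10 parameters: N = 200
min_N(10, ratio = 10)  # less ideal 10:1 rule: N = 100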
It is even more difficult to suggest a meaningful absolute minimum sample size, but it helps to consider typical sample sizes in SEM studies. A median sample size may be about 200 cases based on reviews of studies in different research areas, including operations management (Shah & Goldstein, 2006) and education and psychology (MacCallum & Austin, 2000). But N = 200 may be too small when analyzing a complex model or outcomes with nonnormal distributions, using an estimation method other than maximum likelihood, or finding that there are missing data. With N < 100, almost any type of SEM may be untenable unless a very simple model is analyzed, but models so basic may be uninteresting. Barrett (2007) suggested that reviewers of journal submissions routinely reject for publication any SEM analysis where N < 200 unless the population studied is restricted in size. This recommendation is not standard practice, but it highlights the fact that analyzing small samples in SEM is problematic.

Most published SEM studies are probably based on samples that are too small. For example, Loehlin and Beaujean (2017) noted that results of power analyses in SEM are "frequently sobering" because researchers often learn that their sample sizes are far too small for adequate statistical power. Westland (2010) reviewed a total of 74 SEM studies published in four different journals on management information systems. He estimated that (1) the average sample size across these studies, about N = 375, was only 50% of the minimum size needed to support the conclusions; (2) the median sample size, about N = 260, was only 38% of the minimum required and also reflected substantial negative skew in undersampling; and (3) results in about 80% of all studies were based on insufficient sample sizes. We will revisit sample size requirements in later chapters, but many, and probably most, published SEM studies are based on samples that are too small.

BIG NUMBERS, LOW QUALITY

There is no denying that SEM is increasingly "popular" among researchers. Thousands of SEM studies have
been published at an accelerating rate since the 1990s or so (Hair et al., 2012; Thelwall & Wilson, 2016). The increasing availability of computer programs, both commercial and free, has made SEM in its various forms ever more accessible to applied researchers. It is not hard to understand the enthusiasm for SEM. As described by David Kenny in the Series Editor Note in the second edition of this book, researchers love SEM because it addresses questions they want answered and it "thinks" about research problems the way that researchers do. But there is evidence that many—if not most—published reports of the application of SEM have serious flaws as described next.

MacCallum and Austin (2000) reviewed about 500 SEM studies in 16 different psychology research journals, and they found problems with the reporting in most cases. For example, in about 50% of the articles, the reporting of parameter estimates was incomplete (e.g., unstandardized estimates omitted); in about 25% the type of data matrix analyzed (e.g., correlation vs. covariance matrix) was not described; and in about 10% the model specified or the indicators of factors were not clearly specified. Shah and Goldstein (2006) reviewed 93 articles in four operations management research journals. In most articles, they found that it was hard to determine the model actually tested or the complete set of observed variables. They also found that the estimation method used was not mentioned in about half of the articles, and in 31 out of 143 studies, the model described in the text did not match the statistical results reported in text or tables.

Zhang et al. (2021) reviewed 144 SEM studies published in 12 top organizational and management journals in 2011–2016. Each article was evaluated against criteria that included

1. The clarity of the rationale for using SEM over alternative methods.
2. Whether the statistical models analyzed were described in sufficient detail.
3. The completeness of the reporting on data integrity, including the rationale for the sample size and information about distributional assumptions, missing data, and reliability of the scores analyzed.
4. Whether hypotheses were tested in a clear, specific order.
5. Whether the statistical results were described in adequate detail so that readers could evaluate the trustworthiness of the conclusions.

Many problems were apparent. For instance, an explicit justification for using SEM was provided in only about 40% of the studies; distributional assumptions were addressed in about 20%; a specific rationale for the sample size was given in about 30%; and specific details about the correspondence between model and data were reported in 20% of reviewed studies. These results are disheartening, but they are hardly atypical.

Nothing in SEM protects against equivalent models, which explain the data just as well as the researcher's preferred model, but an equivalent model makes differing causal claims. The problem of equivalent models is relevant in probably most applications of SEM, but most authors of SEM studies do not even mention it (MacCallum & Austin, 2000). Ignoring equivalent models is a serious kind of confirmation bias whereby researchers test a single model, give an overly positive evaluation of the model, and fail to consider other explanations of the data (Shah & Goldstein, 2006). The potential for confirmation bias is further strengthened by the lack of replication, a point considered next.

It is rare for SEM analyses to be replicated across independent samples, either by the same researchers who collected the original data (internal replication) or by other researchers who did not (external replication). The need for large samples in SEM complicates replication, but most of the SEM research literature is made up of one-shot studies that are never replicated. It is critical to eventually replicate a structural equation model if it is ever to represent anything beyond a mere statistical exercise. Thus, the SEM research literature is part of the broader replication crisis in psychology and other disciplines. Kaplan (2009) noted that despite over 40 years of application of SEM in the behavioral sciences, it is rare that results from SEM analyses are used for policy or clinically relevant prediction studies.

The ultimate goal of SEM—or any other type of method for statistical modeling—should be to attain what I call statistical beauty, which means that the final retained model (if any)

1. Has a clear theoretical rationale (i.e., it makes sense).
2. Differentiates between what is known and what is unknown—that is, what is the model's range of convenience, or limits to its generality?
3. Sets conditions for posing new questions.

That most applications of SEM fall short of these goals should be taken as an incentive by all of us to do better.

LIMITS OF THIS BOOK

Many advanced applications in SEM are described in Chapters 17–22, but it is impossible (and undesirable, too) to cover the whole range of extended or specialized analyses in a single volume. Just a few of these topics are mentioned next with citations for sources that provide more information. There are special structural equation models for imaging data, such as functional magnetic resonance imaging (fMRI) (Cooper et al., 2019), and also for genetic data (Luo et al., 2019). Bayesian SEM combines the methods of Bayesian estimation with the analysis of structural equation models (Depaoli, 2021). Bayesian options for SEM are mentioned at various points in the book, but their application requires strong knowledge of Bayesian statistics. Two other advanced topics not covered in this book include multilevel SEM (Castanho Silva et al., 2020) and the analysis of interactive effects of latent variables (Cortina et al., 2021), although methods for observed variables are described in Chapters 12 and 20.

SUMMARY

The SEM families of techniques have their origins in regression-based analyses of observed variables, factor-analysis-based evaluations of common factor models, composite-based analyses of proxies for theoretical variables, and methods from computer science for analyzing causal graphs. Essential features include the capabilities to analyze causal models before collecting the data, and the potential to distinguish between observed and latent variables and to test a wide range of hypotheses about measurement and causation. More and more researchers are using SEM, but in too many studies there are serious problems with the way it is applied or with how analysis results are reported. How to avoid getting into trouble with SEM is a major theme in later chapters of this book. The ideas introduced in this chapter set the stage for review in the next chapter of fundamental principles in statistics that underlie SEM.

LEARN MORE

Bollen and Pearl (2013) describe myths about traditional SEM, Hair (2021) outlines the history of composite SEM, and Wolfle (2003) traces the introduction of path analysis to the social sciences.

Bollen, K. A., & Pearl, J. (2013). Eight myths about causality and structural equation models. In S. L. Morgan (Ed.), Handbook of causal analysis for social research (pp. 301–328). Springer.

Hair, J. F. (2021). Reflections on SEM: An introspective, idiosyncratic journey to composite-based structural equation modeling. SIGMIS Database, 52(SI), 101–113.

Wolfle, L. M. (2003). The introduction of path analysis to the social sciences, and some emergent themes: An annotated bibliography. Structural Equation Modeling, 10(1), 1–34.


2

Background Concepts and Self‑Test

Newcomers to SEM should have good statistical knowledge in at least three areas: (1) regression analysis;
(2) correct interpretation of results from statistical significance testing including the role of bootstrapping;
and (3) psychometrics, or statistical measures of the properties of scores from psychological tests including
evidence for their reliability or validity. Some estimates in SEM are interpreted exactly as regression coef-
ficients, and these interpretations depend on many of the same assumptions as in regression analysis. The
potential for bias due to omitted predictors that covary with measured predictors or due to measurement
error in predictors or the criterion is similar in both regression and SEM when (1) there is a single observed
measure of each construct, and (2) score reliability is not explicitly represented nor controlled in the analysis.
Results of significance testing both for the whole model and for estimates of its individual parameters are
widely reported in SEM studies, although statistical significance is hardly the sole basis for inference in SEM
for reasons explained throughout the book. It is also true that results in significance testing are widely mis-
understood in perhaps most analyses, including SEM, and you need to know how to avoid making common
mistakes. If predictor or outcome variables are measured with psychological tests, their score reliabilities
should be routinely assessed in the researcher’s sample. Knowledge of psychometrics is also essential when
selecting among alternative measures of the same target concept.
This chapter serves as a check on your understanding of regression fundamentals, significance test-
ing, and psychometrics, and as a gateway to additional resources for self-study. Specifically, after a review
of potential obstacles to a strong command of these topics and a summary of background topics from the
perspective of learning about SEM, a self-test of comprehension with a scoring system is provided. Relatively
low scores (e.g., < 50% correct) in a particular area, such as psychometrics, would signal the need for further
reading. To that end, a total of three supplementary chapters, the Regression Primer, Significance Testing
Primer, and Psychometrics Primer, are freely available on the website for this book. Some advice: Even if you
think that you already know these topics, you should take the self-test. Many readers tell me that they learned
something new after hearing about the issues outlined next.

UNEVEN BACKGROUND PREPARATION

There is evidence that not all of the background topics just mentioned are adequately covered in graduate school. For example, Aiken et al. (2008) surveyed just over 200 psychology doctoral programs in the United States and Canada about training in statistics, measurement, and research methods. They replicated a comparable survey from almost 20 years before on the same topic. In 2008, the median for training in statistics and measurement was 1.6 years, associated mainly with a 1-year introductory statistics sequence. About 80% of doctoral programs offered in-depth training in both the "old standards" (e.g., ANOVA) and multiple
regression (MR), with the result that students could generally perform such analyses themselves (Aiken et al., 2008). Significance testing in ANOVA, MR, and other techniques would also typically be covered in introductory graduate statistics courses (Kline, 2020b). But whether graduate students—or even experienced researchers—understand significance testing or related concepts, such as confidence intervals, is questionable, a topic addressed soon.

Aiken et al. (2008) also described critical gaps in statistics or measurement training. For instance, only about 30% of doctoral programs offered in-depth training in regression diagnostics, or techniques for assessing whether assumptions are tenable or if there are scores, such as outliers, with undue influence on the results. Misunderstanding about assumptions in MR is common. Examples include being unaware that perfect score reliability for predictor variables is assumed, believing falsely that measurement error necessarily causes underestimation of individual regression coefficients, and the myth that the requirement for normal distributions applies to the observed scores instead of the residuals (Williams et al., 2013). Over 90% of nearly 900 articles in clinical psychology research journals reviewed by Ernst and Albers (2017) were unclear about assumption checks in MR analyses. Another gap is that logistic regression for dichotomous outcomes is covered in depth in only about 10% of doctoral programs (Aiken et al., 2008). This result suggests that training in other methods for binary data, such as the probit regression model, or methods for categorical outcomes with three or more levels, such as ordinal regression for ordered categories or multinomial logistic regression for unordered categories, is infrequent at the graduate level.

Another gap is that controversies in significance testing are thoroughly covered in only about 30% of psychology doctoral programs (Aiken et al., 2008). This is unfortunate because ongoing debate about the proper role of significance testing, including none, is part of a larger credibility crisis about psychology research that includes concerns about replication, reporting, measurement, and the wasting of perhaps most research funds in some areas (Szucs & Ioannidis, 2017). How to properly report results from SEM analyses is covered in the next chapter, but traditional significance testing is seen as lacking in more and more disciplines; expert and novice researchers alike are unaware of that decline, which is a serious problem (Kmetz, 2019). Reasons for this critical assessment include the likelihood that many, if not most, p values reported in the literature are wrong, the myriad of false beliefs associated with significance testing, and the concern about p hacking—see Topic Box 2.1 for elaboration.

Instruction in measurement was available in about 60% of psychology doctoral programs surveyed by Aiken et al. (2008), but coverage of core topics such as classical test theory, item response theory (IRT), and test construction was typically brief, with a median length of just 4.5 weeks for all topics. This is not enough time to develop any real expertise about measurement. In fact, I bet that an undergraduate student who has taken a one-semester (e.g., 3-credit) course in psychometrics knows more about psychological measurement and test construction than a graduate student without this background. The meagre amount of doctoral-level measurement training in 2008 was actually a slight improvement over results for the same area reported 20 years earlier.

POTENTIAL OBSTACLES TO LEARNING ABOUT SEM

I do not know whether the results just summarized about psychology graduate training in statistics, measurement, and research design apply generally to other disciplines. But in my experience, based on working with SEM novices from disciplines such as education, biology, health sciences, operations management, marketing, and commerce, researchers face the following potential obstacles to learning about SEM:

1. The emphasis in graduate training in regression techniques is mainly on continuous outcomes, so new researchers may be relatively unprepared to apply estimators for categorical outcomes in SEM.

2. Overemphasis of statistical significance can blind researchers to other aspects of the results that are just as critical, if not even more so, than p values. Antonakis (2017) referred to this particular stumbling block as significosis, or an inordinate focus on statistical significance. A related distortion is dichotomania, or the compulsion to dichotomize continuous p values against an arbitrary standard, such as p < .05 for results touted as "significant" versus p ≥ .05 for other results that are discounted, ignored, or lamented because they are "not significant." This binary interpretation is not supported
TOPIC BOX 2.1

Cautionary Tales About Significance Testing: Inaccuracies, Errors, and Hacking
For two reasons, most p values reported in the research literature could be incorrect:

1. Assumptions of significance testing—random sampling from population distributions with known
properties (e.g., normality, homoscedasticity) and the absence of all other sources of error other
than sampling error—are generally implausible. Most samples in human studies are ad hoc, or
samples of convenience selected because they happen to be available. Convenience sampling
may have little, if anything, to do with random sampling. Few distributions in real data sets are
normally distributed or homoscedastic, and even slight departures from distributional assumptions
in small, unrepresentative samples can grossly distort p values (Erceg-Hurn & Mirosevich, 2008).
2. There are mistakes in too many journal articles in the reporting of p values. Nuijten et al. (2016)
found that over 50% of reviewed articles published in psychology research journals during the
years 1985–2013 contained at least one incorrect p value, given reported test statistics and
degrees of freedom.

There is ample evidence that most researchers, including those with the highest levels of statistical
training, do not fully understand what p values mean (McShane & Gal, 2016). Psychology professors in
two different surveys endorsed false beliefs about p values at rates generally no lower than among under-
graduate students (Haller & Krauss, 2002; Oakes, 1986), and about 90% of both groups endorsed at
least one incorrect interpretation. Summarized next are what I call the big five misinterpretations
of p values in significance testing. They are described for the case p < .05 when testing at the .05 level:

1. Most researchers endorse the local Type I error fallacy that the likelihood that the decision just
taken to reject the null hypothesis is a Type I error is less than 5%. This belief is wrong because any
particular decision to reject the null hypothesis is either correct or incorrect, so no probability (other
than 0 or 1.0) is associated with it. Also, it is only with sufficient replication that we could determine
whether or not the decision to reject the null hypothesis in a particular study was correct.
2. Probably most researchers endorse the odds-against-chance fallacy, or the false belief that
the probability that a particular result is due to chance (i.e., arose by sampling error alone) is
less than 5%. This belief is wrong because p is calculated by the computer assuming that the null
hypothesis is true, so the probability that sampling error is the only explanation is already taken
to be 1.0. Thus, it is illogical to view p as somehow measuring the likelihood of chance. Besides,
the probability that sample results are affected by error of some kind—sampling, measurement, implementation, or specification error, among others—is virtually 1.0. From this perspective, basi-
cally all sample results are wrong in that they generate incorrect point estimates of the target
parameter, and significance testing in primary studies does nothing to change this reality (Ioan-
nidis, 2005).
3. Just under half of researchers believe that the probability is less than 5% that the null hypothesis
is true, which is the inverse probability error, also called the fallacy of the transposed
conditional (Ziliak & McCloskey, 2008), and the permanent illusion due to its persistence
over time and disciplines (Gigerenzer & Murray, 1987). This error stems from forgetting that
p values are conditional probabilities of data under the null hypothesis, not the other way around.
There are direct methods in Bayesian statistics to estimate conditional probabilities of hypotheses,
but not in classical significance testing.
4. The false belief that 1 – p is the probability of finding another significant result in a future sample
is the replication (replicability) fallacy, which is endorsed by about half of researchers. For
example, if p < .05, then the likelihood of replication is believed to exceed .95 under this myth.
Knowing the probability of replication in hypothetical future samples would be very useful, but the
quantity 1 – p is just the probability of a result or one even less extreme under the null hypothesis.
In general, replication is a matter of experimental design, sampling, and whether some effect
actually exists in the population.
5. The validity (valid research hypothesis) fallacy is the myth that 1 – p is the likelihood
that the alternative hypothesis is true, a false hope endorsed by about half of researchers. As
mentioned, probabilities of hypotheses are not estimated in significance testing, and both p and
its complement, 1 – p, are conditional probabilities of data, not hypotheses.
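The conditional nature of p emphasized in several of these items can be stated compactly in symbols; this display is added here as a summary. For a test statistic T with observed value t_obs in a one-tailed test,

\[
p = \Pr(T \ge t_{\mathrm{obs}} \mid H_0)
\]

That is, p is the probability of a result at least as extreme as the one observed, computed under the assumption that the null hypothesis is true; it is not the probability that H0, or any other hypothesis, is true.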

There are many other cognitive errors about significance testing outcomes. For example, the filter myth is the false belief that p values sort results into two categories: those "significant" findings that are not due to chance versus the "not significant" results that are due to chance. This myth is basically an extension of the odds-against-chance fallacy applied to results with higher (not significant) p values. Authors in just over 50% of nearly 800 published empirical studies reviewed by Amrhein et al. (2019) committed the zero fallacy—also called the slippery slope of nonsignificance (Cumming & Calin-Jageman, 2017)—by falsely interpreting the absence of statistical significance as evidence for a zero population effect size.

The misinterpretations just described plus others—see the Significance Testing Primer—can promote a cognitive style where researchers report low p values with great confidence or even bravado but with little real comprehension (Lambdin, 2012). The same distortions can also lead to confirmation bias where statistically significant results are uncritically accepted as proving the researcher's hypotheses—see Calin-Jageman and Cumming (2019) for examples. False beliefs also pose a challenge to instructors of a first graduate statistics course: Students' heads are filled with myths about significance testing that can interfere with further learning unless those fallacies are identified and remediated (Kline, 2020b).
Confidence intervals are described as ways to summarize results in a more comprehensive way than is possible in significance testing, and statisticians generally prefer interval estimation over point estimation, or the reporting of results with no margins of error (or error bars in graphical form) (Cumming & Calin-Jageman, 2017). Because significance testing and confidence intervals are based on essentially the same concepts and assumptions, it is not surprising that similar kinds of errors can affect both. Abelson (1997) described the law of diffusion of idiocy, which states that every misstep in significance testing has a counterpart with confidence intervals. There is also evidence that confidence intervals, too, are widely misinterpreted (Hoekstra et al., 2014). Listed next are some common false beliefs (Morey et al., 2016):

1. Fundamental confidence fallacy, or the myth that confidence level, such as 95%, indicates
the likelihood (i.e., .95) that the interval contains the value of the target parameter.
2. Precision fallacy, or the misunderstanding that narrower confidence intervals automatically
signal greater precision of knowledge about the parameter.
3. Likelihood fallacy, or the untruth that a particular confidence interval includes equally likely
values for the parameter.

The phenomenon of p hacking involves the possibility of presenting any result as “significant”
through decisions, such as covariate selection, transformations of scores, or the treatment of missing data,
that are not always disclosed (Simmons et al., 2011). Most instances of hacking involve lowering p val-
ues so that key results are significant, but p values can also be increased when results that are not sig-
nificant favor the researcher’s goals. Perhaps the most infamous example of hacking to increase p is the
Vioxx calamity, where analyses of data from a clinical trial were manipulated to render nonsignificant an
increased risk of heart attack in the treatment group (Ziliak & McCloskey, 2008). Both p hacking and other
dubious practices that favor the researcher’s hypotheses are probably widespread (John et al., 2012). The
parallel in SEM is model hacking, where significance test results are manipulated to increase the odds
of retaining the model, which represents the researcher’s hypotheses. How to avoid model hacking in SEM
begins with complete and transparent reporting of the results, including explaining the bases for deciding
whether to retain any model.

by what can be trivial differences in continuous p val- SIGNIFICANCE TESTING


ues; that is, a result where p = .04 does not appreciably
differ from another result where p = .06 when testing at Many effects can be tested for statistical significance in
the .05 level. SEM, ranging from things such as the variance for a sin-
gle variable up to entire models evaluated across mul-
3. Relatively little formal training in measurement tiple samples. There are four reasons, however, why the
or psychometrics can put researchers in a difficult role for significance testing in SEM should be smaller
position when selecting measures for their research or compared with more standard techniques like ANOVA:
evaluating the quality of data from psychological tests
in their own samples. It can also hinder accurate and 1. The capability to evaluate an entire model at
complete reporting about psychometrics. once brings a higher-level perspective to the analysis.
Although statistical tests of individual effects in the
Whatever researchers lack in their training about model may be of interest, at some point the researcher
statistics, measurement, or research design when must make a decision about the whole model: Should
approaching what is, for them, a new technique such as it be rejected?—modified?—if so, how? This deci-
SEM can be addressed through self-study or participa- sion should not be based on significance testing alone
tion in seminars, summer schools, or other continuing because other factors, such as whether statistical results
education experiences. The need to periodically update are meaningful, given the research context and hypoth-
one’s data analysis skills is also a normal part of being eses, are equally if not more important than statistical
a researcher. That is, lifelong learning as self-initiated significance. There is also a sense in SEM that the view
education focused on personal or professional devel- of the entire model takes precedence over that of spe-
opment is a healthy perspective not just for research- cific details (individual effects).
ers, but for everyone. So let’s get on with knocking the 2. The technique of SEM generally requires large
rust off what you already know or adding to your skill samples, but in significance testing, effects with low p
set. Described next from the perspective of learning values in very large samples are sometimes of trivial
about SEM are special issues in significance testing, magnitude. By the same token, virtually all effects that
measurement, and regression—see the corresponding are not zero could be significant in a sufficiently large
primers for more detailed presentations. sample. Just the opposite can happen in smaller sam-

Pt1Kline5E.indd 23 3/22/2023 3:52:36 PM


24 Concepts, Standards, and Tools

3. Researchers should be more concerned with estimating effect size and evaluating the substantive significance of their results than with statistical significance, which has little to do with scientific or practical importance. In particular, researchers should not believe that an effect or association exists just because a result is significant, especially if an arbitrary threshold, such as p < .05, is applied to dichotomize continuous p values. Likewise, researchers should not conclude that an effect is absent just because it is not significant (Wasserstein et al., 2019); that is, avoid committing the zero fallacy—see Topic Box 2.1.

4. Standard errors for the effects of latent variables are estimated by the computer, and those standard errors are the denominators of significance tests for those effects. The value of the standard error could change if, say, a different estimation method is used or sometimes even across different software packages for the same analysis and data. Thus, it can happen that an effect for a latent variable is "significant" in one analysis but is "not significant" in another analysis with a different estimator or computer tool. Now, differences in p values across different computer programs for the same effect and estimation method are not usually great, but slight differences in p can make big differences in significance testing, such as p = .051 versus .049 for the same effect when testing at the .05 level.

You may be surprised to know there is no requirement to dichotomize p values at all. This means that exact p values are simply reported but are not compared against any bright-line rule or threshold, such as .05, or any other standard whatsoever. This means that the stale, musty, and worn-out term "significant" is not used for results with low p values, so there are no asterisks or any other symbol in text or tables that designate "significant" results just as there is no special demarcation of other effects with higher p values as "not significant." Hurlbert and Lombardi (2009) referred to this perfectly legitimate reform of traditional significance testing as neo-Fisherian significance assessments (NFSA), and it is consistent with the call to researchers to stop using the term "significant" in point 3 just listed.

You should also know that SEM computer tools generally print by default standard errors for the estimates in the unstandardized solution, but standard errors for standardized estimates might be part of optional output that must be requested by the researcher. Because the unstandardized and standardized estimates for the same parameter have their own standard errors, it can happen that, say, the unstandardized estimate is significant but the standardized estimate is not significant at the same level, or vice versa. This outcome is neither a computer error nor contradictory because unstandardized and standardized estimates each have their own distinct sampling distributions. Confusion about this issue can be minimized by not dichotomizing p values, such as in NFSA-style reporting of the results.
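To make point 4 and NFSA-style reporting concrete, here is a minimal sketch in Python (an illustration added for this discussion with hypothetical numbers, not output from any SEM tool). The generic Wald test divides an estimate by its standard error and refers the ratio to a standard normal distribution; notice how two nearly identical standard errors straddle the .05 threshold even though the exact p values barely differ:

    from scipy.stats import norm

    def wald_z(estimate, se):
        """Wald test for one parameter: z = estimate/SE, two-tailed exact p."""
        z = estimate / se
        return z, 2 * norm.sf(abs(z))

    # Same hypothetical effect, slightly different standard errors
    # (e.g., from two estimation methods or two computer programs)
    for se in (.2540, .2561):
        z, p = wald_z(.50, se)
        print(f"estimate = .50, SE = {se}, z = {z:.2f}, p = {p:.3f}")

The two p values (about .049 and .051) are practically interchangeable, which is exactly why reporting exact p values with no bright-line rule avoids manufactured disagreement between analyses.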


MEASUREMENT AND PSYCHOMETRICS

Given limited training about measurement in many graduate programs, it is not surprising that reporting on psychometrics in the research literature is too often deficient. This is especially true for reliability coefficients, which indicate the degree to which test scores are consistent over variations in testing situations, times, test administrators or scorers, forms, or selections of items from the same domain. Lower values of reliability coefficients indicate less precise scores. If a reliability coefficient, designated here as rXX, equals zero, it means that the scores are basically random numbers, and random numbers measure nothing. The result rXX = 1.0 indicates flawless consistency, but such a result in real data would be pretty extraordinary (i.e., don't hold your breath waiting for perfection).

Reporting values of reliability coefficients should be routine whenever scores from psychological tests are analyzed, but the reality is unfortunately different. For example, Vacha-Haase and Thompson (2011) reviewed nearly 50 meta-analyses of results from about 13,000 primary studies in which test scores were analyzed. They found that about 55% of authors mentioned nothing about score reliability. In 15% of reviewed studies, authors merely reported values of reliability coefficients from other sources, such as test manuals. Inferring from reliability coefficients derived in other samples, such as a test's normative sample, to a different population is called reliability induction. Such reasoning about the generalizability of reliability coefficients needs explicit justification. But authors of reviewed studies rarely compared characteristics of their samples with those from cited studies of score reliability. For example, scores from a computer-based task of reaction time developed in samples of young adults may not be as precise for elderly adults, who may be unfamiliar with this method of testing.

A better practice is for researchers to report values of reliability coefficients in their own samples. It is critical to do so because reliability is a property of scores in a particular sample, not an immutable characteristic of tests. This is because the precision of scores from the same test varies over samples from the same population due to sampling error. Variation in score reliability may be even greater over samples taken from a population different from the target population for the test. Urbina (2014) used the term relativity of reliability to emphasize that (1) the quality of reliability belongs to scores, not tests, and (2) scores for individual cases are more or less reliable due to unique characteristics of examinees, such as motivation or fatigue, and also to examiner qualifications, such as experience in administering or scoring the test, and conditions where testing takes place. Some of these factors are unrelated to the test itself, but they can still influence the consistency and precision of scores. Along these lines, researchers should also cite reliability coefficients in published sources (reliability induction) but with comments on the similarities between samples described in those sources and the researcher's sample.

Appelbaum et al. (2018) described revised journal article reporting standards for quantitative studies (JARS-Quant) for journals published by the American Psychological Association. Revised guidelines for reporting on psychometrics (including SEM studies) call on authors to

1. Report values of reliability coefficients for the scores analyzed, if possible.

2. Describe the specifics of those reliability coefficients, such as the length of the retest interval for test–retest reliabilities, characteristics of scorers and their training for interrater reliabilities, or the specific type of internal consistency coefficients for composite (multi-item) scales.

3. Report the characteristics of external samples if reporting reliabilities from those samples, such as original test normative samples.

4. Provide estimates of convergent and discriminant validity where relevant.
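In the spirit of guideline 1, computing an internal consistency estimate in one's own sample takes only a few lines. Here is a minimal sketch in Python (a generic illustration with simulated placeholder data, not code from this book's materials) for coefficient alpha, using the standard formula alpha = [k/(k – 1)][1 – (sum of item variances)/(variance of total scores)]:

    import numpy as np

    def coefficient_alpha(items):
        """Coefficient alpha for a cases-by-items matrix of item scores."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                         # number of items
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Simulated 5-item scale for N = 200 cases (placeholder data only)
    rng = np.random.default_rng(1)
    common = rng.normal(size=(200, 1))             # shared component over items
    items = common + rng.normal(size=(200, 5))     # plus item-specific noise
    print(f"alpha = {coefficient_alpha(items):.2f}")

Applied to a researcher's actual item scores, the same function yields the sample-specific reliability estimate that the guideline asks for; see the scoring criteria later in this chapter for what alpha does and does not tell you.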


REGRESSION ANALYSIS

Two topics are addressed next: (1) Effects of measurement error on the results in MR analyses and (2) the limited usefulness of significance testing when testing hypotheses about incremental validity in the presence of even moderate amounts of measurement error. Perfect score reliability (rXX = 1.0) is assumed in MR for the predictors but not for the criterion, also called the response, outcome, or dependent variable. This is because only the criterion has residuals, or differences between actual and predicted scores on the response variable. The residuals give random measurement error a place to "go" for the criterion; that is, error is manifested in the regression residuals. But predictors have no residuals to "absorb" their measurement error; thus, it must be assumed that their scores are perfectly consistent.

Measurement error in the criterion only does not bias unstandardized regression coefficients, if that error is unrelated to the predictors. But there is downward bias in the standardized coefficients, which also makes the value of R2 decrease as error in the criterion increases under the same condition (Williams et al., 2013). Things are more complicated when there is error in the predictors: Measurement error in a single predictor that is not shared with any other variable will attenuate the absolute value of the regression coefficient for that predictor. But regression coefficients for the other, error-free predictors can also be affected. Whether those other coefficients are attenuated, exaggerated, or unbiased depends on the values and signs of (1) correlations between the underlying latent variable imperfectly measured by a predictor and all other explanatory variables, and (2) the effect of the conceptual variable just mentioned on the criterion (Bollen, 1989; Kenny, 1979). Values of these correlations and their potential distorting effects are rarely known in practice, so it is difficult to anticipate the magnitudes and directions of this propagation of measurement error in a single predictor.

The effects of measurement error in multiple predictors are also hard to anticipate because there are two sources of distortion for each predictor: Attenuation bias due to error in a particular predictor, plus a term the value of which is influenced by the degree of unreliability in other predictors (Kenny, 1979). It is possible that these two sources of bias could cancel each other out, but that outcome could be rather unlikely. Whether bias is positive or negative in a particular coefficient is also generally unknown, but overall distortion tends to increase as score reliabilities among the predictors are lower.

Another issue is the presence of measurement error that is shared over ≥ 2 predictors or with the criterion. Correlated measurement error can arise when multiple variables are assessed with a common method, variables share common informants or stimuli, or the same variable is measured on ≥ 2 occasions among the same cases, among other possibilities. The signs and magnitudes of the error correlations also affect the degree and direction of bias in regression coefficients (Bollen, 1989; Williams et al., 2013)—see Trafimow (2021) for more information and examples. There are ways to adjust individual regression coefficients for attenuation, but they assume that measurement error is not correlated and, as mentioned, measurement error can attenuate or inflate regression coefficients. Whether such simple corrections give accurate results in real-world data is questionable (Williams et al., 2013).

Incremental validity concerns the relative contributions of different variables in predicting some criterion. That contribution is estimated by the regression coefficients, which statistically control for effects of all other predictors in the equation. In this way, the coefficients indicate the relative contribution of each predictor above and beyond the rest.1 Many researchers interpret a statistically significant regression coefficient as evidence for the incremental validity of the corresponding predictor. There are thousands of published regression studies in which hypotheses about incremental validity are tested in this way (Hunsley & Meyer, 2003), so the issue of whether inferences based on p values for regression coefficients are trustworthy is very relevant.

Based on rational analysis and computer simulation results, Westfall and Yarkoni (2016) demonstrated that effective Type I error rates for significance tests of regression coefficients in MR are surprisingly high even in reasonably large samples (e.g., N = 300) with at least moderate levels of score reliability (e.g., rXX = .80) in analyses with just two predictors. Under the conditions just stated and assuming that the bivariate correlation of each predictor with the criterion is about .50, the error rate in significance testing is just greater than 65%. In even larger samples, the error rate can approach 1.0. These counterintuitive results happen because (1) there is a curvilinear relation such that Type I error rates generally peak over the range rXX = .30–.70, and (2) Type I errors also increase with sample size, holding all else constant. In larger samples, significance tests have greater power to detect a true relation between a predictor and the criterion, but measurement error causes the significance test to conflate a common effect of multiple predictors with the unique contribution for a single predictor (Westfall & Yarkoni, 2016). Thus, it is possible that inferences in MR analyses about incremental validity based mainly on p values are suspect in many, if not most, published regression studies.

You will learn that a relative strength of SEM over standard regression techniques is the capability to explicitly represent in the model the score reliability for any observed variable or the presence of correlated measurement errors (if any) for particular pairs of variables. All other results, including coefficients for putative causal variables, are estimated given the information about psychometrics specified by the researcher. The potential advantage of SEM for testing hypotheses about incremental validity is even greater when theoretical concepts are specified as measured by multiple observed variables, or indicators. Thus, hypotheses about incremental validity are better supported by methods like SEM that can explicitly model unreliability than in standard regression analysis (Westfall & Yarkoni, 2016).

1 Other metrics or statistics for measuring incremental validity include partial correlations, semipartial correlations (also called part correlations), and sequential increases in R2 when predictors are added to the equation in a particular order—see Grömping (2015).
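The phenomenon Westfall and Yarkoni (2016) demonstrated is easy to reproduce in a small Monte Carlo experiment. The sketch below is a simplified illustration, not their actual design; the parameter values (construct correlation .80, score reliability .80, N = 300) are chosen only for demonstration. The criterion is caused by one construct alone, yet the test of the second predictor's "incremental" contribution rejects far more often than the nominal 5%:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    N, reps, rxx, rho = 300, 2000, .80, .80
    hits = 0
    for _ in range(reps):
        T1 = rng.normal(size=N)                        # construct that causes Y
        T2 = rho * T1 + np.sqrt(1 - rho**2) * rng.normal(size=N)  # no unique effect
        X1 = np.sqrt(rxx) * T1 + np.sqrt(1 - rxx) * rng.normal(size=N)  # unreliable proxy
        X2 = np.sqrt(rxx) * T2 + np.sqrt(1 - rxx) * rng.normal(size=N)
        Y = .6 * T1 + np.sqrt(1 - .36) * rng.normal(size=N)
        X = np.column_stack([np.ones(N), X1, X2])
        b, ss_res, *_ = np.linalg.lstsq(X, Y, rcond=None)
        # t test for the coefficient of X2, which has zero incremental
        # validity at the construct level
        se = np.sqrt(ss_res[0] / (N - 3) * np.linalg.inv(X.T @ X)[2, 2])
        p = 2 * stats.t.sf(abs(b[2] / se), df=N - 3)
        hits += p < .05
    print(f"spurious 'incremental validity' rate: {hits / reps:.2f}")

Because the unreliable X1 cannot fully control for the construct it measures, the coefficient for X2 soaks up the residual confounding and is routinely "significant" even though the second construct adds nothing.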


SUMMARY

It helps to begin learning about SEM with a good comprehension of basic concepts in regression, significance testing including bootstrapping, and psychometrics. Not all of these topics may be covered with equal depth in graduate programs, so researchers often need to supplement their knowledge. There are also widespread myths about significance testing and psychometrics that can interfere with learning about SEM. For example, there are many false interpretations of p values from significance testing that altogether can make researchers overly confident in their results and distract them from other aspects of their data, such as effect size or precision. The black-box view that reliability is a property of tests rather than of scores in a particular sample runs counter to the best practice of reporting psychometrics, including values of score reliabilities for the data analyzed. Measurement error in a single predictor in standard regression analysis can bias the coefficients for other predictors, too, and its effects do not always involve attenuation of regression coefficients, especially if measurement error is shared over multiple predictors or with the criterion. Presented next is a self-test of knowledge about concepts in significance testing, regression, and psychometrics. After responding to the questions, score your answers using the criteria that follow the self-test. Refer to the primers available on the book's website for additional information about each background topic.

SELF‑TEST

Sorry, no multiple-choice items here. I believe that essay questions are better able to detect strengths or weaknesses in knowledge about statistics (Kline, 2020b). Write a concise response to each question. The number of possible points for each question is indicated in parentheses, and maximum possible total scores are listed after the questions for each topic. The scoring criteria for each item are given in the next section.

Significance Testing

1. What is a sampling distribution? (4)

2. For the same continuous variable, interpret SD = 15.00 versus SE = 3.00 (respectively, standard deviation vs. standard error of the mean). (4)

3. What does p = .03 in significance testing mean? (4)

4. What is α in significance testing, and how does it differ from p? (4)

5. Define power and β in significance testing. (5)

6. List the factors that affect power. (6)

7. State the combination of factors in question 6 that leads to the highest power. (6)

8. Interpret these results (i.e., each numerical value) for two independent samples: t(48) = 2.25, p = .029 for a one-tailed hypothesis. (6)

9. Given t = 6.75/3.00 = 2.25 for two independent samples, interpret the numerator and denominator, and describe how group size affects each value. (7)

Maximum score = 46

Regression

Questions 1–4 concern predictors X and W and criterion Y. All variables are continuous. For the same data, the unstandardized and standardized regression equations are listed next as, respectively,

Ŷ = 2.15X + 1.30W + 2.34

and

ẑY = .59zX + .34zW

1. Interpret each unstandardized coefficient. (5)

2. Interpret each standardized coefficient. (5)

3. Which variable, X or W, contributes the most to prediction, and why? (3)

4. If X, W, and Y were all measured in a different sample and new regression analyses conducted, which estimates—unstandardized or standardized—are preferred for comparing results for each predictor over the samples, and why? (4)

5. What is the least squares criterion in standard ordinary least squares (OLS) regression? What are advantages and disadvantages? (4)

6. Describe R2 as an estimator of ρ2, the population proportion of explained variance. (4)

7. What is a corrected (adjusted, shrunken, shrinkage-corrected) R2, and what is a potential complication in its interpretation? (6)

8. What is the problem of overfitting (overparameterization) in regression analysis? (4)

9. Describe the effects of omitting a predictor that covaries with the criterion above and beyond all the included (measured) predictors (also called omitted-variable bias or left-out variables error). (5)

Maximum score = 40


Psychometrics

All questions assume classical test theory.

1. What is the difference between reliability and validity? (4)

2. Give a general interpretation for rXX = .80. (4)

3. Given rXX = 1.0 for test–retest reliability, explain what is wrong with this statement: Each case obtained exactly the same score at both occasions. (3)

4. What is the relation between the (Cronbach's) alpha coefficient and the split-half reliability coefficient for scores from the same test? (6)

5. What does the alpha coefficient measure, that is, what determines its value? Comment on this statement: Alpha = .90; thus, the items are unidimensional (they measure a single factor). (5)

6. What is required to evaluate alternate-forms reliability? What is a purpose for alternate forms? What does rXX = .75 mean for immediate versus delayed administration of the forms? (6)

7. What is the standard error of measurement (SEm)? How does rXX affect its value? What does SEm = 7.50 mean? What is a practical application of SEm? (6)

8. Why is it a misconception that the value of the Pearson correlation rXY always has a range from –1.0 to 1.0? What is a risk of this myth? (4)

9. What is convergent validity versus discriminant validity, and what is the concern about common method variance when evaluating either type of validity? (4)

Maximum score = 42

SCORING CRITERIA

Award 1 point for each underlined definition or phrase mentioned in your answer. Calculate the total score for each area and then calculate the proportions of the maximum possible total score.

Significance Testing

1. A sampling distribution is the probability distribution for a statistic over many random samples all the same size (N) and drawn from the same population.

2. The term SD = 15.00 is the square root of the average squared distance between the scores and the mean,2 and it estimates the population standard deviation, σ. The quantity SE = 3.00 is the estimated standard deviation in the sampling distribution of random means each based on N = 25 cases (i.e., SE = SD/√N). Thus, the estimated square root of the average squared distance between sample means and the population mean is 3.00.

3. If the null hypothesis is true and all distributional assumptions are correct, the results in 3% of random samples would be as extreme as the sample result or even more so. Whether this result is "significant" is unknown because no criterion level for statistical significance was stated. In NFSA, there is no criterion level (i.e., p values are not dichotomized), so the terms "significant" or "not significant" do not apply.

4. The quantity α is the criterion level of statistical significance specified before the data are collected. It is the conditional probability of rejecting the null hypothesis over random samples when the null hypothesis is true, or a Type I error. The value of p is the conditional probability of a sample result or one even more extreme assuming that the null hypothesis is true, sampling is random, and all distributional assumptions hold.

5. Power is the probability of correctly rejecting the null hypothesis over random samples when the alternative hypothesis is true. Its complement, or 1 – power, is β, or the probability of failing to reject (retaining) the null hypothesis when the alternative hypothesis is true.3

6. Power is determined by sample size, the magnitude of the true effect in the population, the level of α, the reliability of the scores, whether the effect is between-subjects or within-subjects, and whether the test statistic is parametric or nonparametric.

7. Bigger samples, a larger population effect size, higher levels of α (e.g., .05 instead of .01), higher score reliability, a within-subject effect, and a parametric test statistic are all associated with higher power.

8. The degrees of freedom are 48, which means that the total sample size is N = 50, but the size of each group is unknown.4 The mean difference is 2.25 standard errors in magnitude. Assuming random sampling from two different populations with normal distributions and equal variances, the probability of observing a mean difference as large as 2.25 standard errors or even greater is .029.

9. The group mean difference is 6.75, and its value is not affected by group size. The standard error of the mean difference is 3.00, and it estimates the standard deviation in a sampling distribution of random mean differences, where samples in each pair were selected from two different populations with normal distributions and equal variances. Its value is affected by group size: Larger groups lead to smaller standard errors, keeping all else constant.

2 A common but incorrect response is that SD is the average absolute distance between the scores and the mean—see Huck (2016).

3 It is just as correct to say that 1 – β = power.

4 Equal group sizes for the independent-samples t test are not required.
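If answers 1 and 2 still feel abstract, a quick simulation makes them tangible. This Python sketch (an added illustration, not part of the scoring criteria) draws many random samples of N = 25 cases from a population with σ = 15 and shows that the standard deviation of the resulting sample means is close to the theoretical standard error, 15/√25 = 3.00:

    import numpy as np

    rng = np.random.default_rng(42)
    sigma, n, n_samples = 15.0, 25, 10_000

    # Empirical sampling distribution of the mean: many random samples,
    # all the same size, all drawn from the same population
    means = rng.normal(loc=100.0, scale=sigma, size=(n_samples, n)).mean(axis=1)

    print(f"SD of the sample means: {means.std(ddof=1):.2f}")      # about 3.00
    print(f"theoretical SE = sigma/sqrt(N): {sigma / np.sqrt(n):.2f}")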


Regression

1. All answers for this question refer to the original (raw score) metric of each variable: A 1-point increase in X predicts an increase in Y of 2.15 points, while controlling for W; a 1-point increase in W predicts an increase in Y of 1.30 points, controlling for X; and the predicted score on Y is 2.34 when X = W = 0.

2. Given an increase in X of 1 standard deviation, the expected increase in Y is .59 standard deviations, controlling for W; and Y is predicted to increase by .34 standard deviations, given an increase in W of one standard deviation while we are controlling for X. In the standardized solution, the intercept equals zero because the mean for all variables is zero.

3. Unstandardized coefficients for X and W—respectively, 2.15 and 1.30—are not directly comparable unless the two variables have the same raw score metric. Because both variables do have the same metric in the standardized solution (i.e., mean = 0, standard deviation = 1.0), their standardized coefficients can be directly compared: The relative contribution of predictor X is largest, or about 1.75 times that of W (i.e., .59/.34) in a standard deviation scale.

4. It is the unstandardized coefficients that are generally preferred. If two independent samples differ in their variances for the same measures, the basis for the standardized estimates is not comparable over samples.5 Standardized coefficients are better for comparing the relative contributions of the predictors within the same sample—see Grace and Bollen (2005).

5. The least squares criterion defines a unique solution for the regression coefficients, including the intercept, such that the sum of squared residuals is minimized; that is, Σ(Y – Ŷ)2 is as small as possible. This solution is the best possible solution in a particular sample, one that maximizes the value of R2. A drawback is capitalization on chance: The "best" solution in one sample will not be so in a different sample unless the covariance matrices and means are identical in both samples.

6. The statistic R2 is a positively biased estimator of ρ2 (i.e., R2 generally overestimates ρ2). Bias increases as the sample size decreases for a fixed number of predictors, or bias increases as predictors are added for a fixed number of cases. In very large samples, the expected value of R2 is essentially equal to that of ρ2.

7. An adjusted R2 estimates ρ2 by reducing the value of R2 as a function of sample size (N) and the number of predictors (k). In general, there is greater reduction in smaller samples than in larger samples just as there is greater reduction as the ratio k/N increases (e.g., predictors are added while sample size is constant). Also, there is generally greater reduction for smaller than for larger values of R2. It can happen that adjusted R2 values are < 0 (negative); if so, then it is usually treated as though the corrected R2 is zero (Cohen et al., 2003).

8. Overfitting refers to a regression equation with too many predictors relative to the sample size. It inflates R2 values to the point where they start to describe random variation more so than any real relations among variables. Overfitting also reduces generalizability of the results to another sample; that is, the results may not replicate because they so heavily reflect idiosyncratic variation in the original sample that is not repeated in other samples. For example, if N = 100 and there are k = 99 predictors, then the value of R2 must equal 1.0 because there is no error variance with so many predictors, and this is true even if the data are random numbers.

9. Omitting a relevant predictor increases standard errors for measured predictors, reduces the value of R2, and biases the intercept unless the mean on the omitted predictor is zero. If the omitted variable is uncorrelated with all measured predictors, the regression coefficients for the measured predictors are unbiased; otherwise, their coefficients will be biased (Mauro, 1990). This bias can be positive or negative, and it generally increases as correlations between omitted and included predictors increase.

5 Here is an example: Suppose that the same multiple-choice test is administered to each of two different classes. Scores in each class are reported as the proportion correct, but relative to the highest raw score in each group, not the total number of items. Although the proportions in each class are standardized and have the same range (0–1.0), they are not directly comparable across the classes if the highest scores in each group are unequal. Only the raw scores (total number of items correct) would have the same meaning over classes.
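Answers 6 through 8 can be verified numerically. The sketch below (illustrative code, not from the book) regresses pure noise on pure noise: with k = 99 predictors and N = 100 cases, R2 equals 1.0 exactly, and even k = 50 yields a large R2 from randomness alone, whereas the usual adjusted R2 stays near zero (it is undefined at k = 99 because N – k – 1 = 0):

    import numpy as np

    def r_squared(X, y):
        """R2 from OLS with an intercept, computed via least squares."""
        X1 = np.column_stack([np.ones(len(y)), X])
        yhat = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
        return 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

    rng = np.random.default_rng(3)
    N = 100
    y = rng.normal(size=N)                  # criterion: random numbers
    for k in (2, 50, 99):
        X = rng.normal(size=(N, k))         # predictors: random numbers, too
        r2 = r_squared(X, y)
        adj = 1 - (1 - r2) * (N - 1) / (N - k - 1) if N - k - 1 > 0 else float("nan")
        print(f"k = {k:2d}: R2 = {r2:.3f}, adjusted R2 = {adj:.3f}")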


Psychometrics

1. Reliability concerns the precision or consistency of scores whereas validity concerns the accuracy of interpretations for those scores in a particular context of use. Reliability is a prerequisite for validity—very inconsistent scores measure nothing and thus have no meaningful interpretation—but reliability does not guarantee validity.6 That is, just because scores are consistent and repeatable does not mean that they actually measure the target construct.

2. A total of 1 – .80 = .20, or 20% of the observed variation in scores is due to the type of random error measured by the method used to derive the coefficient (e.g., test–retest, alternate-form, interrater, internal consistency). The remaining 80% is systematic, or not due to the kind of measurement error estimated by the coefficient. Some of that remaining variance could be due to sources of error not estimated by the coefficient. If so, then additional reliability studies would be needed (i.e., those that depend on a different method).

3. Assuming there are individual differences among cases, the pattern where each case obtains exactly the same score at each occasion would generate the outcome rXX = 1.0. But so would any other pattern where the magnitudes of differences between the cases are perfectly preserved over the occasions even though perhaps no case obtained the same score both times. For example, if the values in parentheses listed next represent scores from each case at the first and second occasions, or

(12, 16), (14, 18), (15, 19), (18, 22), and (20, 24)

then rXX = 1.0. Thus, test–retest coefficients measure whether individual differences are perfectly maintained over the two occasions, not just whether the scores are the same.

4. Both methods are based on a single administration of the test. Both are measures of internal consistency but at different levels: Alpha measures consistency at the item level, but the split-half coefficient measures consistency at the level of test halves, or consistency in total scores over two halves of the test corrected for overall test length. There is a single value of the alpha coefficient, but there are as many possible split-half coefficients as there are ways to split test items into two halves. If all items have the same variance, the alpha coefficient equals the average of all possible split-half coefficients.

5. Alpha measures the combined effect of test length and the interrelatedness of responses over test items, which can also be expressed as the ratio of variability in examinees' responses across all test items over the variance of test total scores. Increasing either factor just mentioned will increase the value of alpha. Coefficient alpha assumes that the items are unidimensional (homogeneous), so a high value of alpha does not confirm that hypothesis. It is possible to obtain a relatively high value of alpha even when the items are multidimensional (Schmitt, 1996; Streiner, 2003; Urbina, 2014).

6. The alternate-forms method requires at least two different complete versions of the test that should be comparable in terms of length, difficulty, and content. The forms are meant to be used interchangeably and for the same purpose. In situations where people are required to take the same test on multiple occasions, administration of alternate forms may reduce practice effects. With immediate administration, 1 – .75 = .25, or 25% of observed variation is due to content sampling error, or idiosyncratic effects of item wording, participant familiarity with item content, or inadvertent selection of items from different domains that lead to inconsistent scores over the forms. With delayed administration, a total of 25% of the variance is due to the combined effects of content sampling error and time sampling error, but with no additional information, the two sources of error are confounded.

7. The statistic SEm estimates the standard deviation of observed scores (X) around a hypothetical true (error-free) score T, where observed scores have a random error term, or X = T + E, such that the average of the error component of the scores, E, is zero and normally distributed. One context is a distribution of many repeated measures from a hypothetical case around their true score, and another is a distribution of observed scores for all cases in the population with the same true score. The result SEm = 7.50 says that the square root of the average squared distance between the observed scores and the true score is 7.50. As rXX increases, the value of SEm decreases, and if rXX = 1.0, then SEm = 0. The term SEm is used in constructing confidence intervals for true scores, given observed scores for individual cases.

8. The value of rXY can be perfect (i.e., –1.0 or 1.0) only if the distributions of X and Y have exactly the same shape, their association is strictly linear, and their score reliabilities are perfect (rXX = rYY = 1.0); otherwise, the maximum absolute value for rXY is < 1.0. A risk is that a researcher is unable to accurately judge the strength of association if they are unaware of the range of possible values for rXY. For example, what is thought to be a relatively low value of rXY, such as .30, may actually be close to its maximum value, given the score reliabilities for variables X and Y (Huck, 2016).

9. Convergent validity involves the hypothesis that scores from ≥ 2 tests claimed to measure the same construct should appreciably covary, and discriminant validity concerns the prediction that scores from ≥ 2 tests claimed to measure different constructs should not appreciably covary. Method variance is due to use of a particular measurement or source of information, and common method variance can inflate validity coefficients for two tests based on the same method. Thus, it is better to evaluate convergent or discriminant validity when no two tests are based on the same measurement method.

6 Saying that reliability is a necessary but insufficient condition for validity means the same thing.
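A brief numeric postscript to answer 7: under classical test theory, SEm can be computed from the score standard deviation and a reliability coefficient as SEm = SD√(1 – rXX), which reproduces the behavior described in that answer. A minimal sketch (illustrative values only, not from the book):

    import numpy as np

    def sem(sd, rxx):
        """Standard error of measurement under classical test theory."""
        return sd * np.sqrt(1 - rxx)

    print(sem(15.0, .75))   # 7.5, matching the SEm = 7.50 example if SD = 15.00
    print(sem(15.0, 1.0))   # 0.0, because perfect reliability implies no error

    # Approximate 95% interval around an observed score of X = 100
    s = sem(15.0, .75)
    print(f"95% interval: {100 - 1.96 * s:.1f} to {100 + 1.96 * s:.1f}")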


Well, how did you do? Don't fret if any of your three scores expressed as a percentage are < 50%. A score in this range is not uncommon, especially if your statistics knowledge is rusty—see Kline (2020b). Scores around 70% suggest at least a basic understanding, so your review of the corresponding primer could be more focused on areas of relative weakness. Aced it (> 90%)? Nice, but don't get cocky: You should still browse the primers to check for any missing tidbits in your knowledge.



3

Steps and Reporting

Described in this chapter are the basic steps in SEM and journal article reporting standards for SEM analyses
recently published by the American Psychological Association. The first step in SEM, model specification, is
the most important step of all. This statement is true because results from all later steps assume that the model
analyzed is reasonably correct; otherwise, there may be little meaningful interpretation of the results. The
state of reporting in SEM studies is poor—no, that’s too mild—it is actually in a state of crisis. This is because
too little information about model specification, correspondence between model and data, or possible limita-
tions to conclusions is given in many—and, I would guess, unfortunately, most—published SEM studies. With-
out complete reporting, readers are unable to judge whether the findings are trustworthy, including whether
the model actually fits the data, if a model was retained. Helping you to distinguish your own reporting so
that common mistakes and omissions are avoided is thus the major goal of this chapter.

BASIC STEPS

Six basic steps comprise most SEM analyses, and in a perfect world two additional optional steps would be carried out in every analysis. Review of the basic steps will help you to understand the relation of specification to later steps and to recognize the utmost importance of specification. The basic steps are actually iterative because problems at a particular step may require a return to an earlier step. Note that a possible outcome is the decision to retain no model, which is a legitimate conclusion to an SEM analysis. Remember that the goal in SEM is not to retain a model at any cost—it is instead to test a theory to the best of the researcher's ability. Basic steps are listed next and discussed afterward. Later chapters elaborate on specific issues at each step beyond specification for particular SEM techniques:

1. Specify the model.

2. Evaluate whether the model is identified (if not, go back to step 1).

3. Select the measures (operationalize the concepts) and collect, prepare, and screen the data.

4. Analyze the model:
   a. Evaluate the fit of the model; if it is poor, respecify the model, provided that doing so is justified by theory (skip to step 5); otherwise, retain no model (skip to step 6).
   b. Assuming that a model is retained, interpret the parameter estimates.
   c. Consider equivalent or near-equivalent models (skip to step 6).

5. Respecify the model, which is assumed to be identified (return to step 4).

6. Report the results.

Specification

In SEM, specification involves the representation of the researcher's hypotheses as a series of equations or as a model diagram (or both) that define the expected relations among observed or unobserved variables.

Depending on the theory, these relations could be specified as causal or noncausal. The latter (noncausal) refers to statistical relations that arise due to spurious associations, such as when two variables are affected by a common cause, or confounder, but are not causally related to each other. The model as a whole should thus represent all the ways the variables are expected to relate to one another.

Outcome (dependent) variables in SEM are referred to as endogenous variables, each of which has at least one presumed cause among other variables in the model. Endogenous variables usually have error terms, which represent variation that is not explained by the causes of those variables. Given the hypotheses, an endogenous variable could be specified as a cause of a different endogenous variable. Endogenous variables as just described are intervening (intermediate) variables that act as a causal link between other variables. That is, intermediates are specified as affected by causally-prior variables, and in turn they affect other variables further "downstream" in the causal pathway. For reasons explained in later chapters, an intervening variable is not synonymous with a mediator—also called a mediating variable—but a mediator is always an intervening variable. Other causes of some endogenous variables in the model are independent variables, called exogenous variables in SEM, which themselves are strictly causal. This is because whatever causes exogenous variables are not represented in the model; that is, their causes are unknown as far as the model is concerned.

Whether a variable is endogenous or exogenous is determined by the theory being tested, not by analysis. This means that (1) the model is specified before the data are collected, and the model represents the total set of hypotheses to be evaluated in the analysis; and (2) the technique of SEM is not generally a method for causal discovery—that is, a method that, given the data, finds a true causal model so that the directions and magnitudes of its causal effects can then be estimated. This is not how SEM is typically used: Instead, a causal model is hypothesized, and the model is fitted to sample data assuming that it is correct. Therefore, specification is the most important step.

There are three general contexts for model specification in SEM (Jöreskog, 1993). In a strictly confirmatory application, the researcher has a single model that is either retained or rejected based on its correspondence to the data, and that's it. But this scope of model testing is so narrow that it occurs only on relatively few occasions. A second, somewhat less restrictive context involves the testing of alternative models, and it refers to situations in which ≥ 2 a priori models are available. Alternative models usually include the same observed variables but represent different patterns of hypothesized relations among them. This context requires sufficient bases to specify more than one model, such as at least two competing theories that make different predictions for the same variables. Another example is in relatively new research areas when there is uncertainty about expected patterns of relations. In this more exploratory case, the researchers might test a range of models from simpler to more complex instead of comparing models based on different theories (MacCallum, 1995). In either case, the particular model with the best acceptable fit to the data may be retained, but the rest will be rejected.

Probably the most frequent specification context in SEM is model generation, where an initial model is fitted to the data. If model fit is found to be unsatisfactory, it is respecified usually by adding effects, or parameters, to the original model, which makes the model more complex and also generally improves its fit. But if the initial model has acceptable fit, that model might be simplified by dropping parameters, or making the model simpler, which generally worsens fit. In both scenarios just described, the respecified model is tested again with the same data. The goal of model generation is to "discover" a model with three attributes: It makes theoretical sense, it is reasonably parsimonious, and it has acceptably close correspondence to the data.

Because the initial model in model generation is not always retained, I suggest that researchers make, in the specification step, a list of possible modifications that would be justified according to theory. This means to prioritize the hypotheses, representing just the very most important ones in the model, and leave the rest for a "backup list," if needed. This is because it is often necessary in SEM to respecify models (step 5), and respecification should respect the same principles as specification. Preregistration of the analysis plan would make a strong statement that changes to the initial model were not made after examining the data (Nosek et al., 2018); that is, the basis for changing the model was a priori, not post hoc in order to get a model—at worst, any arbitrary model—to fit the data.
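To make specification as equations concrete, consider a minimal hypothetical example (an illustration added here, not a model analyzed in this chapter): Suppose theory says that exogenous variable X affects endogenous variable Y both directly and indirectly through the intervening endogenous variable M. The specification consists of one equation per endogenous variable:

M = b1X + DM
Y = b2M + b3X + DY

where b1–b3 are coefficients to be estimated and DM and DY are the error (disturbance) terms for the endogenous variables. Exogenous X has no equation because its causes lie outside the model, and the pair of equations asserts that these paths are the only ways the three variables are expected to relate.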


Identification

Although graphical models are useful heuristics for organizing knowledge and representing hypotheses, they must eventually be translated into a statistical model that can be analyzed using a computer program. A statistical model in SEM is a set of simultaneous equations, where the computer must eventually derive a single set of estimates that resolve unknown values, or model parameters, over all equations, given the data.

Statistical models must generally respect certain rules or restrictions. One requirement is that of identification, or whether every model parameter can be expressed as a function of the variances, covariances, or means in a hypothetical data matrix. Meeting this requirement means it is possible to find a unique value for that parameter, given a particular estimation method and its statistical criteria. But if a parameter is not identified, any particular value might not be unique. This means that the researcher could not distinguish true versus false values for that parameter even with access to population data (Bollen, 2002).

Mathematical proof for identification generally involves solving the equations for a parameter in terms of symbols for elements in a hypothetical data matrix, and these symbols generally do not refer to specific numerical values in any given sample. For example, Dunn (2005) described a symbolic proof of the claim that least squares estimators of regression coefficients and intercepts minimize the sum of squared residuals (i.e., the least squares criterion) in standard regression analysis. Note that identification is an inherent property of a parameter in a particular model. This means that a parameter that is not identified remains so regardless of both the data matrix and the sample size (N = 100, 1,000, etc.). Attempts to analyze a model with at least one parameter that is not identified may be fruitless; therefore, a model that is not identified should be respecified (return to Step 1). Presented in Topic Box 3.1 is an intuitive account of model identification. You can use an online simultaneous equations calculator to work through the examples.1

Structural equation models are often sufficiently complex that it is impractical for applied researchers without strong backgrounds in linear algebra to inspect model equations or parameters for identification. Instead, there are graphical methods and identification heuristics, or rules of thumb, that can determine whether some, but not all, kinds of models are identified. There are also computer tools that analyze graphical representations of some, but not all, kinds of path models for identification (Textor et al., 2021), which is both convenient and less subject to error compared with manual application of heuristics. Dealing with identification is one of the biggest challenges for newcomers to SEM (Kenny & Milan, 2012); accordingly, it is a recurrent theme throughout this book.

Suppose that a researcher specifies a structural equation model that is true to a particular theory, but the resulting model is not identified. In this case, there is little choice in SEM but to respecify the model so that it is identified, but respecifying the original model can be akin to making an intentional specification error from the perspective of the theory. There are two basic options: (1) Apply graphical identification methods to determine which parameters in the original model are identified, and then estimate only those parameters; that is, analyze what is possible and skip the rest (i.e., the underidentified parameters). This first option means that the whole model is not analyzed. (2) Respecify the model by adding variables, such as covariates or instrumental variables, so that all parameters are identified while still respecting the original theory. This second option is thus a balancing act between identification and fidelity to theory. But the main point is that identification should be evaluated while planning the study and before the data are collected. Otherwise, it may be difficult—if not impossible—to add variables to the model to fix an identification problem, if the data are already collected.

Measure Selection and Data Collection

The various activities for this stage—select good measures, collect the data, and screen them—are discussed in the next chapter on data preparation and in the Psychometrics Primer available on the website for this book.

Analysis

This step involves using an SEM computer program to conduct the analysis. Computer tools for SEM, both commercial and those freely available, are described in Chapter 5. Here is what takes place during this step:

1 https://www.symbolab.com/solver/simultaneous-equations-calculator


TOPIC BOX 3.1

Conceptual Explanation of Identification


Consider the following formula

a+b=6 (3.1)

where a and b are unknowns that require estimates. Because there are more unknowns than formulas in
Equation 3.1, it is impossible to find a unique set of estimates. In fact, there are an infinite number of solu-
tions for (a, b), including

(4, 2), (8, –2), and (2.5, 3.5)

and so on. Thus, Equation 3.1 is not identified; specifically, it is underidentified or underdetermined,
which here signals the excess of unknowns over the number of formulas or, respectively, 2 > 1. A similar
thing happens when a computer tries to derive unique estimates for an underidentified set of equations: It
is not possible to do so, and the attempt fails.
The next example shows that having an equal number of unknowns and formulas does not guarantee
identification. Consider the following set of formulas:

a+b=6 (3.2)
3a + 3b = 18

Although Equation 3.2 has 2 unknowns and 2 formulas, it does not have a unique solution. Actually, an
infinite number of solutions satisfy Equation 3.2, such as (4, 2), and so on. This happens due to an inherent
characteristic: The second formula in Equation 3.2 is linearly dependent on the first formula (i.e., multiply
the first formula by the constant 3), so it cannot narrow the range of solutions that satisfy the first formula.
Thus, Equation 3.2 is underidentified because the effective number of formulas is 1, not 2.
Now consider the following set of 2 formulas with 2 unknowns where the second formula is not lin-
early dependent on the first:

a+b=6 (3.3)
2a + b = 10

Equation 3.3 has a unique solution; it is (4, 2), and none other. Thus, Equation 3.3 is just-identified, just-determined, or saturated with equal numbers of unknowns and distinct (i.e., not wholly dependent) formulas, or 2 for each. That unique solution (4, 2) also perfectly reproduces the constants in Equation 3.3 (6, 10).
Let’s see what happens when there are more formulas than unknowns. Consider the following set of
formulas with 3 equations and 2 unknowns:

a+b=6 (3.4)
2a + b = 10
3a + b = 12

There is no single solution that satisfies all three formulas in Equation 3.4. For example, the solution (4, 2)
works only for the first two formulas in Equation 3.4, and the solution (2, 6) works only for the last two
formulas. But there is a way to find a unique solution: Impose a statistical criterion that leads to an overi-
dentified or overdetermined set of equations with more formulas than unknowns. An example for
Equation 3.4 is the least squares criterion from regression analysis but with no constant (intercept) in the
prediction equation. Expressed in words:

Find values of a and b in Equation 3.4 that yield total scores such that the sum of squared
differences between the constants 6, 10, and 12 and these total scores is as small as possible.

Applying the criterion just stated to the estimation of a and b yields a solution that not only gives the
smallest possible difference (.67) but that also is unique, or (3, 3.33). This solution does not perfectly
reproduce the constants in Equation 3.4 (6, 10, 12): the total scores generated by the solution just stated
are 6.33, 9.33, and 12.33, but no other solution comes closer, given the least squares criterion applied
in this example.
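The same computation can be done in a few lines of code rather than with an online calculator. This Python sketch (an illustration added here, not part of the topic box) stacks the three formulas in Equation 3.4 into a coefficient matrix and lets a least squares routine find the unique solution:

    import numpy as np

    # Equation 3.4: three formulas, two unknowns (a, b)
    A = np.array([[1, 1],
                  [2, 1],
                  [3, 1]], dtype=float)   # coefficients of a and b
    c = np.array([6.0, 10.0, 12.0])       # the constants

    sol, ssr, *_ = np.linalg.lstsq(A, c, rcond=None)
    print(sol)      # [3.    3.333...]: the unique least squares solution
    print(ssr)      # [0.666...]: smallest possible sum of squared differences
    print(A @ sol)  # [6.333  9.333 12.333]: reproduced constants, close but not exact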
Definitions of underidentified, just-identified, and overidentified models in SEM are elaborated on in
later chapters, but some implications can be stated now: An unidentified model has at least one param-
eter that cannot be algebraically expressed as a unique function of the covariances, variances, or means
among the observed variables. Such models may have other parameters that are identified, and these
identified parameters could potentially be estimated with sample data. But a model with ≥ 1 underidenti-
fied parameter(s) cannot be analyzed as a whole unless it is respecified to reduce the excess number of
unknowns (parameters) or certain constraints are imposed by the researcher in the analysis; otherwise,
underidentified models are too complex to be analyzed with available data.
The unique solution for a just-identified model can perfectly reproduce the data, but that feat is not very impressive because model and data are equally complex. Any respecification of a just-identified model (within requirements about identification) will also perfectly explain the same data even though such respecifications often make opposing causal claims (i.e., the models are equivalent). Also, a just-identified model will perfectly fit any arbitrary sample data matrix for the same variables. For all these reasons, (1) just-identified models are not disconfirmable, and (2) such models usually have little scientific merit (MacCallum, 1995), but Pearl (2009) discussed some exceptions.
Overidentified models typically do not perfectly explain the data because such models are simpler than the data. The possibility for imperfect fit makes such models disconfirmable, and a question in the analysis is whether the degree of model–data discrepancy warrants rejecting the model (i.e., go to the respecification step). The principle of disconfirmability also implies a preference for models that are not highly parameterized, that is, models that do not have so many unknowns relative to the data that they can hardly "disagree" with those data (MacCallum, 1995). Box (1976, p. 792) put it like this: "Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity." Other aspects of parameter identification in SEM are elaborated in later chapters.


1. Evaluate model fit, which means determine how well the model explains the data. Perhaps more often than not, an initial model does not fit the data very well. Not if, but when this happens to you, skip the rest of this step and consider the question, "Can a respecification of the original model be justified, given relevant theory and results of prior empirical studies?"

2. Assuming the answer is "yes" and given satisfactory fit of the respecified model, then interpret the parameter estimates.

3. Next, consider equivalent or near-equivalent models. Recall that equivalent models explain the same data just as well as the researcher's preferred model but with a contradictory pattern of causal effects among the same variables. For a given model, there may be many—and in some cases infinitely many—equivalent versions. Thus, the researcher should explain why their favorite model should not be rejected in favor of equivalent ones. There may also be near-equivalent models that fit the same data nearly as well as the researcher's preferred model, but not exactly so. Near-equivalent models are often just as critical a validity threat as equivalent models, if not even more so.

When testing alternative models, it is not the model with the best relative fit to the data that would be automatically retained. This is because the best-fitting model among alternatives may not itself have acceptable fit to the data when viewed from a more absolute standard. Also, the researcher in this context should not be solely concerned with model fit. This is because the parameter estimates for a candidate model should also make theoretical sense. As noted by MacCallum (1995), a model that fits well but generates nonsensical parameter estimates is of little scientific value. This principle is also true in strictly confirmatory and model generation applications of SEM.

Respecification

A researcher usually arrives at this step because the fit of their initial model is poor. In the context of model generation, now is the time to refer to that backup list of theoretically acceptable changes I suggested when you specified the initial model. If there is no such list—or if the researcher has exhausted their backup list and yet the respecified model still has poor fit—it may be better (and more honest, too) to retain no model (Hayduk, 2014). This is because there are risks to respecifying models based more on improving fit to the data in a particular sample than on substantive considerations (MacCallum, 1995):

1. Data-driven respecification can so greatly capitalize on sampling error that any retained model is unlikely to replicate, especially when complex models are analyzed in small samples.

2. Estimates of some parameters in purely data-driven respecified models may have little or no substantive meaning.

3. Because the model is evaluated based on the same data used to modify it, any retained model should be validated with data from a new sample; that is, evidence for model validation should come from a different sample. Thus, a respecified model should not be treated as confirmed without fitting it to new data.

MacCallum (1995) noted that it would be appropriate for journal editors to reject manuscripts for SEM studies based on model generation if the concerns just listed are not addressed. This advice strikes me as reasonable in that the direct acknowledgment of study limitations is part of reporting, which is considered next.

Reporting

The last basic step is an accurate and complete description of the analysis in a written report. The fact that so many published articles that concern SEM are flawed in this regard was discussed earlier. These blatant shortcomings are surprising considering that there have been, for some time, published guidelines for reporting SEM results. For example, recent APA standards for SEM analyses (Appelbaum et al., 2018) described later in this chapter were based on earlier SEM standards for the journal Archives of Scientific Psychology by Hoyle and Isherwood (2013). McDonald and Ho (2002), Jackson et al. (2009), and Boomsma et al. (2012) discussed general principles for reporting results in SEM, and there are works on how to apply SEM in various disciplines, including travel and tourism, social and administrative pharmacy, and marketing, among others (respectively, Nunkoo et al., 2013; Schreiber, 2008; Richter et al., 2016).


OPTIONAL STEPS

Two optional steps could be added to the six basic ones just described (i.e., steps 1–6):

7. Replicate the results.

8. Apply the results.

The requirement for large samples in SEM complicates replication. This is because it may be difficult enough to collect a single sample large enough for SEM analyses, much less collect twice as many cases or so to also have a replication sample. Poor reporting of results in SEM studies may also contribute to the problem: If the sample, measures, model, and analysis are not all clearly described, replication efforts could be stymied. The fact that many, and perhaps most, claims in SEM studies are made with little evidence for their generalizability outside the original sample(s) should give us all pause, though.

A bright spot about replication in SEM concerns the evaluation of measurement invariance, or whether a set of indicators measures the same theoretical concepts over different populations, times, methods of administration, and so on. Measurement invariance can be estimated in CFA by testing whether a measurement model has similar fit over data collected from samples drawn from different populations, such as women and men, among other variations described later in this book (Chapter 22). There are now many CFA studies of measurement invariance (Dong & Dumas, 2020), and most involve replication over different samples collected by study authors.

There are fewer examples of replication over different samples collected by different researchers, or external replication. One is Tunca (2019), who conducted a preregistered replication of an earlier SEM study by Hollebeek et al. (2014). The original analysis involved a statistical model of customer brand engagement conceptualized as three dimensions—cognitive processing, brand-related affection, and behavioral interaction—presumed to affect self–brand connection and brand usage intent (Hollebeek et al., 2014, p. 157). Tunca (2019) fitted the same model to data from a new sample. Although some findings were consistent across the original and replication samples, evidence for the discriminant validity of the customer brand engagement dimensions was more problematic. Gender differences in consumer brand engagement were noted by Tunca (2019) as a possible explanation for differences in results compared with those reported by Hollebeek et al. (2014).

Issues concerning the external replication of SEM studies are summarized next; see Porte (2012) for more information about general issues in replication: Recall that unstandardized estimates of the same parameter should generally be compared over independent samples, not standardized results. How to simultaneously fit the same structural equation model to the data from multiple samples is described in Chapter 12. This process involves specifying equality constraints that should be imposed on unstandardized estimates for the same parameter over different samples as a way to test for group differences. This method (among others) is used in CFA studies of measurement invariance, and it could also be applied by a researcher who collects new data to test the same model from a prior study by a different researcher. There are also formal measures of the similarity of factor solutions over different samples, such as the Tucker coefficient of congruence, a standardized measure of proportionality in factor loadings over different groups, that can be applied in CFA—see Moss et al. (2015) for an example.
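The Tucker coefficient just mentioned is simple to compute from two sets of loadings for the same factor. Here is a minimal sketch in R, where the function name congruence and the loading values are made up purely for illustration, not taken from any study cited here:

# Tucker coefficient of congruence for loadings of the same
# factor estimated in two different samples
congruence <- function(l1, l2) {
  sum(l1 * l2) / sqrt(sum(l1^2) * sum(l2^2))
}
# Made-up loadings for four indicators in two samples
congruence(c(.70, .65, .80, .55), c(.68, .60, .77, .58))
# Values near 1.0 indicate highly similar (proportional) loadings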
There are hundreds, probably thousands, of studies where SEM has been applied to test theories (Thelwall & Wilson, 2016). One has to look much harder to find practical applications of SEM, but such studies exist. For example, the latest revision of the Stanford–Binet Intelligence Scales, Fifth Edition (Roid, 2003) was influenced by results of factor-analytic studies, including CFA findings, about its internal structure (DiStefano & Dombrowski, 2006). Rebueno et al. (2017) used the technique of path analysis to evaluate the attributes of a pre-graduate clinical training program in a randomly selected sample of nursing students. Verdam et al. (2017) described effect sizes that can be computed in SEM analyses on different types of response shifts in reported outcomes over time among patients in treatment. Effect sizes are derived from decomposing observed change into elements that include three different types of response shift—(1) change in patients' internal standards (recalibration), (2) change in patients' values about the importance of particular outcomes (reprioritization), and (3) change in the meaning of the outcome to patients (reconceptualization)—versus estimated true change in the underlying construct.


REPORTING STANDARDS

Reporting standards are envisioned as sets of best practices established by authority figures, such as experienced researchers, journal editors (who also tend to be experienced researchers), and methodologists, presented with a rationale that also allows for discretion in their application (Cooper, 2011). They are not intended as requirements, or mandatory practices that must be adopted by all, but they are also not merely recommendations, or endorsements of good reporting practices with no obligation to actually carry them out. Instead, the goal of reporting standards is to improve the quality of journal articles through clear, accurate, and transparent reporting, while simultaneously not interfering too much with researcher creativity and the normal workings of science (Appelbaum et al., 2018). The hope is that better reporting will help readers to more clearly understand the bases for claims made in articles and facilitate study replication.

A drawback to reporting standards is that there is not always consensus about exactly what to report in certain research areas. This is true in SEM, in which there is longstanding debate about the adequacy of various statistics about model fit, especially about the usefulness of significance tests of global model fit versus continuous measures of fit (Vernon & Eysenck, 2007). When experts in the field disagree, no reporting standards can resolve the issue. The approach taken in this book is to (1) address directly the areas of disagreement, beginning especially in Chapter 10 about model testing, and (2) offer what I believe are principled arguments for best practices, including reporting. I also believe that the existence of reporting standards for SEM helps to focus the conversation among both experienced practitioners and students.

The original APA journal article reporting standards (JARS) did not include SEM analyses (APA Publications and Communications Board Working Group on Journal Article Reporting Standards, 2008). Hoyle and Isherwood (2013) developed SEM reporting standards for the journal Archives of Scientific Psychology in the form of questions to be addressed by authors or reviewers. Their guidelines were reworded as statements, slightly edited for consistency, and then incorporated into the revised APA standards, JARS-Quant for quantitative studies (Appelbaum et al., 2018). Hundreds of reporting standards for health-related research are described on the Enhancing the Quality and Transparency of Health Research (EQUATOR) network, which offers a searchable database.2

Summarized in Table 3.1 are the APA JARS-Quant standards for SEM analyses. Authors should explain the theoretical bases for model specification, including directionalities of presumed effects. The analysis context—for example, whether an initial model will be respecified (model generation)—should also be stated. Describe the model in complete detail, including its parameters, associations between observed and latent variables, and whether means were analyzed along with covariances. Verify that the model is identified, and explicitly tabulate the model degrees of freedom (dfM). How to do so for different kinds of models is described in later chapters, but this basic reporting is problematic in too many studies: Cortina et al. (2017) reviewed nearly 800 models described in 75 published SEM studies. They found that the information needed to calculate dfM was reported about 75% of the time, but reported dfM was matched by the article information only 62% of the time. Shah and Goldstein (2006) found a similar problem: In 31 out of a total of 143 reviewed articles detailing SEM studies, the model described in the text failed to match the statistical results presented in tables or text. Such discrepancies raise questions about the model described versus the model actually analyzed.

State whether samples are empirical or generated, such as in a computer simulation study, and describe any loss of samples due to technical problems, such as inadmissible solutions (i.e., illogical estimates) or failure of iterative estimation to converge—see Table 3.1. Give the rationale for the sample size, including the parameters for a power analysis. Power analysis and other methods to specify target sample size in SEM are described in Chapter 10. Describe all measures in detail, including psychometrics, for scores from psychological tests analyzed in the researcher's sample.

Describe the data screening in complete detail, including (1) the extent, patterns, and assumptions about missing data; (2) the status of distributional assumptions; (3) whether any other problems, such as outliers, were detected; and (4) corrective actions taken, if any (Table 3.1). A strong demonstration of transparency occurs when both the syntax for the analysis and the data file are accessible to other researchers.

2 https://www.equator-network.org/



TABLE 3.1. Summary of JARS-Quant Reporting Standards for Structural Equation Modeling Analyses

Abstract
  If a model is retained, report values of at least two global fit statistics with brief comment on local fit, and state whether the initial model was respecified.

Introduction
  Justify directionality assumptions (e.g., X → Y instead of the reverse).
  State whether respecification is planned, if the initial model is rejected.

Specification
  Describe context (i.e., model generation, model comparison, strictly confirmatory).
  Give a full account of model specification, including latent variables and indicators (i.e., measurement model), status of parameters (i.e., free, fixed, or constrained), and how the model or analysis deals with score non-independence.
  Describe the mean structure, if means are analyzed.
  Justify error correlations in the model, if present.
  Verify that tested models are identified.
  Explicitly tabulate numbers of observations, free parameters, and dfM.

Methods
  State whether data were simulated or collected from actual units of study (i.e., cases).
  Describe measures (e.g., taken from a single vs. multiple questionnaires), specify whether they are items or scales (i.e., total scores), and report psychometrics.
  Outline how sample size was determined (e.g., power analysis, accuracy in parameter estimation, resource limitations); give details for power analysis (e.g., H0, H1, α, effect size).
  Indicate whether results or samples were discarded due to nonconvergence or inadmissible solutions.

Data screening and summary
  Describe data loss patterns (i.e., MCAR, MAR, MNAR) and corrective actions (e.g., single or multiple imputation, FIML).
  State distributional assumptions (e.g., multivariate normality); report evidence that assumptions were met or describe actions taken to address violations.
  Report sufficient summary statistics that allow secondary analysis (e.g., covariance matrix), or make the raw data file available.

Estimation
  State the computer tool (including version) and the estimation method used in the analysis, or make the syntax available.
  Describe default criteria that were changed to obtain a converged solution.
  Report whether the solution is admissible (e.g., no negative error variances) and describe corrective actions.

Model fit
  Interpret values of global fit statistics according to evidence-based criteria.
  Describe local fit, or the residuals (e.g., standardized, normalized, correlation).
  Explain the decision to retain or reject any model.
  State criteria for evaluating parameter estimates, including whether results are compared over groups.
  If comparing alternative models, state criteria for selecting a preferred model.

Respecification
  Indicate whether respecifications were a priori or post hoc (i.e., arrived at before or after examining the data).
  Give a theoretical rationale for any parameters fixed or freed in respecification.

Estimates
  Report unstandardized and standardized estimates with standard errors for all free parameters, if possible.
  Report standardized and unstandardized indirect effects with standard errors; outline the analysis strategy.
  Report estimates of interactions with standard errors; describe follow-up analyses.

Discussion
  Summarize changes to the initial model and their rationale, if a model is retained.
  Justify the preference for retained models over equivalent or near-equivalent models that explain the same data just as well or nearly so.

Note. JARS-Quant, journal article reporting standards for quantitative studies; MCAR, missing completely at random; MAR, missing at random; MNAR, missing not at random; FIML, full information maximum likelihood; dfM, model degrees of freedom. From Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board Task Force report. American Psychologist, 73(1), 3–25. https://doi.org/10.1037/amp0000191. Copyright © 2018 American Psychological Association. All rights reserved.


Access to original data is part of the open-science movement, in which all components of research, including data, methods, peer review, and publications, are freely available. Increasing numbers of journals encourage the sharing of the data, and some journals award badges to authors for open-science practices (Chambers, 2018).

A fact of life in SEM is that computer analyses can, and sometimes do, go wrong. How to deal with various technical problems in the analysis is addressed throughout the book, but the nature of such problems and the corrective steps taken to resolve them should be reported—see Table 3.1. Reassure readers that the solution is admissible, that is, that the estimates do not include anomalous results that would invalidate the solution.

Model fit should be described at two different levels, global and local. Global fit concerns the overall or average correspondence between the model and the sample data matrix. Just as averages do not indicate variability, it can and does happen in SEM that models with apparently satisfactory global fit can have problematic local fit, which is measured by residuals computed for every pair of observed variables in the model. Residuals are the differences between observed (sample) and predicted (by the model) covariances or correlations, and as absolute residuals increase, local fit becomes worse for those pairs of variables. Evidence for problematic local fit goes against retaining the model.

The analogy in regression analysis is the difference between R2, or overall explanatory power (global fit), and regression residuals, or differences between observed and predicted scores at the case level, not at the variable level as in SEM. Aberrant patterns of regression residuals, such as severely nonnormal distributions or heteroscedasticity, indicate a problem even if the value of R2 is reasonably high—see Fox (2020). Just as reports about regression analyses without description of the residuals are incomplete, so too are reports in SEM in which only global fit is described. This is a common shortcoming in many, if not most, published SEM studies. For example, only 17 out of 144, or 12%, of SEM studies published in organizational or management research journals reviewed by Zhang et al. (2021) contained information about local fit. The remedy is straightforward: Present a matrix of residuals in the article, or at least describe the pattern of residuals in the article and make the residuals available in supplementary materials. Without the details about local fit, readers cannot understand the whole picture about model–data correspondence.
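To make the distinction between global and local fit concrete, here is a minimal sketch in R using the lavaan package (which also appears in the reporting example later in this chapter) and its built-in Holzinger–Swineford data; the model is a textbook illustration only, not one analyzed in this chapter:

library(lavaan)
# Classic three-factor CFA model for the built-in example data
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9 '
fit <- cfa(model, data = HolzingerSwineford1939)

# Global fit: overall model-data correspondence
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))

# Local fit: correlation residuals for every pair of observed variables
resid(fit, type = "cor")$cov

# Standardized residuals plus summary statistics
lavResiduals(fit)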
Explain the basis for a decision to either reject or retain a model. That is, state the statistical criteria, degree of correspondence of empirical estimates with theoretical predictions, or other reasons for the decision (Table 3.1). If a model is retained, a common mistake is to report only the standardized solution. Although standardized estimates are easier to interpret, they are generally less useful for comparing results for the same model and variables fitted to data in a different sample, especially if samples, such as men versus women, differ in their variances (Grace & Bollen, 2005). Thus, also report the unstandardized solution with standard errors. If analyzing indirect effects or interaction effects, then describe the analysis strategy, including follow-up analyses. Another common mistake is the failure to acknowledge the existence of equivalent versions of any retained model with an explanation about why the researcher's version would be preferred.

As a reviewer of SEM studies submitted to research journals, I see many examples of deficient, poor reporting, so much so that over the years I developed a set of boilerplates, or standardized text, that I can adapt to a particular manuscript. Presented in Topic Box 3.2 are examples of some of my review templates. I hope that you can learn a few things about better, more complete reporting from these cautionary tales on inadequate reporting.

REPORTING EXAMPLE

Several colleagues and I wrote an SEM study with the (at the time) upcoming APA reporting standards for SEM in mind. The study, data, research aim, and model are introduced next. Do not worry at this point if you do not understand everything about the model or its specification, because we focus here on issues related to reporting. Analysis results for this example are described in Chapter 15, so you'll get to see this model again along with the data described here.

In a sample of 193 patients with first-episode psychosis (FEP) (schizophrenia, schizoaffective disorder) or recurrent multiple-episode psychosis (MEP), Sauvé et al. (2019) administered the CogState Schizophrenia Battery (CSB) (Pietrzak et al., 2009), a computerized test of cognitive ability intended to measure domains affected by psychosis (e.g., working memory, attention-vigilance, social reasoning), and the Scale to Assess Unawareness of Mental Disorder (SUMD) (Amador et al., 1993), an interview-based measure of the relative absence of understanding about one's own mental condition, also called lack of illness insight or anosognosia. Higher scores on the SUMD indicate greater anosognosia.


TOPIC BOX 3.2

Review Boilerplates for Common Problems in SEM Manuscripts


I have reviewed hundreds of manuscripts of SEM studies for about 50 different journals or publishers. After
seeing repetitions of similar problems in reporting, I wrote a set of more-or-less standardized comments
that could be adapted to the specifics of a particular work. These boilerplates cover incomplete reporting
about model fit, the interpretation of values of global fit statistics in ways that are not supported by recent
evidence, or the lack of a rationale for retaining a model (see Table 3.1), among other problems. Some of
these comments refer to global fit statistics that are defined in Chapter 10 on model testing and indexing,
but issues in their misuse are raised next:

1. Failure to describe local fit. It is inappropriate to base model evaluation solely on values of global
fit statistics without also considering local fit, or the residuals, which provide additional details
about model fit. It can and does happen that models can generate favorable values of global fit
statistics even though there is evidence for appreciable model–data discrepancy at the level of the
residuals. Recent reporting standards for SEM call on authors to describe both global and local
fit—see Appelbaum et al. (2018), Greiff and Heene (2017), and Vernon and Eysenck (2007) for
more information.
2. Failure to report unstandardized estimates. It is a mistake to omit the reporting of unstandardized parameter estimates, which are generally the basis for comparing results from the same model and variables over different samples—see Grace and Bollen (2005). Please also report the unstandardized estimates with their standard errors.
3. Incorrect model degrees of freedom. The reported value of dfM is impossible, given the variables and effects represented in the model diagram. I suspect something is wrong here, ranging from a simple typographical error to something more serious, including the analysis of a model that differs from the model described in the manuscript. Please explicitly tally the numbers of observations, freely estimated model parameters, and dfM. Next, resolve any discrepancies in the presentation.
4. Ignored failed significance tests for the whole model. The model failed the chi-square test, which signals covariance evidence against the model, but this outcome is discounted by the incorrect statement that the chi-square test is "biased" by sample size. There are two problems here: The sample size in this study is not large for SEM analyses, and the chi-square test is affected by sample size only when the model is incorrect. This problem is compounded by the failure to describe the residuals, which could indicate poor local fit for certain pairs of variables—see Hayduk (2014) and Ropovik (2015).
5. No rationale for the decision to retain a model. Why the respecified model is retained was not
explained. Instead, the author(s) report(s) values of global fit statistics with no interpretation. That
is, how do the author(s) get from these results in the text of the Results section, or

χ2(272) = 587.52, p < .01, CFI = .97, TLI = .946, RMSEA = .057

to the unjustified conclusion that the model has "acceptable" or "excellent" fit? There is no connection, argument, or logic that links the two. It is possible that the model does not actually fit the data at the level of the residuals but, as mentioned, the reader is told nothing about this area.


The main goal of Sauvé et al. (2019) was to estimate the association between level of general cognitive ability and symptom unawareness, where both concepts were modeled as latent variables. The context is model generation.

Presented in Figure 3.1 is the final model retained by Sauvé et al. (2019). A total of six tasks from the CSB are represented as indicators of a general cognitive ability factor. Each measured variable is represented in the figure with a rectangle, a standard graphical convention in SEM. The lines with single arrowheads that point from the cognitive ability factor to the indicators signal the assumption of reflective measurement, or the hypothesis that general ability as a theoretical variable affects task performance. The factor as a theoretical—and thus unmeasured—variable is represented in the figure with an oval, another oft-seen graphical symbol. Each indicator has an error term designated in the figure by a line with a single arrowhead oriented at a 45° angle that points to that indicator, and the error term represents the unique variation in each indicator, such as that due to measurement error, not explained by their common ability factor.

The errors for two pairs of indicators are connected by the symbol for a covariation, or curved lines with arrowheads at each end. These symbols represent correlated errors, or the hypothesis that the corresponding indicators share something unique to that pair. For instance, the International Shopping List Test is a list of items read aloud to examinees. Two scores from this task were analyzed, immediate versus delayed recall of list items (respectively, ISL and ISLR in Figure 3.1). Because they come from the same task, indicators ISL and ISLR may covary even after controlling for their common ability factor. The same could be true for the two scores from the Groton Maze Learning task, GML and GMR (see Figure 3.1).

The SUMD is represented in Figure 3.1 as the single indicator of a symptom unawareness factor. This specification acknowledges that (1) an observed variable is not identical to a theoretical concept and (2) scores on the SUMD are assumed to be affected by measurement error (i.e., the score reliability coefficient is < 1.0). The constant in the figure that appears next to the symbol for the error term of the SUMD, .360, is a value related to empirical reliability coefficients reported in the literature for this measure; how the particular value of ".360" was computed for the SUMD is explained in Chapter 15. The line with the single arrowhead that connects the factors in the figure represents the presumed causal effect of cognitive ability on symptom unawareness.

[Figure 3.1 path diagram: the six CSB indicators (ISL, ISLR, GML, GMR, OCL, CPAL) load on a Cognitive factor, which predicts a Symptom Unawareness factor with the single indicator SUMD; the constant .360 appears next to the SUMD error term, and scaling constants of 1 appear on one loading per factor.]

FIGURE 3.1. Final model of cognitive capacity and symptom unawareness. ISL, International Shopping List task; ISLR, International Shopping List task Delayed Recall; GML, Groton Maze Learning task; GMR, Groton Maze Learning task Delayed Recall; OCL, One-Card Learning task; CPAL, Continuous Paired Associate Learning task; SUMD, Scale to Assess Unawareness of Mental Disorder. From "Cognitive Capacity Similarly Predicts Insight into Symptoms in First- and Multiple-Episode Psychosis," by G. Sauvé et al., 2019, Schizophrenia Research, 206, p. 239. Copyright © 2019 Elsevier B.V. Adapted with permission.


Its value is estimated controlling for measurement error not only in the SUMD variable but also in all six indicators of the cognitive ability factor. The outcome variable in this analysis, the symptom unawareness factor, itself has an error term called a disturbance that represents variation not explained by cognitive ability—see Figure 3.1.

The features of reporting in Sauvé et al. (2019) that respected SEM reporting standards are listed next. Some of these details are reported in the main text; others are available in the supplementary materials for the article:

1. The methodology for preparing the data for analysis under the assumption of normality is described, including normalizing transformations applied to scores from individual cognitive tasks. Distributional characteristics are verified.

2. The rules by which the initial single-factor measurement model for cognitive capacity was respecified are stated. Briefly, each indicator was required to share at least 30% of its variance with the common factor, and it was predicted that error terms for two scores from the same task, such as immediate versus delayed recall of the same stimuli (e.g., ISL, ISLR in Figure 3.1), might covary.

3. The syntax, data, and complete output files for all SEM analyses in lavaan are available to readers. Thus, readers can reproduce the original analyses or fit models to the same data not considered by Sauvé et al. (2019).

4. The fact that all solutions are admissible is reported. Model fit is described at both the global and local levels. Full matrices of residuals are available, so readers have full access to complete information about fit.

5. Unstandardized estimates with standard errors are reported for all model parameters, and standardized estimates are reported, too.

6. An equivalent model is directly acknowledged.

Other aspects of reporting by Sauvé et al. (2019) are not ideal. For example, we did not preregister the analysis plan. The model degrees of freedom are not explicitly tabulated (they are dfM = 12 for Figure 3.1; one possible tally is sketched below), and the sample size was not determined based on a priori considerations, such as power analysis. The sample size is relatively small for SEM, N = 193, but the population base rate of psychotic disorders is relatively low, about 1–2% or so, so even this sample size is reasonably large among comparable studies of insight in psychotic disorders (e.g., Phahladira et al., 2019). Briefly, the results indicated that cognitive capacity explains about 6% of the variation in symptom unawareness and that, as expected, lower levels of cognitive ability predict less illness awareness. Follow-up regression analyses indicated that the relation between cognitive capacity and symptom unawareness did not vary appreciably over patients with first-episode versus multiple-episode psychosis.
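For interested readers, one plausible tally of dfM for Figure 3.1 follows. This is a hypothetical reconstruction that assumes the conventional scaling constraints described above, such as fixing one loading per factor to 1.0 and fixing the SUMD error variance through the .360 constant:

# Observations: 7 observed variables, covariances only (no means)
v <- 7
v * (v + 1) / 2                  # 28 observations
# Free parameters: 5 loadings + 6 error variances + 2 error covariances
# + 1 factor variance + 1 disturbance variance + 1 direct effect
free <- 5 + 6 + 2 + 1 + 1 + 1    # 16 free parameters
v * (v + 1) / 2 - free           # dfM = 12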

SUMMARY

Specification is the most important step in SEM because the quality of the ideas that underlie the model affects all subsequent steps, including the analysis phase. Models should be evaluated for identification, or for whether model parameters can be expressed as functions of the elements in a data matrix, in the study planning stage. This is because adding variables to the model is one way to rectify underidentified causal effects, but doing so may be difficult if the data are already collected. The context of model generation is the most common in SEM studies, and respecification should be guided by the same principles as the specification of the original model. This means that changes to an initial model require substantive justification; otherwise, a risk is that respecification will generate a model that fits the data in a particular sample but does not replicate. The state of reporting in too many SEM studies is deficient such that readers are not given enough information about model specification, respecification, or fit to the data to fully comprehend the findings. For example, both global fit and local fit, or the residuals, should be described in written reports. The existence of reporting standards for SEM analyses should help authors to write better, more complete summaries of SEM analyses. The next chapter covers preparation of the data.

LEARN MORE

Recommendations for reporting SEM results as part of JARS-Quant reporting standards by the APA are described in Appelbaum et al. (2018), the classic work by MacCallum (1995) about model specification offers helpful advice, and


Schreiber (2008) describes the information that should be reported in SEM analyses in clear, accessible terms.

Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board Task Force report. American Psychologist, 73(1), 3–25.

MacCallum, R. C. (1995). Model specification: Procedures, strategies, and related issues. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 16–36). Sage.

Schreiber, J. B. (2008). Core reporting practices in structural equation modeling. Research in Social and Administrative Pharmacy, 4(2), 83–97.



4

Data Preparation

Besides good ideas (hypotheses), effective SEM practice requires careful data preparation and screening.
Just as in other types of multivariate analyses, data preparation is critical in SEM for three reasons: First,
it is easy to make a mistake entering data into computer files. Second, some estimation methods in SEM
make specific distributional assumptions about the data. These assumptions must be taken seriously because
violation of them could result in bias. Third, data-related problems can make SEM computer tools fail to
yield a logical solution. A researcher who has not carefully screened the data could mistakenly believe
that the model is at fault, and confusion ensues. How to deal with missing data is a problem in many, if not
most, empirical studies. Classical (i.e., obsolete) methods for handling incomplete data sets are contrasted
with modern techniques such as multiple imputation, for which a tutorial is offered for readers who are less
familiar with the method. Other data screening issues addressed in this chapter include outliers, extreme
collinearity, and distributional characteristics. We begin with a review of basic options for inputting data to SEM computer tools.

FORMS OF THE INPUT DATA

Most primary researchers—those who conduct original studies—analyze raw data files, but sometimes the raw data themselves are not required. For example, when analyzing continuous outcomes with methods that assume normal distributions, a matrix of summary statistics can be input to an SEM computer tool instead of the raw data. In fact, you can replicate most of the analyses described in this book using the matrix summaries that accompany them. This is a great way to learn because you can make mistakes using someone's data before analyzing your own. Some journal articles about the results of SEM analyses contain enough information, such as correlations, standard deviations, and means, to create a matrix summary, which can then be submitted to a computer program for analysis. Thus, readers of these works can, with no access to the raw data, replicate the original analyses or estimate alternative models not considered in the original work (i.e., conduct a secondary analysis).

Most SEM computer tools accept either a raw data file or a matrix summary. If raw data are analyzed, the computer will create its own matrix, which is then analyzed. Consider the issues listed next when choosing between raw data and a matrix summary as program input:

1. Certain kinds of analyses require raw data files. One situation is when distributions for continuous outcomes are severely nonnormal and the data are analyzed with a method that assumes normality, but standard errors and model test statistics are computed that adjust for nonnormality. A second situation involves missing data. Both classical and modern techniques for missing data require the analysis of raw data files. The third case is when outcome variables are categorical, either unordered (nominal) or ordered (ordinal). Such outcomes can be analyzed in SEM, but raw data files are generally needed.

2. Matrix input offers a potential economy over raw data files. Suppose that 1,000 cases are measured on 10 continuous variables. The data file may be 1,000 lines (or more) in length, but a matrix summary for the same data might be only 10 lines long.

3. Sometimes a researcher might "make up" a data matrix using theory or results from a meta-analysis, so there were never raw data, only a matrix summary. Submitting a made-up data matrix to an SEM computer tool is a way to diagnose certain kinds of technical problems that can occur when analyzing a complex model. This idea is elaborated later in the book.

If means are not analyzed, there are two basic summaries of raw data for continuous variables—covariance matrices and correlation matrices with standard deviations. For instance, listed in the top part of Table 4.1 are scores on three continuous variables for five cases. Presented in the lower part of the table are the two summary matrices just mentioned in lower diagonal form, where only unique values are reported in the lower left-hand side of the matrix. Both the covariance matrix and the correlation matrix with standard deviations in the table encapsulate all covariance information in the raw data. Computer tools for SEM generally accept for input lower diagonal matrices as alternatives to full ones, with redundant entries above and below the diagonal, and can "assemble" a covariance matrix given just the correlations and standard deviations. Four-decimal accuracy is recommended for matrix input to minimize rounding error. Exercise 1 asks you to reproduce the covariance matrix from the correlations and standard deviations in Table 4.1.

TABLE 4.1. Raw Data and Matrix Summaries in Lower Diagonal Form

Raw scores (X, W, Y)
   3    65    24
   8    50    20
  10    40    22
  15    70    32
  19    75    27

Covariances
  38.5000
  42.5000   212.5000
  17.5000    51.2500    22.0000

Correlations, standard deviations
   1.0000
    .4699    1.0000
    .6013     .7496    1.0000
  6.2048   14.5774    4.6904

Note. MX = 11.0000, MW = 60.0000, MY = 25.0000.

It may be problematic to submit just a correlation matrix with no standard deviations for analysis, to specify that all standard deviations equal 1.0 (which standardizes the variables), or to convert raw scores to normal deviates (z scores) and then submit the data file of standardized scores. This is because most SEM estimation methods assume the analysis of unstandardized variables. If the variables are standardized, then the results can be incorrect, including wrong values for standard errors or model test statistics. Some SEM computer tools issue warning messages or terminate the run if the researcher submits just a correlation matrix for analysis. There are special methods for analyzing correlation matrices without standard deviations, but they are not available in all software. Thus, it is generally safer to analyze a covariance matrix (or a correlation matrix with standard deviations). The pitfalls of analyzing correlation versus covariance matrices are the reason why you must state in written reports the specific kind of data matrix analyzed and the estimation method used.

Matrix summaries of raw data must consist of covariances and means whenever means are analyzed. For example, either the correlation matrix with standard deviations or the covariance matrix in Table 4.1 would be submitted with an extra row for the means of all three variables, if means are analyzed. Even if your analyses do not concern means, you should nevertheless report the means of all continuous variables. You may not be interested in analyzing means, but someone else might be. Always report sufficient descriptive statistics (including means) so others can reproduce your results. If the analysis requires raw data, make those data available to readers, such as including the data file in the supplemental materials for a journal article. Some journals host data files for articles in databases that are accessible to all readers.
errors or model test statistics. Some SEM computer Some journals host data files for articles in databases
tools issue warning messages or terminate the run if the that are accessible to all readers.


POSITIVE DEFINITENESS

The data matrix that you submit—or the one calculated by the computer from your raw data—should be positive definite (PD), which is required for most estimation methods. A data matrix that lacks this characteristic is nonpositive definite (NPD); therefore, attempts to analyze such a matrix would probably fail. A PD data matrix has the properties summarized next:

1. The matrix is nonsingular, which means that it has an inverse. A matrix with no inverse is singular.

2. All eigenvalues for the matrix are positive (> 0), which says that the matrix determinant is also positive.

3. There are no out-of-bounds correlations or covariances.

In most kinds of multivariate analyses (SEM included), the computer needs to derive the inverse of the data matrix as part of its linear algebra operations. If the matrix is singular, these operations fail. If v equals the number of observed variables, the computer should also be able to generate v linear combinations of the variables that (1) are pairwise uncorrelated (orthogonal) and (2) reproduce all the covariance information in the original data matrix. The v weighted combinations are called eigenvectors, and the amount of variance explained in the original data matrix by each eigenvector is its corresponding eigenvalue.1 It is impossible to derive more eigenvectors than the number of observed variables because no information remains once all v eigenvectors are extracted from the data matrix.

If all v eigenvalues are > 0, the matrix determinant will be positive, too. The determinant is the serial product (the first times the second times the third, and so on) of the eigenvalues. If all eigenvalues are positive, the determinant is a kind of matrix variance, or the volume of the multivariate space "mapped" by the observed variables.2 If any eigenvalue equals zero, then (1) the data matrix has no inverse (it is singular), (2) the determinant equals zero, and (3) there is some pattern of perfect collinearity that involves two variables (e.g., rXY = 1.0) or ≥ 3 variables in a more complex pattern (e.g., RY•XW = 1.0). Perfect collinearity means that the denominators of some calculations will be zero, which results in "illegal" (undefined) fractions in computer analysis (estimation fails). Near-perfect collinearity, such as rXY = .97, manifested as near-zero eigenvalues or determinants, can cause the same problem.

Negative eigenvalues (< 0) may indicate a data matrix element—a correlation or covariance—that is out of bounds. It is mathematically impossible for such an element to occur if all elements were calculated from the same cases with no missing data. For example, the value of rXY, the Pearson correlation between two continuous variables, is limited by the correlations between these two variables and a third variable W. It must fall within the range defined next:

(rXW × rWY) ± √[(1 − rXW²)(1 − rWY²)]    (4.1)

For example, given a value of rXW = .60 and of rWY = .40, the value of rXY must fall within the range

.24 ± .73, or from –.49 to .97

Any other value for rXY would be out of bounds. Equation 4.1 specifies a kind of triangle inequality for values of correlations among three variables. In a geometric triangle, the length of a given side must be less than the sum of the lengths of the other two sides but greater than the difference between the lengths of the two sides.

In a PD data matrix, the maximum absolute value of covXY, the covariance between two continuous variables X and Y, must respect the limits defined next:

max |covXY| ≤ √(sX² × sY²)    (4.2)

where sX² and sY² are, respectively, the sample variances. In other words, the maximum absolute value for the covariance between two variables is less than or equal to the square root of the product of their variances. For example, given

covXY = 13.00, sX² = 12.00, and sY² = 10.00

the covariance between X and Y is out of bounds because

13.00 > √(12.00 × 10.00) = 10.95

which violates Equation 4.2. The value of rXY is also out of bounds because it equals 1.19, given these variances and covariance. Exercise 2 asks you to verify this fact.

1 In principal components analysis, a type of exploratory factor analysis where the factors are linear combinations of observed variables (principal components), eigenvectors correspond to principal components and eigenvalues are the variances of those components.

2 There is a good illustration at https://en.wikipedia.org/wiki/Determinant
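These bounds are easy to check numerically. Here is a minimal sketch in R that verifies the two examples just given (the function name r_bounds is a hypothetical illustration):

# Admissible range for rXY given rXW and rWY (Equation 4.1)
r_bounds <- function(r_xw, r_wy) {
  center <- r_xw * r_wy
  half <- sqrt((1 - r_xw^2) * (1 - r_wy^2))
  c(lower = center - half, upper = center + half)
}
r_bounds(.60, .40)      # -.49 to .97, as in the text

# Maximum absolute covariance given the variances (Equation 4.2)
max_cov <- sqrt(12.00 * 10.00)
max_cov                 # 10.95
13.00 > max_cov         # TRUE, so covXY = 13.00 is out of bounds
13.00 / max_cov         # implied rXY = 1.19, an impossible correlation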
Because the matrix determinant is the serial product of the eigenvalues, the determinant will be negative if some odd number of eigenvalues (1 or 3 or 5, etc.) is negative. A matrix with a negative determinant may have an inverse, but the whole matrix is nevertheless NPD, perhaps due to out-of-bounds correlations or covariances. See Topic Box 4.1 for more information about causes of NPD data matrices and possible solutions.

Before analyzing either raw data or a matrix summary, the original data file should be screened for the problems considered next. Some of these difficulties are causes of NPD data matrices, and most are also concerns in data screening for other kinds of multivariate statistical analyses—see Tabachnick and Fidell (2019). If scores from psychological tests are to be analyzed, the reliabilities of those scores should be estimated in the researcher's samples—see the Psychometrics Primer on this book's website.

MISSING DATA

The topic of how to deal with missing observations is complex, and entire books are devoted to it (Enders, 2010; Graham, 2012; Rubin & Little, 2020). There are also articles or chapters about dealing with incomplete data in SEM (Enders, 2013, 2023; Jia & Wu, 2019), which is fortunate because it is not possible here to give a comprehensive account. The goal instead is to describe obsolete versus modern options and explain their relevance in SEM.

Ideally, researchers would always work with complete data sets; otherwise, prevention is the best strategy. For example, questionnaire items that are clear and unambiguous may prevent missing responses, and completed forms should be reviewed for missing responses before participants submit a computer-administered survey. Little et al. (2012) offered suggestions for reducing missing data in clinical trials, including routine follow-up after treatment discontinuation, allowing for flexible treatment that accommodates side effects or differences in efficacy, offers of monetary value or other incentives for completeness of records, and targeting an underserved population, which provides an incentive to remain in the study.

In the real world, missing values occur in many, if not most, data sets, despite best efforts at prevention. A few missing values, such as < 5% in the total data set, may be of little concern. This is because selection among alternative methods to deal with missing data tends to make little difference when rates of missing data are low (Schafer, 1999). Higher rates of data loss present greater challenges, especially if the data loss mechanism is not random (or at least predictable). There is no critical rate of missing data that signals a severe problem, but as that rate exceeds about 10% or so, there is an increasing likelihood of bias. In this case, the choice of method can affect the results. Best practices involve the steps listed next (Johnson & Young, 2011; Little et al., 2012); a small sketch of step 1 follows the list:

1. Report the extent of missing data with a tabular summary or diagram, such as a participant flow. Describe procedures used to prevent or mitigate data loss.

2. Diagnose missing data patterns, which are related to assumptions about data loss mechanisms. Note that different variables in the same data set can be affected by different data loss mechanisms.

3. Use a modern, principled statistical method for analyzing incomplete data, one that takes full advantage of the structure in the data and does not rely on implausible assumptions.

4. Acknowledge the reality that certain assumptions about data loss are untestable unless new data are collected.

5. Conduct a sensitivity analysis, where the extant incomplete data are reanalyzed using a different method. If the results differ appreciably from the original findings, they are not considered robust compared to alternative assumptions about missing data.
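Regarding step 1, here is a minimal sketch in base R of summarizing the extent of missingness, using R's built-in airquality data (which has incomplete records) purely for illustration:

# Percentage of missing values per variable
round(100 * colMeans(is.na(airquality)), 1)
# Proportion of cases with at least one missing value
mean(!complete.cases(airquality))
# For a tabular summary of joint missing data patterns, one option is
# md.pattern() in the mice package: mice::md.pattern(airquality)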
Data Loss Mechanisms and Missingness Graphs

Rubin (1976) described three basic data loss mechanisms. The least troublesome in terms of bias—and perhaps also the most unlikely in real data sets—is missing completely at random (MCAR). This means that missingness (i.e., missing yes or no) on a variable is unrelated to (1) all other variables in the data set and (2) the variable itself with no missing observations. The second point just stated means that the propensity for an observation to be missing does not depend on a case's true status on that variable.


TOPIC BOX 4.1

Causes of Nonpositive Definite Data Matrices and Solutions


The causes of NPD data matrices described by Wothke (1993) are listed next. Some causes can be
detected through data screening:

1. Extreme bivariate or multivariate collinearity among observed variables.


2. The presence of outliers that force the values of correlations to be extremely high.
3. Sample correlations or covariances that barely fall within the limits defined by Equations 4.1 and
4.2 but nevertheless cause the analysis to fail with a warning or error message about an NPD
matrix.
4. Pairwise deletion of missing data.
5. Making a typing mistake when transcribing a data matrix from one source, such as a table in
a journal article, to another, such as a syntax file for computer analysis, can result in an NPD
data matrix. For example, if the value of a covariance in the original matrix is 15.00, then typing
150.00 in the transcribed matrix could generate an NPD matrix.
6. Plain old sampling error can generate NPD data matrices, especially in small or unrepresentative
samples.
7. Sometimes matrices of estimated Pearson correlations, such as polyserial or polychoric correla-
tions derived for observed variables that are not continuous, can be NPD.

Here are some tips for diagnosing whether a data matrix is PD before submitting it for analysis to an
SEM computer tool: Copy the full matrix (with redundant entries above and below the diagonal) into a text
(ASCII) editor, such as Microsoft Windows Notepad. Next, point your Internet browser to a free, online
matrix calculator (many are available) and then copy the data matrix into the proper window on the cal-
culating page. Finally, select options on the calculating page to derive the eigenvalues, eigenvectors, and
determinant. Look for results that indicate an NPD matrix, such as near-zero, zero, or negative eigenvalues.
A useful matrix and vector calculator is available at https://www.symbolab.com/solver/matrix-calculator.
An alternative is to analyze the matrix in R using its native matrix algebra functions (i.e., no special pack-
age is needed). An example follows.
Suppose that the covariances among variables X, W, and Y, respectively, are

  1.00    .30    .65
   .30   2.00   1.15
   .65   1.15    .90

The R syntax that turns scientific notation off, defines and displays the covariance matrix just listed, and
generates the eigenvalues and eigenvectors, the determinant, and the inverse (if any) is given here. The last
command converts the covariance matrix to a correlation matrix:

options(scipen = 999)         # turn off scientific notation
# Define the covariance matrix for X, W, and Y
a <- matrix(c(1,   .3,  .65,
              .3,  2,  1.15,
              .65, 1.15, .9),
            nrow = 3, byrow = TRUE)
a                             # display the matrix
eigen(a)                      # eigenvalues and eigenvectors
det(a)                        # determinant
solve(a)                      # inverse (fails if the matrix is singular)
cov2cor(a)                    # convert covariances to correlations

The eigenvalues generated using the R code just listed for the covariance matrix are

(2.918, .982, 0)

The third eigenvalue is zero, so the covariance matrix has no inverse and the determinant equals 0. Let us
inspect the weights for the third eigenvector, which for X, W, and Y, respectively, are

(–.408, –.408, .816)

Some online matrix calculators report the eigenvector weights as

(–1, –1, 2) or (–.5, –.5, 1)

but the values just listed are proportional to the weights computed in R. None of these weights equals zero,
so all three variables are involved in perfect collinearity. The pattern for these data is

RY•XW = RW•XY = RX•YW = 1.0

To verify this pattern, you should calculate the multiple correlations just listed from the bivariate correlations
for X, W, and Y, respectively, computed in R and presented next in lower diagonal form—see the Regres-
sion Primer on this book’s website for the equations:

1.0
.2121 1.0
.6852 .8572 1.0
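For instance, here is a quick check in R of the first multiple correlation, using the standard two-predictor formula and the correlations just listed (rounded to four decimals):

# R-squared for Y predicted from X and W, from the bivariate correlations
r_xw <- .2121; r_xy <- .6852; r_wy <- .8572
(r_xy^2 + r_wy^2 - 2 * r_xy * r_wy * r_xw) / (1 - r_xw^2)
# about 1.0, so RY.XW = 1.0 (perfect collinearity)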

The LISREL program offers an option for ridge adjustment, which multiplies the diagonal elements in a covariance matrix by a constant > 1.0 until negative eigenvalues disappear (the matrix becomes PD). These adjustments increase the variances until they are large enough to exceed any out-of-bounds covariance element in the off-diagonal part of the matrix. This technique "fixes up" a data matrix so that the necessary algebraic operations can be performed (Wothke, 1993), but parameter estimates, standard errors, and fit statistics are biased after ridge adjustment. A better solution is to try to solve the problem of nonpositive definiteness through data screening. There are other contexts where you may encounter NPD matrices in SEM, but these generally concern (1) matrices of parameter estimates for your model or (2) matrices of correlations or covariances predicted from your model. A problem is indicated if any of these matrices is NPD. We will deal with these contexts in later chapters.


Thus, there is no systematic process anywhere that would make some data more likely to be missing than other data. If so, then the observed incomplete data are just a random sample of the scores that the researcher would have analyzed if the data were complete (Enders, 2010). Results based on the complete cases only should not be biased, although the power of significance tests may be reduced due to a smaller effective sample size.

Data are missing at random (MAR) when missingness is (1) correlated with other variables in the data set, but (2) does not depend on the level of the variable itself with no missing observations. This means that data are actually missing conditionally at random, or after controlling for values of other measured variables. Suppose that men are less likely to report on their health status in a particular area than women, but there is no real difference in health between men with no missing data versus men who opted not to disclose. After controlling for gender, the data loss pattern is random. The dependence of missingness solely on other observed variables explains why Mohan and Pearl (2021) used the term v-MAR, where "v" stands for "observed variables." Information lost due to an MAR process is potentially recoverable through imputation, in which missing scores are replaced by values predicted from other variables in the data set. Options for imputation are described momentarily, but statistical methods suitable for the MAR data loss pattern can also be applied when the pattern is MCAR. The reverse is not true, though: Methods that assume MCAR are not generally appropriate when the missing data mechanism is MAR.

The third pattern is missing not at random (MNAR), also known as nonignorable missing data, where the probability of missingness is related to the true level of the variable itself even after controlling for other variables in the data set. That is, there is a structure to the missing data, and it cannot be treated as though it were random. Unlike the MAR pattern, which is a recoverable process, the MNAR pattern is not, because it is latent (not directly measured) (Little, 2013). An example of an MNAR missing data pattern is when men who fail to answer questions about their health status are more likely to be ill than men with no missing responses on the same variable. Another is when respondents with either very low or very high incomes are less likely to answer questions about their income. Thus, the challenge presented by the MNAR pattern is that the missing values depend on information that is not directly available in the analysis. Thus, (1) results based on the complete cases only can be severely biased when the data loss pattern is MNAR, and (2) the choice of methods to deal with the missing data can make a big difference in the results. Some of this bias may be reduced if other measured variables happen to covary with unmeasured causes of data loss, but whether this is true in a particular sample is typically unknown.

Mohan and Pearl (2021) described missingness graphs or m-graphs that visually represent the data loss mechanisms just outlined. An m-graph is a special type of directed acyclic graph (DAG) in which unidirectional causal effects are represented with arrows that point from presumed causes to their outcomes. Presented in Figure 4.1(a) is a causal DAG for variables with no missing data, where Xo and Wo are specified as uncorrelated causes of Yo. Figures 4.1(b)–4.1(d) are m-graphs in which Ym is an incomplete variable with missing scores for some, but not all, cases. Variable R in the m-graphs is the missingness mechanism that determines whether a particular score is missing or not missing on the observed variable Y*, which is a proxy for Ym. The relation among Yo, R, and Y* can be summarized as follows:

Y* = Yo if R = 0, and Y* = m if R = 1    (4.3)

where m means "missing." That is, when R = 0, the true value for Y is observed, but when R = 1 that value is hidden (not observed).

In Figure 4.1(b), the data loss mechanism is MCAR because missingness occurs with no relation to any variable, whether it is completely measured (Xo, Wo) or partially measured (Ym). The independencies just stated are represented in Figure 4.1(b) by the absence of arrows that point to R from all other variables. The MAR pattern is depicted in Figure 4.1(c), where missingness is specified as being caused by the fully observed variable Wo but not by the incomplete variable Ym. For example, if variable Wo in the figure is gender, then the whole m-graph says that although the likelihood of missing data varies by gender, there is no systematic difference between responders and nonresponders on the partially measured variable after controlling for gender. The m-graph in Figure 4.1(d) represents an MNAR pattern, where missingness is caused by the partially measured variable itself (i.e., Ym → R), and controlling for any fully measured variable, Xo or Wo, can neither break nor disrupt this association.
Thus, the challenge presented by the MNAR pattern is variable itself (i.e., Ym → R), and controlling for any

[Figure 4.1 appears here: four panel diagrams labeled (a) Complete, (b) MCAR, (c) MAR, and (d) MNAR.]

FIGURE 4.1. No missing data (a). Missingness graphs (m-graphs) under conditions of missing completely at random (MCAR) (b), missing at random (MAR) (c), and missing not at random (MNAR) (d). Variables Xo, Wo, and Yo are observed in all cases; Ym has missing data for some cases; Y* is the proxy variable that is actually measured; and R is the missingness mechanism. If R = 0, then Y* = Yo, but if R = 1, then Y* = missing. From "Graphical Models for Processing Missing Data," by K. Mohan and J. Pearl (2021), Journal of the American Statistical Association, 116(534), p. 1025. Copyright © 2021 by Taylor & Francis. Adapted with permission.

Diagnosing Missing Data

It is not easy in practice to determine whether the data loss mechanism is random or systematic, especially when each variable is measured once. One reason is that all three data loss mechanisms—MCAR, MAR, and MNAR—can be involved in causing case attrition or nonresponse, and their relative influence can change over variables in the same data set. Another reason is that while there is a way to determine whether the assumption of MCAR is reasonable, there is no specific test that provides direct evidence of either MAR or MNAR if the hypothesis of MCAR is rejected. The only way to distinguish MAR and MNAR data loss mechanisms is to measure the missing data. For example, if survey nonrespondents are later followed up by phone to get the missing information, then respondents and nonrespondents in the first round can now be compared. If these two groups differ appreciably on the recovered data, there is evidence for MNAR; otherwise, the hypothesis of MAR may be viable. Whenever the recovery of missing data is impractical, analysis of the original incomplete data is the only option.

Little (1988) described a multivariate significance test for the hypothesis of MCAR. It compares cases with observed versus missing observations on all other variables in the data set while controlling for their intercorrelations. In the bivariate case, where missing data are confined to a single variable and the other variable is continuous, the Little MCAR Test reduces to an independent-samples t test. Assuming normal distributions and homoscedasticity, the test statistic for the Little MCAR Test is distributed over large samples as a central chi-square statistic. A significant result (e.g., p < .05 when testing at the .05 level) means that the null hypothesis of MCAR is rejected; that is, the data loss pattern is either MAR or MNAR. A problem with the test is that its power can be low in small samples (i.e., the null hypothesis of MCAR is retained too often), and it can flag trivial departures from MCAR as significant in large samples.

Estimating effect sizes in comparisons of complete versus incomplete cases over other variables can help the researcher to interpret results from the Little MCAR Test. One tactic involves creating a binary (dummy) variable, where a score of "1" indicates a missing observation and a score of "0" means not missing (complete). Next, the means for the two groups just defined are compared on other variables that are continuous. Magnitudes of univariate group mean differences can be estimated using either standardized mean differences, d, or point-biserial correlations, r_pb (Kline, 2013a, chap. 5). In large samples, a significant Little MCAR Test result when magnitudes of group differences on other variables are considered trivial would bolster the hypothesis of MCAR.
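One way to carry out the dummy-variable tactic just described is sketched below in R, assuming a data frame df (hypothetical) in which Y has missing scores and W is a continuous covariate:

```r
# Compare complete versus incomplete cases on another variable:
miss_Y <- as.numeric(is.na(df$Y))        # 1 = missing, 0 = complete
t.test(df$W ~ miss_Y)                    # group mean difference on W
cor(df$W, miss_Y, use = "complete.obs")  # point-biserial correlation
```

Appreciable group differences on W would argue against MCAR. The Little MCAR Test itself is implemented in several R packages, such as mcar_test() in naniar.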

The opposite pattern in a small sample—a nonsignificant Little MCAR Test when magnitudes of group differences on other variables are meaningful—would provide evidence against the hypothesis of MCAR.

Observing differences between complete versus incomplete cases on other variables is helpful for a method that imputes scores for the incomplete cases when predicting from those other variables, assuming the data loss pattern is MAR (Figure 4.1(c)). A strategy that anticipates this pattern is to measure auxiliary variables that may not be of theoretical interest except they are expected to be potential causes or correlates of missingness on other variables. For example, family socioeconomic status (SES) could be a potential auxiliary variable in a longitudinal study, if it is expected that lower SES families are more likely to drop out of the study. Including family SES in the analysis may help to reduce bias due to differential attrition related to this variable. An auxiliary variable could also be one that simply covaries appreciably with variables that have missing observations or with their causes, whether or not they are related to the missingness mechanism. Suppose that information on family SES is not available, but addresses for places of residence are measured. If place of residence, or neighborhood, is related to SES, then including residence in the analysis may recover some of the information about SES as a missing data mechanism, thus reducing bias related to SES.

Auxiliary variables require care in their selection. For example, including too many such variables in small samples can decrease precision and create downward bias in estimates of regression coefficients, especially if absolute correlations between auxiliary and other variables are < .10 (Hardt et al., 2012). Enders (2010) recommended that auxiliary variables should have absolute correlations of about .40 with incomplete variables, although that particular value is not a golden rule. Ideally, there should be no missing observations on auxiliary variables; otherwise, their potential role in recovering information due to data loss on other variables is diminished (Dong & Peng, 2013). Thoemmes and Rose (2014) described situations where auxiliary variables might actually increase bias. One case is when the auxiliary variable is an outcome of a partially measured variable and where controlling for the auxiliary variable induces a spurious association between the missing mechanism and the partially measured variable, although Rubin (2009) speculated that the pattern just described would be rare in real data sets.

CLASSICAL (OBSOLETE) METHODS FOR INCOMPLETE DATA

Classical techniques for handling missing data are simple to understand and easy to implement in statistical software, but they are increasingly seen as obsolete because such methods

1. Assume that the data loss mechanism is MCAR, and results with these methods can be seriously biased when this rather improbable assumption does not hold.

2. Take little or no advantage of structure in the data when investigating missing data patterns.

3. Basically ignore or minimize the problem by allowing an incomplete data file to be produced.

Classical methods are briefly described next so that readers can better understand their limitations, but I can't recommend their use. Now, to be fair, if the rate of missing data is very low (e.g., 1%), then it doesn't really matter whether classical or more modern techniques are used because the results will be similar. But as the rate of data loss increases, the trustworthiness of classical methods decreases, especially if the data loss pattern is not MCAR.

There are two broad categories of classical techniques: available case methods, which analyze available data through removal of incomplete cases from the analysis, and single-imputation methods, which replace each missing score with a single calculated (imputed) score. Available case methods include listwise deletion in which cases with missing scores on any variable are excluded from all analyses. The effective sample size with listwise deletion includes only cases with complete records, and this number can be much smaller than the original sample size if missing observations are scattered across many records. An advantage is that all analyses are conducted with the same cases. This is not so with pairwise deletion in which cases are excluded only if they have missing values on variables involved in a particular analysis. Suppose that N = 300 for an incomplete data set. If 250 cases have no missing scores on variables X and Y, then the effective sample size for cov_XY is this number. If fewer or more cases have valid scores on X and W, however, the effective sample size for cov_XW will not be 250.
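The difference between the two available case methods is easy to see in R. In the sketch below, df is a hypothetical incomplete data frame with variables X, W, and Y:

```r
# Effective sample sizes under available case methods:
sum(complete.cases(df))                  # listwise N (complete records only)
sum(complete.cases(df[, c("X", "Y")]))   # pairwise N for cov(X, Y)

cov(df, use = "complete.obs")            # listwise-deletion covariance matrix
cov(df, use = "pairwise.complete.obs")   # pairwise; N varies by entry
```

Because each covariance in the pairwise matrix can be based on a different subset of cases, the matrix as a whole may be NPD, which Exercise 3 demonstrates.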

This property of pairwise deletion can give rise to NPD data matrices, which is demonstrated in Exercise 3.

Mean substitution, where the overall sample (grand) mean replaces any missing score on that variable, is the simplest single-imputation method. A variation is group-mean substitution, where the missing score is replaced by a group mean on that variable, such as the mean for men when the record for a male participant is incomplete. Neither method takes account of the information about the individual case except group membership for group-mean substitution. Another problem is that mean substitution distorts the score distribution by reducing variability. This happens because the imputed scores equal the mean, so they contribute nothing to the sum of squares (numerator) of the variance while increasing the degrees of freedom (denominator); that is, variance decreases after imputation. Consequently, covariances are weakened and error variances can be underestimated (Liu, 2016).

Regression substitution is a bit more sophisticated because each missing score is replaced by a predicted score from regression analyses based on variables with no missing data. The method uses more information than mean substitution, but it assumes that incomplete variables can be predicted reasonably well from other variables in the data set. Error variance is still underestimated in this method because (1) the imputed score for cases having the same values on the predictors is a constant, and (2) sampling error affects predicted scores, too, and this uncertainty is not estimated in single-imputation methods. A variation is to add a randomly sampled error term from the normal distribution or other user-specified distribution in stochastic regression imputation.

Other single-imputation methods include

1. Last observation carried forward (LOCF), where in clinical trials the last observation is the most recent score for participants who drop out of the study.

2. Pattern matching, in which the computer replaces a missing observation from a case with the most similar profile on other variables.

3. Random hot-deck imputation (RHDI), which separates complete from incomplete cases; sorts both sets so that cases with similar profiles on background variables are grouped together; randomly interleaves the incomplete and complete cases; and replaces missing scores with those on the same variable from the nearest complete record.

All the methods just listed have limitations: Pattern matching and RHDI require large samples. The LOCF method assumes that patients in treatment generally improve and that the last measurement before dropout is a conservative estimate of eventual outcome. But if patients drop out because they are becoming more ill despite treatment or even die, the LOCF approach can grossly overestimate treatment efficacy (Liu, 2016).

MODERN METHODS FOR INCOMPLETE DATA

Modern techniques for handling missing data include multiple imputation (MI), which generates multiple predicted scores for each missing observation, and a special version of full information maximum likelihood (FIML) estimation for incomplete data files that neither imputes missing observations nor deletes cases. Both methods assume that the data loss mechanism is MAR, which is less strict than the assumption of MCAR by classical techniques. Depending on the SEM computer tool or procedure, one or both of the modern methods just mentioned will be available. How the special FIML option deals with missing data is described in Chapter 9.

The technique of MI uses variables of theoretical interest and optional auxiliary variables to generate for each missing observation a set of k ≥ 2 imputed scores from predictive distributions that model the data loss mechanism. The result is a total of k data sets, each with a unique imputed value for any missing observation. Next, all k imputed data sets are analyzed with standard statistical techniques, such as fitting the same structural equation model to each generated data set. Finally, the resulting k different estimates of each model parameter are pooled, and the corresponding standard errors reflect sampling error due both to case selection and imputation; these standard errors are typically larger than those from single imputation. The three basic steps of MI just summarized are described in more detail in Appendix 4.A. Readers already very familiar with MI can skip this appendix; otherwise, the presentation there is somewhat technical, but it is worth the effort to learn more about this modern and statistically principled approach to dealing with incomplete data files under the assumption of MAR data loss patterns. The technique has widespread application in SEM and in many other kinds of statistical analyses, too.
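The three MI steps can be run in a few lines in R with the mice package; the sketch below uses an arbitrary regression as the analysis model and a hypothetical data frame df:

```r
# Multiple imputation in three steps with mice:
library(mice)
imp  <- mice(df, m = 20, seed = 123)  # 1. imputation: k = 20 data sets
fits <- with(imp, lm(Y ~ X + W))      # 2. analysis: same model per data set
summary(pool(fits))                   # 3. pooling: combined estimates and SEs
```

Comparable facilities for fitting structural equation models to multiply imputed data sets are available in several SEM computer tools.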

OTHER DATA SCREENING ISSUES

Considered next are ways to deal with extreme collinearity, outliers, violation of distributional assumptions, and ill-scaled data matrices.

Extreme Collinearity

Extreme collinearity—also called extreme multicollinearity—refers to high levels of interdependence among predictors of the same outcome. It can lead to inflation of standard errors, instability in the results such that small changes in sample covariance patterns can generate very different solutions, or analysis failure due to linear dependence. Extreme bivariate collinearity can occur because what appear to be separate variables really measure the same thing. Suppose that X measures accuracy and W measures speed for the same task. If r_XW = .97, for example, then X and W are redundant notwithstanding their different labels (i.e., accuracy is speed, and vice versa). Either one could be included in the same regression equation, but not both. Although there is no gold standard, it seems to me that absolute bivariate correlations > .90 signal a potential problem, but others have suggested even lower thresholds, such as .80 (Abu-Bader, 2010).

Researchers can inadvertently cause extreme collinearity when total scores (composites) and their constituents are analyzed together. Suppose that a life quality questionnaire has five individual scales and a total score that is summed across all scales. Although the bivariate correlation between the total score and each of the individual scales may not be very high, the multiple correlation between the total score and the five scale scores must equal 1.0 when there is no missing data, which is multivariate collinearity in the extreme.

A straightforward method to detect extreme collinearity among three or more continuous variables is based on the variance inflation factor (VIF). It is often available in regression diagnostic procedures of computer programs for general statistical analyses. It is computed as

\text{VIF} = \frac{1}{1 - R_j^2} \qquad (4.4)

where R_j^2 is the proportion of variance in the jth predictor explained by all other predictors. Tolerance is 1 − R_j^2, or the proportion of variance that is unique, or not explained by other predictors, so VIF is the reciprocal of tolerance, and vice versa. Thompson et al. (2017) described how to calculate VIF in secondary analyses given just a correlation matrix with no raw data.

Values for the VIF range from 1.0, which indicates all predictor variance is unique, to increasingly positive values with no theoretical upper bound, which indicate higher and higher levels of collinearity. Again, there is no gold standard, but some authors have suggested that VIF > 10.0 signals possible extreme collinearity (Chatterjee & Price, 1991). This threshold corresponds to R_j^2 > .90 and tolerance < .10. Others have expressed skepticism about whether any single threshold for the VIF (10.0, 20.0, or even higher) that ignores other factors, such as sample size, is meaningful (O'Brien, 2007). Thompson et al. (2017) reminded us that any cutting point applied to the VIF should not be treated as a hard dichotomy in the sense that, for instance, one falsely believes that VIF = 9.9 versus VIF = 10.1 makes a practical difference.

Outliers

Outliers are scores that are very different from the rest. A univariate outlier is a score on a single variable that falls outside the expected population values (Mowbray et al., 2019). There is no single definition of a univariate outlier for continuous variables. For example, Tabachnick and Fidell (2019) suggested that scores that are more than 3.29 standard deviations above or below the mean are possible outliers. Expressed in terms of normal deviates, | z | > 3.29 describes this guideline. Elsewhere I suggested the more conservative heuristic of | z | > 3.0 (Kline, 2020a), but there is no magic cutoff point. A limitation is that outlier detection based on normal deviates is not robust against extreme scores—see Topic Box 4.2 for a description of an alternative method.

More important than any particular numerical threshold for detecting univariate outliers is that the researcher investigates extreme scores, which can arise due to

1. Mistakes in data entry or coding, such as typing "95" instead of "15" for a score or failing to specify that "999" means the observation is missing.

2. Intentional distortion or careless reporting, such as when research participants respond randomly to questionnaire items as a covert way to be uncooperative or lie in response to questions about socially undesirable behaviors such as cheating or drug use.

3. Administration of measures in ways that violate standardization, such as giving examinees hints that are not part of task instructions.

4. Selection of samples that are unrepresentative of the target population or under faulty distributional assumptions.

5. Natural variation within a population, or an extreme score belongs to a case selected from a different population (Osborne, 2013).

The last point just mentioned assumes that an extreme score is correct (e.g., it is not invalid). If it can be determined that a case with univariate outliers is not from the same population, such as a graduate student who completes a survey while auditing an undergraduate class, then it is best to remove that case. It is more difficult when extreme scores come from the target population; that is, although infrequent, such scores arise naturally, so removing them could affect the generalizability of the results. The basic options are (1) do nothing; (2) remove the outlier from the analysis; (3) minimize its influence through substitution, such as converting extreme scores to a value that equals the closest scores not considered extreme (e.g., within 3.0 standard deviations from the mean); or (4) apply a monotonic transform that pulls extreme scores closer to the center of the distribution. Another option is a sensitivity analysis in which results based on different decisions about extreme scores are explicitly compared.

TOPIC BOX 4.2

Robust Univariate Outlier Detection

Suppose that scores for five cases are 19, 25, 28, 32, and 10,000. The last score (10,000) is obviously an outlier, but it so distorts the mean and standard deviation that even the more conservative | z | > 3.00 rule fails, also called masking:

M = 2{,}020.80, \; SD = 4{,}460.51, \; \text{and} \; z = \frac{10{,}000 - 2{,}020.80}{4{,}460.51} = 1.79

A more robust decision rule for detecting univariate outliers is

\frac{| X - \text{Mdn} |}{1.4826\,(\text{MAD})} > 2.24 \qquad (4.5)

where Mdn designates the median—which is more robust against outliers than the mean—and MAD is the median absolute deviation of all scores from the median. The product of MAD and the scaling factor 1.4826 is an unbiased estimator of σ in a normal distribution. The whole ratio is the distance between a score and the median expressed in robust standard deviation units. The constant 2.24 is the square root of the approximate 97.5th percentile in a central chi-square distribution with a single degree of freedom. A potential outlier thus has a value on Equation 4.5 that exceeds 2.24.

For the five scores in this example, Mdn = 28.00, and the absolute values of median deviations are, respectively, 9.00, 3.00, 0, 4.00, and 9,972.00. The median of the deviations just listed is MAD = 4.00, and so for X = 10,000 we calculate

\frac{9{,}972.00}{1.4826\,(4.00)} = 1{,}681.51

which clearly exceeds 2.24 and thus detects the score of 10,000 as an outlier. See Rousseeuw and Hubert (2018) for additional methods of robust outlier detection.
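The rule in Equation 4.5 takes one line in R because the mad() function applies the 1.4826 scaling constant by default; the five scores from the box reproduce the values just computed:

```r
# Robust outlier detection per Equation 4.5:
x <- c(19, 25, 28, 32, 10000)
robust_z <- abs(x - median(x)) / mad(x)  # mad() = 1.4826 * MAD by default
robust_z                                 # last value is about 1,681.51
robust_z > 2.24                          # flags only the score of 10,000
```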

A multivariate outlier has extreme scores on ≥ 2 variables or an atypical pattern of scores. For example, a case may have scores between 2–3 standard deviations above the mean on all variables. Although no individual score might be considered extreme, the case could be a multivariate outlier if this pattern is unusual. Here are some options for detecting multivariate outliers with no univariate outliers:

1. Some SEM computer tools, such as IBM SPSS Amos and EQS, list cases that contribute the most to multivariate nonnormality, and such cases may be multivariate outliers.

2. Calculate for each case its squared Mahalanobis distance, D², which indicates the distance in variance units between the profile of scores and the vector of sample means, or centroids, adjusting for correlations among the variables. In large samples, D² is distributed as a central chi-square with degrees of freedom equal to the number of variables. A relatively high value of D² and low p value may lead to the rejection of the null hypothesis that the case comes from the same population as the rest. A conservative level of statistical significance is usually recommended for this test, such as .001. Leys et al. (2018) described a robust version of the multivariate test just mentioned.

Visual methods to detect univariate outliers include the inspection of histograms or box plots (Mowbray et al., 2019). In both types of displays, extreme scores are represented as further away from the main body of scores, or the rest of the distribution. Tukey (1977) developed box plots, also called box-and-whisker plots, as a way to graphically display the spread of the data throughout their whole range. The outer parts of the "box" are defined by the hinges, which correspond approximately to the first quartile (Q1), or the 25th percentile, and the third quartile (Q3), or the 75th percentile. The second quartile (Q2), or the median (50th percentile), is represented by a line in the box that is parallel to the two hinges. The whiskers are lines that connect the hinges with the lowest and highest scores that do not exceed 1.5 times the positive difference between the hinges (i.e., approximately 1.5 times the interquartile range, or Q3 – Q1). Any scores that fall outside of the limits just stated—the lower and upper fences—are represented as outliers. Exercise 4 asks you to generate a box plot for a small data set with an extreme score.

Majewska (2015) described graphical displays for multivariate outliers based on robust D² statistics. Because interpretation of graphical displays can be rather subjective, they are not substitutes for numerical methods.

Distributions

The default method in most SEM computer tools is a form of ML estimation for either complete raw data files or summary matrices (e.g., Table 4.1) that assumes multivariate normality, also called multinormality, for continuous outcome (endogenous) variables. This means that

1. All the univariate frequency distributions are normal.

2. All joint distributions of any pair of variables are bivariate normal; that is, each variable is normally distributed for each value of every other variable.

3. All bivariate scatterplots are linear with homoscedastic residuals.

Other variations on the definition of multivariate normality in SEM are described in Chapter 9 on global estimation methods. Because it is often impractical to examine all joint frequency distributions, it can be difficult to assess all aspects of multivariate normality. Fortunately, many instances of multivariate nonnormality can be detected through evaluation of univariate frequency distributions. This is because univariate normality is necessary but insufficient to guarantee multivariate normality (Pituch & Stevens, 2016).

Early SEM computer programs had few estimators other than default ML, but the situation today is very different: Most SEM computer programs now offer multiple estimators that accommodate different kinds of outcome variables, such as continuous, ordinal, binary, count, or censored variables, with various types of distributions, such as normal or nonnormal distributions for continuous outcomes and Poisson distributions for count variables, and so on.³

³ A count variable is the number of times a discrete event happens over a period of time, such as the number of hospitalizations over the past 5 years. In a Poisson distribution, the mean and variance are equal. A censored variable is one for which values occur outside the range of measurement, such as a scale that registers the value of weight between 1 and 300 pounds only.

A robust option for ML estimation for continuous outcomes with nonnormal distributions is described in Chapter 9, and other options for noncontinuous outcomes are covered later in the book. Thus, normality assumptions are less critical in modern SEM. Instead, the challenge is for the researcher to select an appropriate estimator, given the distributional characteristics of their data.

But there are still occasions in SEM when data should be screened for multivariate normality. One is when using the default ML method. Note that raw data are required to evaluate multivariate normality. If just a covariance matrix is submitted for analysis with default ML, it must be assumed that the original distributions for continuous outcomes are multivariate normal. Other occasions arise when using a method of MI that assumes multivariate normality or when applying the original Little MCAR Test. Quantitative measures of nonnormality are described next, with the assumption that readers already know how to inspect scatterplots, distributions of regression residuals, or other kinds of graphical displays or numerical summaries used with bivariate linearity and homoscedasticity; otherwise, see Cohen et al. (2003, chap. 4).

Significance tests intended to detect multivariate nonnormality, such as Mardia's (1970) test, or univariate nonnormality, such as the Kolmogorov–Smirnov (K–S) test, among others (Oppong et al., 2016), have limited usefulness. One reason is that slight departures from normality could be significant in large samples, and power in small samples may be low, so larger departures could be missed. An alternative for assessing univariate normality is quantitative measures of skewness and excess kurtosis, the two ways a distribution can be nonnormal; they can occur either separately or together in the same variable. A normal distribution has zero skewness and excess kurtosis.

Skewness is the degree of asymmetry in a probability distribution. It is defined as the third standardized moment, which in the population is

\gamma_1 = \frac{\mu_3}{\sigma^3} \qquad (4.6)

where σ is the population standard deviation and μ3 is the third central moment, a particular instance of the rth moment defined as

\mu_r = \frac{\sum (X - \mu)^r}{N} \qquad (4.7)

where μ is the population mean and r is an integer ≥ 1. In Equation 4.7, r = 3 for the third moment. In symmetrical distributions, the sum of deviations raised to the third power for scores above the mean will balance the sum of deviations raised to the same power for scores below the mean, so μ3 = γ1 = 0 in a normal curve. In unimodal distributions where most of the scores are below the mean such that the distribution has a longer right tail than the left tail, then μ3 > 0 and γ1 > 0, which indicates positive skew. But if most scores are above the mean such that the left tail is longer than the right tail, then μ3 < 0 and γ1 < 0, which indicates negative skew.

In unimodal distributions, kurtosis concerns the combined weight of both tails relative to the center of the distribution. Thus, kurtosis measures the relative presence of outliers in both tails of the distribution compared with a normal distribution. It is the fourth standardized moment, which in the population distribution is

\gamma_2 = \frac{\mu_4}{\sigma^4} \qquad (4.8)

where μ4 is the fourth moment about the mean, defined by Equation 4.7 for r = 4. The expected value for γ2 in a normal distribution is 3.0. Excess kurtosis is defined as

\gamma_2 - 3 \qquad (4.9)

so that its value equals 0 in a normal distribution. Thus, positive excess kurtosis, or γ2 – 3 > 0, means that the tails of the distribution are heavier relative to a normal curve, and such distributions are called leptokurtic. Negative excess kurtosis, or γ2 – 3 < 0, means just the opposite (lighter tails), and distributions with this characteristic are called platykurtic. The term mesokurtic refers to distributions with zero excess kurtosis, such as the normal curve. All references to "kurtosis" from this point concern excess kurtosis. Skewed distributions are generally leptokurtic.

Joanes and Gill (1998) described the three sample estimators for skew and kurtosis listed in Table 4.2 that could be printed by software for general statistical analyses. Statistics g1 for skew and g2 for kurtosis are based on sample moments and standard deviations computed as S with N in the denominator.

Other statistics in the table feature adjustments for small sample size in the computation of sample standard deviations as s with N – 1 in the denominator (e.g., b1, b2) or in calculations for central moments (e.g., G1, G2). In normal population distributions, all three skewness statistics in Table 4.2 are unbiased, but only G2 for kurtosis is unbiased in such distributions. Also shown in the table are the relative values of error variances for each statistic in normal distributions. Estimators G1 for skew and G2 for kurtosis tend to have the greatest expected variation over random samples, but differences among their values and error variances narrow for sample sizes > 100. In very small samples, though, differences among these estimators can be striking, including results for kurtosis that are both positive and negative in the same variable—see Cain et al. (2017) for more information.

Just as there are no golden rules for detecting outliers, there are also no universal absolute values for skewness or kurtosis statistics that indicate severe nonnormality. Finney and DiStefano (2013) noted that absolute univariate skewness and kurtosis values greater than, respectively, 2.0 and 7.0, have been described as indicating severe nonnormality in some computer simulation studies, but exceptions are easy to find. For example, Lei and Lomax (2005) treated absolute skewness and kurtosis values > 2.30 as indicating severe nonnormality. The point is that there is no magic demarcation between trivial and appreciable nonnormality that will fit all models and data sets, but the assumption of normality becomes increasingly less plausible as there is more and more skewness or kurtosis. Exercise 5 addresses a common misinterpretation in this area. Significance testing where skewness or kurtosis statistics are divided by their standard errors is another method, but it is problematic for reasons already stated (e.g., low power in small samples).

Normalizing transformations (normalization), or monotonic arithmetic operations that compress some parts of a distribution more than others while preserving rank order so that the transformed scores are more normally distributed, are an option, but you should think about the variables of interest.

TABLE 4.2. Estimators of Skewness and Kurtosis and Relative Error Variances for Normal Samples

Statistic   Equation                                  Unbiased   Error variance rank

Skewness
g1          m3 / S^3                                  Yes        2
G1          g1 \sqrt{N(N-1)} / (N-2)                  Yes        1
b1          m3 / s^3                                  Yes        3

Kurtosis
g2          m4 / S^4 - 3                              No         3
G2          [(N+1) g2 + 6] (N-1) / [(N-2)(N-3)]       Yes        1
b2          m4 / s^4 - 3                              No         2

Note. Error variances are ranked from highest to lowest; m3 = Σ(X – M)³/N; m4 = Σ(X – M)⁴/N; and S and s are the sample standard deviations computed with, respectively, N or N – 1 in the denominator.
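The moment-based estimators in Table 4.2 are simple to compute directly. A minimal R version of four of them follows (x is any numeric vector):

```r
# Sample skewness and excess kurtosis per Table 4.2:
skew_kurtosis <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)                  # squared S (N in the denominator)
  g1 <- mean(d^3) / m2^1.5         # skewness, moment-based
  g2 <- mean(d^4) / m2^2 - 3       # excess kurtosis, moment-based
  G1 <- g1 * sqrt(n * (n - 1)) / (n - 2)
  G2 <- ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))
  c(g1 = g1, G1 = G1, g2 = g2, G2 = G2)
}
skew_kurtosis(rnorm(1000))  # all four values near zero for normal data
```

In large samples the four estimates converge, which is consistent with the narrowing differences described in the text.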

Some variables, including reaction times, reports of alcohol or drug use, and health care costs, are expected to have nonnormal distributions (Bono et al., 2017). Normalizing an inherently nonnormal variable could mean that the target variable is not actually studied. Another consideration is whether the metric of the original variable is meaningful, such as postoperative survival time in years. Transformation means that the original meaningful metric is lost, which could be a sacrifice. Described in Topic Box 4.3 are types of normalizing transformations that might work—there is no guarantee—with practical suggestions for using them. Exercise 6 asks you to find a normalizing transformation for a small data set.

TOPIC BOX 4.3

Normalizing Transformations

Three kinds of normalizing transformations are described next with suggestions for their use:

1. Positive skewness. Before applying these transformations, add a constant to the scores so that the lowest modified score is 1.0. A basic operation is the square root transformation, or X^{1/2}, which works by compressing differences between scores in the upper end of the distribution more than the differences between lower scores. Logarithmic transformations are another option. A logarithm is a power (exponent) to which a base number must be raised to get the original number, such as 10² = 100, so the logarithm of 100 in base 10 is 2. Distributions with extremely high scores may require a transformation with a higher base, such as log₁₀ X, but a lower base may suffice for less extreme cases, such as the natural base e (approximately 2.7183) for the transformation logₑ X = ln X. The inverse function 1/X is an option for even more severe skewness. Because inverting scores reverses their order, (1) reflect (reverse) the original scores (multiply them by –1.0) and (2) add a constant to the reflected scores so that the maximum score is at least 1.0 before taking the inverse.

2. Negative skewness. All the transformations just mentioned also work for negative skewness when they are applied as follows: First, reflect the scores, and then add a constant so that the lowest score equals 1.0. Next, apply the transformation, and reflect the scores again to restore their original order.

3. Other types of nonnormality. Odd-root functions (e.g., X^{1/3}) and sine functions tend to bring in outliers from both tails of the distribution toward the mean. Odd-powered polynomial functions, such as X³, may help for negative kurtosis. If the scores are proportions, the arcsine square root transformation, or arcsin X^{1/2}, may help to normalize the distribution.

There are other types of normalizing functions, and this is one of their problems: It can be difficult to find a transformation that works with a particular distribution. The Box–Cox transformations (Box & Cox, 1964) may require less trial and error. The most common form is defined next for positive scores only:

X^{(\lambda)} = \begin{cases} \dfrac{X^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log X, & \text{if } \lambda = 0 \end{cases} \qquad (4.10)

where the exponent λ is a constant that normalizes the scores. Computer software for Box–Cox transformations attempts to find the optimal value of λ for a particular distribution. There are other variations of the Box–Cox transformation, some of which can be applied in regression analyses to deal with heteroscedasticity (Osborne, 2013).
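In R, the basic transformations in the box are one-liners, and the profile-likelihood search for the Box–Cox exponent in Equation 4.10 is available in the MASS package. A sketch for a positively skewed variable x (hypothetical) follows:

```r
# Normalizing transformations for positive skew (shift scores so
# the minimum is 1.0 first):
x1 <- x - min(x) + 1
sq <- sqrt(x1)   # square root: mild skew
lg <- log(x1)    # natural log: stronger compression
iv <- 1 / x1     # inverse: severe skew (reflect scores first if needed)

# Box-Cox (Equation 4.10): profile the normalizing exponent lambda
# for an intercept-only model.
library(MASS)
bc <- boxcox(x1 ~ 1, plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda with the highest profile likelihood
```

As the box warns, none of these operations is guaranteed to normalize a particular distribution, so the transformed scores should be rescreened.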

Relative Variances

In an ill-scaled covariance matrix, the ratio of the largest to the smallest variance is greater than, say, 100.0. Most estimation methods in SEM are iterative, which means that initial estimates are derived by the computer and then modified through subsequent cycles of calculation. The goal is to derive better estimates at each stage, estimates that progressively improve the fit between model and data. When improvements from step to step become sufficiently small (i.e., they fall below the convergence criterion), iteration stops because the solution is stable. But if the estimates do not settle down to stable values, the process may fail. One cause is variances that are very different: When the computer adjusts the estimates from one step to the next in an iterative method, the sizes of these changes may be huge for variables with small variances but trivial for variables with large variances. The whole process may head toward worse rather than better fit.

To prevent the problem just described, variables with extremely low or high variances can be rescaled by multiplying their scores by a constant, which changes the variance by a factor that equals the squared constant. For example, suppose that

s_X^2 = 12.0 \quad \text{and} \quad s_Y^2 = .12

so their variances differ by a factor of 100.0. Using the constant .10, we can rescale X as follows:

s_{.10X}^2 = .10^2 \times 12.0 = .12

so now variables .10X and Y have the same variance, or .12. Next, we rescale Y so that it has the same variance as X, or 12.0, by applying the constant 10.0, or

s_{10.0Y}^2 = 10.0^2 \times .12 = 12.0

Multiplying a variable by a constant is a linear transformation that changes its average and variance but not its correlation with other variables. This is because linear transformations do not alter relative distances between scores. An example with real data follows.

Roth et al. (1989) administered measures of exercise, hardiness (resiliency, tough mindedness; referred to as "hardy" from this point), fitness, stress, and level of illness in a sample of 373 university students. Table 4.3 provides a summary matrix of these data. The largest and smallest variances in this matrix (see the table) differ by a factor of > 27,000, so the covariance matrix is ill-scaled. I have seen older SEM computer tools fail to analyze this matrix due to this characteristic. To prevent this problem, I multiplied the original variables by the constants listed in the table (including 1.0; i.e., no change) in order to make their variances more uniform. Among the rescaled variances, the largest variance is only about 13 times greater than the smallest variance. The rescaled matrix is not ill-scaled.

SUMMARY

The 80/20 rule of data analysis is a variation on the Pareto principle, named after the Italian economist Vilfredo Pareto, that 80% of a nation's wealth is owned by 20% of the people. In the context of data analysis, it means that researchers should invest more time (i.e., at least four times as much) screening and preparing their data than actually conducting the substantive analyses. Data screening in SEM includes the evaluation of missing data patterns and extent; detection of outliers, extreme collinearity, and whether data matrices are ill-scaled; assessment of whether distributional assumptions for particular estimators are consistent with the data; and complete reporting in written summaries of the results on how particular problems were dealt with in the analysis. Making the data file available to other researchers is a strong statement of transparency.

LEARN MORE

Dong and Peng (2013) offer clear and accessible descriptions of techniques for missing data, Manly and Wells (2015) describe best practices for reporting about the use of MI, and van Ginkel et al. (2020) review common misunderstandings about MI.

Dong, Y., & Peng, C.-Y. J. (2013). Principled missing data methods for researchers. SpringerPlus, 2(1), Article 222.

Manly, C. A., & Wells, R. S. (2015). Reporting the use of multiple imputation for missing data in higher education research. Research in Higher Education, 56(4), 397–409.

van Ginkel, J. R., Linting, M., Rippe, R. C. A., & van der Voort, A. (2020). Rebutting existing misconceptions about multiple imputation as a method for handling missing data. Journal of Personality Assessment, 102(3), 297–308.

TABLE 4.3. Example of an Ill-Scaled Data Matrix

Variable       1         2         3        4         5
1. Exercise    —
2. Hardy       –.03      —
3. Fitness     .39       .07       —
4. Stress      –.05      –.23      –.13     —
5. Illness     –.08      –.16      –.29     .34       —

Original M     40.90     0.00      67.10    4.80      716.70
Original s²    4,422.25  14.44     338.56   44.89     390,375.04
Constant       1.00      10.00     1.00     5.00      .10
Rescaled M     40.90     0.00      67.10    24.00     71.67
Rescaled s²    4,422.25  1,444.00  338.56   1,122.25  3,903.75
Rescaled SD    66.50     38.00     18.40    33.50     62.48

Note. These data (correlations, means, and variances) are from Roth et al. (1989); N = 373. Note that low scores on the hardy measure used by these authors indicate greater hardiness. To avoid confusion due to negative correlations, the signs of the correlations that involve the hardy measure were reversed before they were recorded in this table.
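The rescaling arithmetic behind Table 4.3 can be verified in a few lines of R:

```r
# Rescaling the Table 4.3 variances: multiplying a variable by a
# constant multiplies its variance by the squared constant.
s2    <- c(exercise = 4422.25, hardy = 14.44, fitness = 338.56,
           stress = 44.89, illness = 390375.04)
const <- c(1, 10, 1, 5, .10)

max(s2) / min(s2)          # > 27,000: the original matrix is ill-scaled
s2_new <- const^2 * s2     # rescaled variances, as in the table
max(s2_new) / min(s2_new)  # about 13: no longer ill-scaled
```

Because rescaling is a linear transformation, the correlations in the top part of the table are unaffected.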

EXERCISES

1. Reproduce the covariance matrix in the middle of Table 4.1 from the correlations and standard deviations at the bottom of the table.

2. Given cov_XY = 13.00, s_X^2 = 12.00, and s_Y^2 = 10.00, show that the corresponding correlation is out of bounds.

3. Listed next as (X, W, Y) with missing observations indicated by "—" are scores for 7 cases:

(42, 13, 8), (34, 12, 10), (22, —, 12), (—, 8, 14), (24, 7, 16), (16, 10, —), (30, 10, —)

Compute the covariance matrix for these data using pairwise deletion. Show that the resulting covariance matrix is NPD and also that the corresponding correlation matrix has an out-of-bounds value.

4. Generate a box plot for the N = 63 scores listed next. Locate in your figure values for the extreme of the lower whisker, lower hinge (H1), median, upper hinge (H2), extreme for the upper whisker, and any outlier beyond 1.5 × (H2 – H1) from its respective hinge:

Score   Frequency     Score   Frequency
10      5             15      5
11      15            16      4
12      14            17      1
13      13            27      1
14      5

5. A researcher finds for a continuous variable that skewness is 1.97 and kurtosis is 6.90, and concludes that the distribution is normal. Comment.

6. Find a normalizing transformation for the data in Exercise 4. Do not remove the outlier.

Appendix 4.A

Steps of Multiple Imputation

The three basic steps in MI—the imputation step, the analysis (estimation) step, and the pooling (combination) step—and corresponding decisions required at each point are summarized next. It is not possible in this overview to give detailed descriptions about statistical options, especially in the imputation step, but see Allison (2012), Dong and Peng (2013), van Buuren (2018), or van Ginkel et al. (2020) for more information.

STEP 1. IMPUTATION

The imputation step involves the analysis of the variables that make up the imputation model and also the method by which random variation is modeled and incorporated in generating imputed scores. That method should match the level of measurement for incomplete variables (e.g., continuous vs. categorical), whether data loss is univariate or multivariate (occurs on a single vs. multiple variables), and whether the pattern is monotone or general. Monotone missing data means that variables can be ordered and data loss at a particular point means that all subsequent observations are missing. Dropout in a longitudinal study is an example of monotone data loss; any other pattern is general (nonmonotone) missing data.

An option for univariate data loss is the regression method, which is based on linear regression and assumes multivariate normal distributions for continuous variables. It works by regressing an incomplete variable on other variables with complete data. Next, imputed regression coefficients are selected from the sampling distributions for the coefficients in the first analysis, and imputed scores are generated from the values of the imputed coefficients and a random error term. These steps are repeated k times, the number of imputed data sets. A variation is the predictive mean matching method, which does not require normality and imputes values "borrowed" from other records called donor cases. This method generates predicted scores for all cases, including those with complete data, and randomly selects as an imputed score from among cases with observed scores similar to the predicted score for an incomplete case. The number of complete cases in the donor set can be specified by the researcher with about 5–10 cases as a typical range. In small samples, though, 10 cases in the donor set may include too many dissimilar records, especially for few predictors (Allison, 2015).

A method for multivariate data loss is fully conditional specification, also called multivariate imputation by chained equations (MICE) and sequential regression multivariate imputation. It generates a series of conditional models, one for each incomplete variable in the data set, and the method requires specification of the imputation model for each incomplete variable. Because the method is based on the separate distributions of each incomplete variable, it can be applied to continuous variables with nonnormal distributions or to variables that are not continuous, such as ordinal data. Initial imputed values are randomly selected from the observed data, and these values are subsequently improved for each incomplete variable through iterative selection from conditional distributions estimated by the observed and imputed scores. The whole process then cycles through all incomplete variables. The method is flexible but can converge to incompatible conditional models depending on the order of the univariate imputation steps (van Buuren, 2018).

Another multivariate method for arbitrary data loss patterns is based on the Markov chain Monte Carlo (MCMC) approach that randomly samples from theoretical probability distributions, in this case from predictive distributions for the missing data, and these draws become the imputed scores. A Markov chain is a probability model in which the likelihood of an event depends only on the state of the previous event. That is, the probability of a future event can be estimated just as well from the current state of the event as when knowing the full history of events. A "chain" thus concerns multiple simulated random draws from the same distribution.

In the application of the MCMC method in MI, it is assumed that the underlying complete data follow a multivariate normal distribution, and the computer simulates draws from such distributions over an iterative sequence of paired steps called the I-step (imputation, but not meaning the first step in MI) and the P-step (posterior). At the I-step, imputed scores are drawn for each incomplete case from the predictive distribution in the current iteration. Next, in the P-step, the
parameters of the predictive distribution are updated by draws from the posterior distribution. The "chain" consists of I-step/P-step pairs over iterations. Some implementations allow for a burn-in period, or default number of iterations (e.g., 200), before the first set of imputed values is drawn. One rationale for this option is to dissipate the effects of the distribution from the prior iteration that may differ appreciably from those of the target distribution, but burn-in is not an inherent feature of the MCMC method, and there are other methods to find good starting points for drawing imputed scores (Geyer, 2011).

The MCMC method is generally robust against multivariate nonnormality in large samples (Demirtas et al., 2008), but its use can be problematic with categorical variables. For example, the practice of imputing values based on normal distributions and then rounding to the nearest integer or to the nearest plausible value can yield very biased results for categorical variables (Allison, 2012). The chained equations method (MICE algorithm) described earlier is an alternative. Another possibility is to use a method for the imputation step based on a logistic, loglinear, or other statistical model for categorical variables—see Audigier et al. (2017) for more information and examples.

Yet another multivariate method for arbitrary missing data patterns is the expectation–maximization (EM) algorithm, which is a general purpose iterative procedure to find ML estimates for parameters that involve latent variables. In the context of MI, the method alternates between the E-step (expectation), in which missing scores are imputed from predicted distributions for the missing data, and the M-step (maximization), where parameters for the distribution of both the missing and observed data, such as means and covariances, are estimated using ML before the cycle repeats over the E- and M-steps until there are k imputed scores for each missing observation. Because the EM algorithm is a general method, it can be used outside of MI to directly estimate parameters for latent variables without imputing scores for individual cases (Dempster et al., 1977). Estimation of standard errors may be more accurate in the FIML method for incomplete data sets that is described in Chapter 9 (Dong & Peng, 2013).

Besides the algorithm, the researcher must also specify the number of scores to be imputed for each missing observation, or k. Suggestions in older works were roughly k = 3–10 based on the relative efficiency of imputing these numbers of times against a theoretically infinite number of imputations (Patrician, 2002). Greater numbers of imputations, such as k = 100–200, are recommended in more recent works (Graham et al., 2007; Little, 2013) based on statistical power and the fraction of missing information (FMI), which is not the rate of missing data. The FMI is instead the proportion of variation in parameter estimates due to nonresponse. That is, it quantifies the amount of parameter information lost to nonresponse (Lang & Little, 2018), so in this way the FMI is analogous to an R² statistic for missing data (Enders, 2010). For example, if FMI = .03 for a particular parameter, then the loss of efficiency due to missing data is 3%; that is, the estimate based on incomplete data is 97% as efficient compared to what it would have been with no missing data (Savalei & Rhemtulla, 2012). Wagner (2010) suggested that researchers should consistently report the value of the FMI when analyzing incomplete data sets.

The FMI is a function of k and the ratio of the between-imputation variance over the total variance, or the sum of between-imputation variance and the within-imputation variance. The within-imputation variance is the average error variance associated with parameter estimates within all imputed data sets. Conceptually, it estimates what the error variance would be if there were no missing scores. The between-imputation variance is the variation in estimates over the k imputations. As more information is lost due to missing data, the between variance will increase, but when the data set includes good predictors of the variables with missing data, both the between variance and the FMI are expected to decrease (Wagner, 2010). In general, the greater the covariances among the observed variables, the lower the value of the FMI (Little, 2013). Dong and Peng (2013, p. 5) described equations for computing the FMI, and some computer procedures for MI print the value of this statistic in the output.

STEP 2. ANALYSIS

After k complete data sets are generated in the imputation step, next comes the analysis step, which concerns the analysis model. Parameters are estimated for effects of substantive interest in each of the k imputed data sets. Ideally, the variables in the imputation model would be the same as those in the analysis model except for auxiliary variables (if any). But as the imputation and analysis models are based on increasingly smaller sets of overlapping variables, then results from the
imputation step may not be very meaningful in the analysis step.

The imputation model should also be general enough in terms of assumptions about distributions or functional relations to reflect the data in the analysis model. Suppose that normality is assumed for a variable in the imputation model, but the actual distribution is severely nonnormal. Results in the analysis model based on means and covariances, such as regression coefficients, may not be grossly inaccurate, but estimates of p values or bounds for confidence intervals could be severely distorted (Schafer, 1999). Enders (2010) offers suggestions for representing interactive effects of continuous or categorical variables in both models.

STEP 3. COMBINATION

Results from the analysis step from each of the k imputed data sets are synthesized in the combination (pooling) step. The final parameter estimate is the average over the k results for that parameter, and its standard error is estimated based on both the within- and between-imputation variances; a sketch of these pooling rules follows this section. You should also be aware of the issues listed next:

1. There is no unique set of results in MI: This is because imputed scores are generated through simulated random sampling. Thus, results for the same model and incomplete data file can and will change each time the procedure is repeated unless a random seed that determines the starting point for generating random numbers is specified.

2. Indeterminacy also applies to the data: The researcher's model is analyzed in k imputed data sets, so there is no definitive summary data matrix. Thus, the results could not be reproduced in a secondary analysis conducted with a single data matrix (all k matrices would be required). But the researcher can still make available the original raw data file.

3. There is little doubt that MI generates better estimates than classical techniques in reasonably large samples when the data loss pattern is MAR instead of MCAR (Schafer & Graham, 2002). It might also reduce bias compared with classical techniques when data loss patterns are both MAR and MNAR, if variables can be added that measure the missing data mechanism, but there is no guarantee (van Ginkel et al., 2020).

There are special statistical techniques for analyzing MNAR data (Rubin & Little, 2020), but they are not yet widely used and are more difficult to apply than methods that assume MAR. Methods for MNAR data loss mechanisms are becoming available in SEM computer tools such as Mplus. Galimard et al. (2016) described adaptations of MI for MNAR mechanisms.
Pt1Kline5E.indd 66 3/22/2023 3:52:40 PM


5

Computer Tools

Two categories of computer tools for traditional SEM are described in this chapter: freely available software
and commercial software. Free software includes packages for conducting SEM analyses in the R comput-
ing environment (e.g., lavaan, semTools) and stand-alone computer tools with a graphical user interface
(GUI) that require no larger software environment (e.g., JASP, Wnyx). Free computer tools have become
increasingly capable to the point where they can replace commercial tools for basically all but the very most
advanced kinds of analyses. Modern SEM computer programs—both free and commercial—are generally
easier to use than their predecessors. But greater user-friendliness of contemporary SEM computer tools
should not lull the researcher into thinking that SEM is easy or requires minimal conceptual understanding.
Features of computer programs can change quickly with new versions, so check the sources listed next for
the most up-to-date descriptions. Computer tools for nonparametric SEM are described in Chapter 6, and
Chapter 16 covers software options for composite SEM. In a work describing the ideas of the Canadian
communication theorist Herbert Marshall McLuhan, Culkin (1967, p. 70) wrote, “We shape our tools and
thereafter they shape us.” I hope that computer use sharpens, rather than dulls, your ability to think critically
about SEM.

EASE OF USE, NOT SUSPENSION by drawing it in onscreen using geometric symbols


OF JUDGMENT such as boxes, circles, and lines with arrowheads on
one or both ends. Next, the program automatically
The first widely available SEM computer tool was translates the model graphic into lines of code, which
LISREL III (Jöreskog & Sörbom, 1976). At the time, are then executed. Thus, (1) the user need not know
LISREL and related applications were not easy to use very much (if anything) about how to write syntax in
because they required the generation of rather arcane order to run an SEM analysis, and (2) the role for tech-
code, and were available only on mainframe comput- nical programming skills is reduced. For researchers
ers with stark command-line user interfaces. The abun- who understand the basic concepts of SEM, this devel-
dance of relatively inexpensive, yet capable, personal opment can only be a plus—anything the reduces the
computers greatly changed things. Statistical software drudgery and gets one to the results quicker is a benefit.
with a GUI is generally easier to use than their char- But there are potential drawbacks to push-button
acter-based counterparts. Indeed, user-friendliness in modeling. For example, no- or low-effort program-
modern SEM computer tools is a near-revolution com- ming could encourage the use of SEM in uninformed
pared with older programs. or careless ways. This is why it is more important than
Most SEM computer tools still permit the user to ever to be familiar with the conceptual and statistical
write code in that application’s native syntax. Some bases of SEM. Computer programs, however easy to
programs offer the alternative of specifying the model use, should only be the tools of your knowledge and

67

Pt1Kline5E.indd 67 3/22/2023 3:52:40 PM


68 Concepts, Standards, and Tools

not its master. Steiger (2001) noted that emphasis on (Arbuckle, 2021), although Amos does offer a separate
ease of use of statistical computer tools can give begin- syntax editor. Although drawing editors are popular
ners the impression that SEM is easy, but the reality is with beginners, there are potential drawbacks—see
that things can and do go wrong in SEM. Beginners Topic Box 5.1.
often quickly realize that analyses fail because of tech-
nical problems, including a terminated program run
with cryptic error messages or uninterpretable output. TIPS FOR SEM PROGRAMMING
These things happen because (1) actual research prob-
lems can be quite technical, and the availability of user- Listed next are suggestions for using SEM computer
friendly software does not change this fact. Also, (2) tools; see also Little (2013, pp. 25–27):
computer tools are not perfect and, thus, are incapable
of detecting or preventing all failure conditions. That 1. Learn from the examples of others. Annotated syn-
is the reason why this book places so much emphasis tax, data, and output files for all detailed analyses
on conceptual knowledge instead of teaching you how can be downloaded from the book’s website. Read-
to use a particular computer tool: In order to deal with ers can open these files on their own computers
problems in the analysis when—not if—they occur, without installing any special software. There are
you must understand what went wrong and why. additional online resources with syntax examples
for lavaan,1 Mplus,2 LISREL,3 and other SEM
computer tools.
HUMAN–COMPUTER INTERACTION 2. Annotate your syntax files. Comments are usu-
ally designated by special symbols, such as *, #,
There are three basic ways to interact with SEM com-
or !, that are ignored by the computer. Use com-
puter tools:
ments to describe the specified model, data, and
requested output. Explain the creation of any new
1. Batch processing, where the user writes syntax
variable. Such information is useful for colleagues
that specifies the model, data, analysis, and output.
or students who did not conduct the analysis, but
Next, the syntax is executed through some form of
who need to understand it. Annotation also helps
a “run” command.
researchers to know just what they did in a particu-
2. Drawing editor, where the user draws the model on lar analysis days, weeks, months, or even years ago.
screen. When the diagram is finished, the analysis Without sufficient comments, one quickly forgets.
is run in the GUI.
3. Keep it simple. Sometimes beginners try to analyze
3. Templates or menus, where the model and analy- models that are too complicated, and such analyses
sis are specified as the user clicks with the mouse are more likely to fail. With more syntax or screen
on interface elements such as text fields, pull-down space in a drawing editor for a complex model,
menus, or radio buttons. there are also more opportunities for making mis-
takes. It can also be hard to tell whether a very com-
Batch mode is for users who know the syntax for a plex model is identified. If the researcher does not
particular SEM computer tool. In contrast, knowledge know that their model is not really identified, the
of syntax is generally unnecessary when using a draw- failure of the analysis may be wrongly attributed to
ing editor or templates. But even here some knowledge a syntax error or problem with the data.
of syntax can help. In LISREL, for example, both its
drawing editor and template-based mode of interac- 4. Build it up. Start instead with a simpler model that
tion automatically write syntax in a window that must you know is identified. Try to get the analysis of
be run by the user. That syntax can be edited, and the initial model to successfully run. Then build up
sometimes a problem in model specification is appar- 1 https://www.lavaan.ugent.be/tutorial/index.html
ent in the syntax, which the user can correct, if they
2 https://www.statmodel.com/ugexcerpts.shtml
understand the syntax. In contrast, the drawing edi-
tor in IBM SPSS Amos analyzes the model and data 3 https://ssicentral.com/index.php/products/lisrel/lisrel-

without generating editable (or even viewable) syntax examples/

Pt1Kline5E.indd 68 3/22/2023 3:52:40 PM


Computer Tools 69

TOPIC BOX 5.1

Graphical Isn’t Always Better


The potential drawbacks of graphical editors in SEM computer tools are outlined next. They explain why
some researchers switch from a drawing editor when first learning about SEM to working in batch mode
as they gain experience:

1. It can be tedious to draw onscreen a complex model with many variables, such as numerous
repeated measures variables in a longitudinal design or dozens of items on a questionnaire in a
factor analysis. This is because the screen quickly fills up with graphical elements. The resulting
visual clutter can make it difficult to keep track of what you are doing.
2. Specifying analyses where models are simultaneously fitted to data from two or more samples can
be difficult. This is because it may be necessary to look through different screens or windows in
order to find information about data or model specification for each sample.
3. It is easier to annotate the analysis by putting comments in a syntax file compared with working
in a graphical editor, which may not support user-­supplied comments. It is so easy to lose track of
what you have done in an analysis without detailed comments. Thus, using a drawing editor that
does not allow annotations can engender carelessness in record keeping (Little, 2013).
4. Debugging a syntax file after a failed analysis is generally simpler than doing so in a graphical
editor, which may require clicking on multiple graphical elements to inspect specifications in sepa-
rate dialogs boxes or windows. In contrast, everything that specifies the analysis (i.e., commands)
can be viewed in a single place when editing a syntax file.
5. The format of files generated in graphical editors is typically proprietary, which means that they
generally cannot be opened or edited with a different computer tool. In contrast, syntax files are
usually text (ASCII) files that can be opened or edited in any basic text editor, including word
processors. Thus, sharing an analysis with other researchers is generally easier with syntax files.
Including syntax files in supplemental materials for journal articles supports both transparency in
reporting and accessibility for readers.
6. Certain kinds of advanced analyses or options may be available in syntax only. One reason is
that although model diagrams for more basic SEM analyses are more or less standard, this is
less true for more advanced applications such as multilevel analyses, for which there is no single
graphical standard (Curran & Bauer, 2007).
7. There is evidence that graphical user interfaces can help novices who are not trained in informa-
tion technology or in computer science to learn and perform basic tasks faster and in fewer steps.
But graphical interfaces can also impede or slow down expert users compared with text-based
interfaces, and just because an interface is graphical does not guarantee ease of use or reduced
cognitive load (Chen & Zhang, 2007).
8. It seems that it would be easy to generate a publication-­quality diagram in an SEM drawing edi-
tor, but this is not exactly true. Drawing editors may use a fixed symbol set that does not include
special symbols that you want to appear in the diagram. There may be limited options for adjust-
ing the appearance of this diagram (e.g., changing font or line widths). Graphs generated by
drawing editors may be rendered in relatively low resolution that is fine for the computer monitor
but not for display in a printed (paper or virtual) document. There are R packages that can gener-
(continued)

Pt1Kline5E.indd 69 3/22/2023 3:52:41 PM


70 Concepts, Standards, and Tools

ate high-­quality model diagrams, including semPlot (Epskamp, 2022), but they can be frustrating
to use: The diagram is specified in syntax, but the diagram generated by executing that syntax
cannot be directly edited, so if something is wrong, then it is back to the syntax (repeat).

Another option is a professional diagramming and graphics computer tool, such as Microsoft Visio,
that features templates and tools for drawing predefined shapes, such as circles, squares, or lines with
optional arrowheads on one or both ends. A drawback is that professional graphics software can be
relatively expensive, although there are free alternatives like LibreOffice Draw with generally more modest
capabilities. But using comprehensive drawing software to create diagrams of structural equation models
can seem like overkill. This is because only a small fraction of the program functionalities is used to create
model diagrams, which are composed of just a few kinds of graphical elements.
Here is a trade secret I’ll share with you: All model diagrams in this book were created using nothing
other than Microsoft Word Shapes (rectangles, ovals, text boxes, etc.) that are grouped together. Maybe
I am biased, but I think these diagrams are not too shoddy. Sometimes you can do a lot with a simple but
flexible tool. Yes, it takes a lot of time to make a publication-­quality model diagram in Word—or in any
graphical computer tool—but once you make a few examples, you can reuse graphical elements, such as
those for common factors and indicators, in future diagrams. Mai et al. (2022) described semdiag, a free,
open source web application for drawing model diagrams in SEM. An online version of semdiag is avail-
able at https://semdiag.psychstat.org/ There is also a free, open-­source diagramming application that can
be used online through a web browser or as a downloaded application for Windows, Apple (macOS),
Linux, or Google Chrome OS platform computers available at https://www.diagrams.net/

the model by adding parameters that reflect your COMMERCIAL VERSUS FREE
hypotheses until the target model is eventually COMPUTER TOOLS
specified. If the analysis fails after adding a par-
ticular parameter, the reason may be identification. I am often asked, what is the best software package
5. Comment out instead of delete. In analyses where for SEM? My answer has two parts: (1) There is no
models are progressively simplified (trimmed) single “best” package because there are now many
instead of built, the researcher can comment out capable options available. (2) What is available to you
part of the syntax, or deactivate it, by designating at the lowest cost? For example, many universities have
site licenses that permit researchers and students to
those commands as comments in the next analysis.
use commercial SEM computer tools free of charge.
This method preserves the original code as a record
Advantages of commercial SEM programs include reg-
of the changes.
ular updates, user support including a single point of
6. Lend the computer a hand. Sometimes iterative contact for problems, and complete program manuals
estimation fails because the computer needs better and documentation, sometimes with numerous analysis
starting values, or initial estimates of model param- examples and data sets.
eters, to find a converged solution. Most SEM com- Commercial applications are less advantageous for
puter tools generate their own starting values, but users without institutional or grant funds for software.
computer-derived values do not always lead to con- As with other specialized software for multivariate sta-
verged solutions. Fortunately, SEM computer tools tistical analyses, costs to purchase or license SEM com-
generally allow users to specify starting values that puter tools can be rather expensive and can range from
override computer defaults. Suggestions for speci- several hundred to over $1,000 a year or more, although
fying starting values for different types of models discounts may be offered for academic users. There are
are offered later in the book. also researchers who simply prefer using free or open-

Pt1Kline5E.indd 70 3/22/2023 3:52:41 PM


Computer Tools 71

source software. Open source means that software umentation. Although there are articles, chapters, or
development is decentralized by making the source even entire books written for applied researchers who
code available so that individual users can modify the use lavaan for SEM, standard R package documenta-
software and redistribute or publish their version back tion is written in a technical way that assumes knowl-
to the community. Most, but not all, open-source soft- edge of object-oriented programming. Specifically,
ware is free, and just because a computer tool is free the typical “manual” for an R package consists of an
does not mean that it is also open source. alphabetical listing of functions with terse references to
Among researchers in the behavioral sciences, the objects, methods, and data types. Such documents can
best-known computing environment for data manage- be pretty cryptic for researchers without strong pro-
ment, statistical analysis, and graphics that is both free gramming skills. It is also true that numerical output
and open source is R (R Core Team, 2022). A basic in R is not always “pretty,” that is, the output is format-
installation of R has many of the same capabilities for ted in ways that are not easy to read at first glance. An
statistical analysis as commercial products such as example is scientific notation in which very small or
SPSS or SAS/STAT, and thousands of free packages very large numbers are represented in a simpler form
extend its range even further. This includes several (e.g., 1.0E-4 for the number .0001). For researchers
packages for SEM, some of which can analyze a wide unwilling to deal with these challenges, a commercial
range of structural equation models with capabilities tool may be a better option, if cost is no issue.
similar to those of commercial software. This is espe-
cially true for lavaan, described momentarily. Its
extensive capabilities are why I used lavaan in most R PACKAGES FOR SEM
detailed analysis examples described in this book.
There are potential obstacles to using the noncom- Listed in Table 5.1 are major R packages for SEM anal-
mercial software programs just mentioned. One is doc- yses with citations and descriptions. All packages in

TABLE 5.1. Major R Packages for SEM


Name Citation Description
General model fitting
sem Fox et al. (2022) Basic but broad capabilities for analyzing structural
equation models

lavaan Rosseel et al. LAtent VAriable ANalysis, estimates a wide range of


(2023) models with capabilities that rival commercial tools

OpenMx Boker et al. Matrix processor and numerical optimizer that can be
(2022) used with multicore computers or networked computers

piecewiseSEM Lefcheck (2020) Single-equation estimation of path models with global fits
test based on predicted conditional independencies

Utilities and tools


semTools Jorgensen et al. Extends lavaan capabilities, includes utilities for power
(2022) analysis and other special types of analyses

systemfit Henningsen & Estimates simultaneous linear equations with ordinary


Hamann (2022) least squares and instrumental variable methods

bmem Zhang & Wang Estimators and bootstrapped confidence intervals for
(2022) indirect effects with incomplete data

Pt1Kline5E.indd 71 3/22/2023 3:52:41 PM


72 Concepts, Standards, and Tools

the table except OpenMx were used in analyses for this The eponymously named package piecewiseSEM
book. The sem package was probably the very first for (Table 5.1) supports the method of piecewise SEM,
analyzing a wide range of structural equation models, described in Chapter 8. Briefly, the method features
but there are no plans for major development beyond (1) single-equation (local) estimation of observed vari-
its current form.4 It is still quite usable, though, and able path models and (2) a global test of model–data
it offers a range of estimators, including instrumental correspondence (i.e., goodness-of-fit) based on Pearl’s
variable methods. The sem package can also be inte- (2009) approach to nonparametric structural equation
grated with external R packages for multiple imputa- modeling covered in the next chapter. Local estimation
tion (MI), bootstrapping, and estimating polychoric means that the equation for each outcome variable is
correlations, which further extend its capabilities (Fox separately analyzed instead of the computer attempt-
et al., 2022). ing to simultaneously estimate all model parameters,
The lavaan package for SEM (see Table 5.1) has or global estimation. Piecewise SEM is best known in
extensive analysis capabilities and has been updated biology and ecology, but it offers potential benefits for
several times since it was first released in 2011. The researchers in the behavioral sciences, too.
lavaan package has estimators for continuous, binary, Listed in the lower part of Table 5.1 are “toolbox”
and ordinal data, and there are capabilities for item R packages for special kinds of analyses in SEM and
response theory (IRT), latent class, mixture model- other statistical techniques. The semTools package
ing, and multilevel analyses. Modern options for han- extends the capabilities of lavaan to exploratory fac-
dling missing data include full information maximum tor analysis (EFA) or estimation of interactive effects of
likelihood (FIML) and MI, but use of an external R latent variables. The same package can also estimate
power, conduct Monte Carlo simulations, correct sig-
package, such as semTools, is required for the latter.
nificance test results for small sample sizes, estimate
Resources for using lavaan include books (Beaujean,
the reliability of factor measurement, and combine
2014; Gana & Broc, 2019), journal articles (Andersen,
results from MI. It has a special function that uses the
2022; Svetina et al., 2020), a website,5 and a discussion
Kaiser and Dickman (1962) method to generate a raw
group.6
data file of a specified size from an input covariance
The OpenMx package is a powerful matrix proces-
matrix, where all descriptive statistics of generated
sor and numerical optimization tool with capabilities raw scores exactly match those of the specified cova-
for analyzing structural equation models, multilevel riance matrix (i.e., the scores are generated with no
models, mixture models, and models of genetic relat- added sampling error). This capability is handy when
edness—see Table 5.1. Models can be interactively the researcher has a summary covariance matrix but
built one step at a time, which allows code debugging no raw scores, such as from a journal article, and the
in smaller steps. The package supports multicore com- researcher wants to conduct a secondary analysis with a
puters, where multiple central processing units in the computer tool that requires raw scores. A GitHub wiki
same computer operate in parallel on an analysis, and for semTools is available.9
distributed processing over separate computers work- The R package systemfit (Table 5.1) estimates sys-
ing in parallel clusters. These capabilities support the tems of linear equations for observed variables. It has
analysis of extremely large data sets. Missing data are capabilities for OLS and instrumental variable methods.
handled through the FIML method. There is an online Versions of the latter method can be used for an analysis
user guide,7 and the OpenMx community wiki offers called seemingly unrelated regressions (SUR), which
tutorials, example models, and forums.8 The syntax is is actually a misnomer because the regressions for sep-
relatively complex but highly flexible once mastered— arate outcome variables are correlated due to overlap-
see Neale et al. (2016) for examples. ping (correlated) error terms. In contrast, error terms
for multiple outcomes in standard OLS regression are
4 J. Fox (personal communication, November 20, 2020). assumed to be independent. There are also functions in
5 https://www.lavaan.ugent.be/ systemfit that evaluate the overall fit of regression mod-
6 https://groups.google.com/g/lavaan els to the data and conduct diagnostic tests for instru-
7 https://vipbg.vcu.edu/vipbg/OpenMx2/docs//OpenMx/latest/
mental variable methods. The bmem package supports
8 https://openmx.ssri.psu.edu/wiki/main-page 9 https://github.com/simsem/semTools/wiki

Pt1Kline5E.indd 72 3/22/2023 3:52:41 PM


Computer Tools 73

the generation of confidence intervals for estimates of The JASP program is an open-source, integrated
indirect causal effects in mediation analyses when there application for both traditional (frequentist) and Bayes-
are missing data (see the table). It also has capabilities ian statistical analyses (JASP Team, 2022).11 There
for estimating statistical power for tests of indirect are versions for Windows, MacIntosh, and Linux plat-
effects in a variety of mediational analyses, including form computers.12 It features a GUI that is intuitive
ones where the intervening variable is modeled as a even for relatively inexperienced users of statistical
latent variable. computer tools. Statistical capabilities include uni-
variate and multivariate analysis of variance (ANOVA,
MANOVA), regression analysis (linear, linear mixed,
FREE SEM SOFTWARE logistic) and EFA, among others. There is a special
WITH GRAPHICAL USER INTERFACES module for SEM, for which users enter lavaan syn-
tax to specify the model in a special text window, but
The :nyx (pronounced “onyx”) program for SEM runs output options are controlled in the JASP user inter-
under the Java Runtime Environment (version 1.6 or face. The SEM module also offers separate options for
later) on Windows, Apple (macOS), or Unix/Linux plat- analyzing mediation models and latent growth models.
form computers (von Oertzen et al., 2015). It is a graph-
ical computing environment for creating and analyzing
structural equation models that can be freely down- COMMERCIAL SEM COMPUTER TOOLS
loaded.10 There is no native programming language
in :nyx. Instead, the user draws the model onscreen. Listed in the top part of Table 5.2 are stand-alone com-
After associating a data file with the diagram, estima- mercial programs for SEM that do not require a larger
tion of model parameters automatically begins. Miss- computing environment. All four of these computer
ing data are handled by the FIML estimator. There is a tools—EQS, IBM SPSS Amos (hereafter just “Amos”),
post-analysis option to synthesize raw data files where LISREL, and Mplus—allow the user to work in batch
cases are selected with sampling error from a hypo- mode (syntax) or specify the model through a draw-
thetical population where model parameters equal the ing editor or templates. The drawing editors in EQS,
sample estimates. The program can also automatically
11 The
generate syntax for Mplus, sem, lavaan, or OpenMx acronym stands for Jeffreys’s Amazing Statistics Program,
that specifies the model represented in the diagram. named after the British mathematician Harold Jeffreys for his
work on Bayesian probability theory.
10 http://onyx.brandmaier.de/ 12 https://jasp-stats.org/

TABLE 5.2. Major Commercial Software for SEM


Software Environment needed Batch (syntax) Drawing editor Template or menu
Stand-alone programs
EQS — 9 9 9
IBM SPSS Amos — 9 9 9
LISREL — 9 9 9
Mplus — 9 9 9

Procedures or commands in larger environments


Builder, sem, gsem Stata 9 9 9
CALIS SAS/STAT 9
SEPATH STATISTICA 9 9
CFA SYSTAT 9
RAMONA SYSTAT 9
74 Concepts, Standards, and Tools

LISREL, and Mplus automatically write program syn- Amos can also analyze mixture models with latent cat-
tax in a separate window that can be edited, run, or egorical variables either with training data, where some
saved. The EQS and LISREL programs can be used in cases are already classified but not the rest, or without
all stages of the analysis from data entry and screening training data. Books by Blunch (2013), Byrne (2016),
to exploratory analyses to SEM. The Amos and Mplus and Collier (2020) support Amos users.
programs have somewhat more limited capabilities for The LISREL (Linear Structural Relations) program
manipulating raw data files, so users of these applica- is the forerunner to all SEM computer programs.15
tions may elect to prepare their data using other com- Available for Windows platform computers, LISREL
puter tools for general statistical analyses. is actually a suite of applications that includes PRE-
There are versions of EQS (Equations) for Windows, LIS (“pre LISREL”), which generates and prepares raw
Apple (Mac/Unix), and Linux platform computers data files for analysis. It also has capabilities for diag-
(Bentler & Wu, 2020).13 Its syntax is straightforward nosing missing data patterns, MI, bootstrapping, and
and based on the Bentler–Weeks representational Monte Carlo simulation (Jöreskog & Sörbom, 2021).
system, in which model parameters are regression The FIML method for missing data is also available in
coefficients for effects on dependent variables and the LISREL. There are additional bundled applications for
variances and covariances of independent variables analyzing multilevel models and for fitting generalized
when means are not analyzed. All types of models linear models to data from complex survey designs.
are thus set up in a consistent way. Special features of There are two LISREL programming languages, its
EQS include the availability of estimators for ellipti- original syntax based on matrix algebra and SIMPLIS
cal distributions with varying degrees of kurtosis, but (“simple LISREL”), which is not based on matrix alge-
not skew. Options for missing data include FIML for bra nor does it require familiarity with the classic syn-
either normal or nonnormal data or a method based on tax except to specify certain output options. The clas-
the expectation–maximization (EM) algorithm. Other sic syntax is not easy to use until one has memorized
features include bootstrapping, the ability to correctly the whole system, but it is efficient: One can specify
analyze a correlation matrix with no standard devia- a complex model with relatively few lines of code.
tions, EFA capabilities with parallel analysis and bifac- Some advanced capabilities, such as nonlinear param-
tor rotation, and special syntax for multilevel analyses. eter constraints, are unavailable when using the SIM-
Books by Blunch (2016) and Dunn et al. (2020) are for PLIS language. Recent books about LISREL include
EQS users. Jöreskog et al. (2016) and Viera (2011).
The Amos (Analysis of Moment Structures) pro- The Mplus program for Windows, Apple (macOS),
gram is for Windows platform computers and does not and Linux platform computers is divided into a base
require the IBM SPSS environment to run (Arbuckle, program and three optional add-on modules (Muthén
2021).14 It has two main parts, Amos Graphics, in which & Muthén, 1998–2017).16 The Base Program for SEM
users control the analysis by drawing the model on the can analyze models with outcomes that are any combi-
screen, and Amos Basic, its syntax editor that works in nation of dichotomous, nominal, ordinal, censored, or
batch mode. Amos Basic is also a language interpreter count variables. It can also analyze discrete- and con-
and debugger for Microsoft Visual Studio VB.NET tinuous-time survival models. There are capabilities for
or C# (“C-sharp”). Users with programming experi- conducting exploratory factor analysis, bootstrapping,
ence can write VB.NET or C# scripts that modify the Monte Carlo simulation, Bayesian estimation, and IRT
functionality of Amos Graphics. Other utilities include analyses. Special syntax supports the specification of
a file manager, a random seed manager for bootstrap- sampling weights—also called survey weights—that
ping, and viewers for data and output files. The FIML correct for systematic differences in probability sam-
method is used for incomplete data files. Amos has pling between target and sample proportions of cases
additional capabilities for Bayesian estimation, includ- with specific demographic or other characteristics.
ing the generation of graphical posterior distributions. There is also special syntax for specifying latent growth
models and the analysis of indirect casual effects. Both
13 https://mvsoft.com/
15 https://ssicentral.com/
14 https://www.ibm.com/products/structural-equation-

modeling-sem 16 https://www.statmodel.com/

Pt1Kline5E.indd 74 3/22/2023 3:52:41 PM


Computer Tools 75

MI and FIML methods are available for handling miss- related to EQS’s original language, among others. The
ing data under the assumption of MAR. Special meth- missing data method in CALIS is FIML, but MI is also
ods in Mplus for data missing not at random (MNAR) available through the larger SAS/STAT environment.
are described in Chapter 9. The diagram for the analyzed model can be drawn
The Multilevel Add-On estimates multilevel ver- onscreen, but the researcher must specify the diagram
sions of the kinds of models analyzed in the Base Pro- in syntax. O’Rourke and Hatcher (2013) describe exam-
gram, and the Mixture Model Add-On estimates mix- ples of SEM analyses in SAS/STAT.
ture model versions, where the data are assumed to be J. Steiger’s SEPATH (Structural Equation model-
sampled from a mix of subpopulations that correspond ing and Path Analysis) is the SEM module in Statis-
to levels of a latent categorical variable. The Combina- tica, an integrated environment for data visualization,
tion Add-On contains all the features of the multilevel simulation, and statistical analysis.19 There is a desk-
and mixture model analyses just mentioned. Of all SEM top version for Windows platform computers (TIBCO
computer tools, Mplus can analyze perhaps the widest Statistica, 2022) and an enterprise version with compa-
range of statistical models, and some of the very newest nywide server support. Models are specified using the
analysis capabilities can show up first in Mplus. Recent PATH1 programming language based on a represen-
books about Mplus include Finch and Bolin (2017), tational system for SEM by McArdle and McDonald
Geiser (2021), Heck and Thomas (2015), and Wickrama (1984) introduced in Chapter 7. There are also tem-
et al. (2022). See Geiser (2023) for examples of SEM plate-based options that are preprogrammed sequences
analyses conducted with both Mplus and lavaan. of graphical dialog boxes for specifying common types
Listed in the bottom part of Table 5.2 are SEM pro-
of structural equation models—see Table 5.2. Special
cedures or functions within larger software environ-
features include the capabilities to correctly analyze a
ments. Builder is the drawing editor for SEM in Stata,
correlation matrix with no standard deviations, gener-
and the commands sem and gsem are for specifying
ate simulated random samples for Monte Carlo studies,
models in syntax (StataCorp, 1985–2021).17 The sem
and precisely control parameter estimation. A separate
command analyzes models with continuous outcomes,
power analysis module (also by J. Steiger) estimates the
and the gsem (generalized SEM) command analyzes
power of various significance tests in SEM.
outcomes that are continuous, dichotomous, categorical
(ordered or unordered), count, or censored variables. There are two SEM procedures in SYSTAT for
The gsem command also has capabilities for multilevel Windows platform computers (Systat Software Inc.,
modeling in an SEM framework and for analyzing 2018).20 The user interacts with RAMONA (Reticular
models based on IRT. Stata automatically generates for Action Model or Near Approximation) by submitting
Builder diagrams the corresponding syntax, which can batch files in the general SYSTAT environment. Syn-
be edited and saved as a text file. Special symbols in tax for RAMONA is relatively straightforward and
Builder designate the underlying distribution (Gauss- involves only two parameter matrices, one for covari-
ian, Poisson, etc.) for observed variables and the cor- ances between independent variables and the other for
responding link function (logit, probit, etc.). Missing direct effects on dependent variables. A second pro-
data are handled by FIML. The book by Acock (2013) cedure is CFA, where the user specifies measurement
covers SEM in Stata. models and analysis options through graphical dialogs
The procedure CALIS (Covariance Analysis of Lin- or templates (Table 5.2). A special feature of both pro-
ear Structural Equations) in SAS/STAT for Windows cedures just described includes the ability to correctly
and Unix platform computers (SAS Institute, 2022) fit a model to a correlation matrix only. There is a
is for SEM.18 It works in batch mode, and users can “Restart” command that automatically takes parameter
specify their models using one of seven different pro- estimates from a prior analysis as starting values in a
gramming languages, including LISMOD, a matrix- new analysis. This capability is convenient when evalu-
based syntax that corresponds to LISREL’s original ating whether a complex model is identified. There
language, and LINEQS, an equation-based syntax are relatively few other advanced features for SEM in
17 https://www.stata.com/ 19 https://www.statistica.com/

18 https://www.sas.com/ 20 https://systatsoftware.com/

Pt1Kline5E.indd 75 3/22/2023 3:52:41 PM


76 Concepts, Standards, and Tools

SYSTAT including the capability to simultaneously fit 3. The Matlab program is a computing environment
a model to data from multiple groups. and programming language for data analysis, visu-
alization, and simulation (Mathworks, 2022).23
Williams (2021) describes the Toolbox for SEM, a
SEM RESOURCES FOR OTHER set of functions for estimating structural equation
COMPUTING ENVIRONMENTS models with continuous outcomes.

Resources for conducting SEM analyses in more spe-


cialized computing environments are listed next: SUMMARY

1. The freely available semopy (Structural Equation Modern SEM computer tools are generally no more
Models Optimization in Python) package for the difficult to use than other computer programs for mul-
object-oriented programming language Python tivariate statistical analyses. The capability to specify
relies on lavaan-like syntax, features relatively a model by drawing it onscreen helps beginners to be
fast processing times, and has capabilities for impu- productive right away, but with experience they may
tation of missing values, analysis of ordinal out- find that specifying models in syntax is actually more
comes, and estimation of random coefficients mod- straightforward. Problems can be expected in the anal-
els (Igolkina & Meshcheryakov, 2020).21 A second ysis of complex models, and no amount of user-friend-
edition of semopy is recently available (Meshch- liness in the interface of a computer tool can negate this
eryakov et al., 2021). fact. When things in the analysis go wrong, you need,
2. Mathematica is a software system for technical and first, to have a good understanding of the problem and,
symbolic computation with capabilities for inter- second, basic computer skills to correct the problem.
facing with programs written in other languages You should not let ease of computer tool use lead you
(Wolfram Research, 2022).22 Oldenburg (2020) to carry out unnecessary analyses or select analytical
describes Mathematica code for estimating mod- methods or options you do not understand. The con-
els in both the traditional way through covariance cepts and tools covered in Part I of this book set the
matrix-based computations and through methods stage for considering the specification and analysis of
based on least squares optimization that can also be basic kinds of structural equation models in Part II.
applied to nonlinear models.

21 https://semopy.com/

22 https://www.wolfram.com/ 23 https://www.mathworks.com/

Pt1Kline5E.indd 76 3/22/2023 3:52:41 PM


Part II

Specification, Estimation,
and Testing

Pt2Kline5E.indd 77 3/22/2023 3:43:59 PM


Pt2Kline5E.indd 78 3/22/2023 3:43:59 PM
6

Nonparametric Causal Models

The principal concepts in Pearl’s (2009) nonparametric approach to SEM (i.e., the structural causal model)
are introduced in this chapter. A causal model as described next corresponds to a directed acyclic graph
(DAG), which depicts hypotheses of unidirectional causation or temporal ordering among variables in graph-
ical form. A DAG does not allow for causal loops, where ≥ 2 variables are specified to have direct or indirect
effects on each other. In contrast, a directed cyclic graph (DCG) includes at least one causal loop, but we
will deal with such models later in the book. A directed graph as a nonparametric causal model implies two
things: (1) The researcher makes no commitment to distributional assumptions for any variable. (2) A direct
causal effect represents all forms of the functional relation between a putative cause and effect. If variables
X and Y are both continuous, for example, the specification X → Y in a nonparametric model represents the
linear and all curvilinear trends of the causal effect of X. But in parametric causal models—introduced in the
next chapter—the specification X → Y represents just the linear trend for continuous variables.
Another difference is that there are methods and computer programs for analyzing directed acyclic
graphs before any data are collected. Such analyses can alert the researcher to the presence of confounding,
including whether it would be necessary to measure additional variables in order to estimate any specific
causal effect with less bias. It is easier to address the problem of omitted variables when the study is being
planned than after the data are collected. Even if no new variables are required, analysis of the graph can
indicate options for estimating a target causal effect, including which covariates or instruments to select, in
the presence of confounding. A tutorial on instruments (i.e., instrumental variables) is offered in this chapter.
Analysis of a DAG can also help researchers to find testable implications of their causal hypotheses. No
special software is needed to analyze the data. This is because testable implications for linear models with
continuous variables can be evaluated using partial correlations, which can be estimated with standard com-
puter tools for statistical analysis. Thus, researchers who understand nonparametric causal models are better
prepared for all stages of SEM, from model specification through data collection to analysis and reporting
of the results. So if the ideas considered next seem at first unfamiliar, trust me, it is worth persevering. The
payoff will be apparent in later chapters when we apply the ideas outlined next.

GRAPH VOCABULARY pair of variables is adjacent if they are connected by


AND SYMBOLISM an edge; otherwise, that pair is nonadjacent. An arrow
or directed edge represents a presumed direct causal
Variables in directed graphs are also called nodes or effect between two variables, such as
vertices. Some variables are connected by arcs, also
known as edges or links, that designate presumed
functional or statistical dependencies in the graph. A X→Y (6.1)

79

Pt2Kline5E.indd 79 3/22/2023 3:43:59 PM


80 Specification, Estimation, and Testing

where X is a presumed cause of Y. In contrast, a bidi- causal effects from variables at the beginning of the
rectional edge, often symbolized with an arc rendered path to “downstream” variables at the end the path. A
as a dashed line instead of a solid line with arrowheads directed path is also called a front-door path because
at each end, such as it starts with an arrow pointing away from the cause,
and all subsequent arrows in the path are oriented
X Y (6.2) in the same direction. An undirected path is a path
where the arrows do not all point in the same direc-
designates a spurious (noncausal) association between tion. Undirected paths might convey statistical associa-
X and Y due to ≥ 1 unmeasured (latent) common causes. tion, but not causation between variables at either end
An alternative symbolism in a causal DAG is of the path. The goal of specifying a causal DAG is the
same as for any parametric model in SEM: The graph
X UC Y (6.3) represents all hypothesized connections, causal or non-
causal, between any pair of variables in the model.
where UC explicitly represents all unmeasured com-
mon causes of X and Y, and the arrows are rendered as
dashed lines because the corresponding causal effects CONTRACTED CHAINS
on both X and Y are not directly observed. AND CONFOUNDING
By convention, error terms for outcome variables are
not included in a causal DAG. This is in part because Presented in Figure 6.1(a) is the smallest causal struc-
it is assumed that all variables, causal or outcome, ture, the contracted chain X → Y (i.e., Equation 6.1).
will have idiosyncratic error due to unobserved fac- The two variables in a contracted chain are uncondi-
tors that can vary over time or units (cases, settings, tionally dependent because there are no intervening
regions, etc.). Also, error terms for outcome variables variables that could disrupt or block the causal coor-
are implied by arrows in the graph that point to them dination between them. Represented in Figure 6.1(a) is
from other variables, such as in Equation 6.1, where Y the total causal effect—hereafter called just the total
is the outcome. The hypothesis of overlapping or cor- effect—of X on Y. Besides the hypothesis about direc-
related error is represented by Equation 6.2, where tionality, the figure also assumes there are no unmea-
a bidirectional edge connects two variables X and Y sured confounders, or latent common causes of both X
(Equation 6.3 is an alternative specification). Corre- and Y. A variation on this assumption is that all omitted
lated errors are always explicitly represented in a DAG; causes of Y are uncorrelated with X. For this reason,
otherwise, error terms for outcomes are assumed to be variable X can be described as an exogenous regressor
independent. in the prediction of Y.
The direct causes of a variable in a DAG are it par- If the assumptions for Figure 6.1(a) just stated are
ents, and all direct or indirect causes of a variable are correct, the coefficient from the regression of Y on X
its ancestors. All variables directly caused by a given would estimate the total effect without bias. Because
variable are its children, and its descendants include the model is nonparametric, though, no particular
all variables directly or indirectly caused by that same regression technique can be specified. For instance, if
variable. All parents in a DAG are ancestors just as all both X and Y were continuous and their relation strictly
children are descendants. A variable with no parents linear, the method would be bivariate ordinary least
is exogenous, and a variable with at least one parent is squares (OLS) regression. But if Y were dichotomous,
endogenous, just as in parametric structural equation the regression method could be logistic regression or
models. probit regression and, if time-to-event data were also
A path is a sequence of adjacent edges that con- collected for varying risk periods, a proportional haz-
nect ≥ 2 variables regardless of the directions of those ards model, such as a Cox model, could be specified,
edges (i.e., unidirectional or bidirectional). It passes among other possibilities for binary outcomes (Lee et
through any variable along the path just once. In a al., 2009). The point is that the choice of regression
directed path, all edges are unidirectional arrows that technique depends on assumptions about the functional
point away from a cause toward an outcome at the end form of the causal effect and on the level of measure-
of the path through possibly ≥ 1 intervening variables ment for both X and Y, but nonparametric models are
(i.e., indirect effects). Thus, directed paths transmit not concerned with such details.

Pt2Kline5E.indd 80 3/22/2023 3:43:59 PM


Nonparametric Causal Models 81

(a) Contracted (b) Unmeasured Unmeasured confounding bias is implied in Figure


chain intermediary 6.1(c) by the bidirectional edge that connects X and Y,
and that bias is explicitly depicted in Figure 6.1(d) by
UM the biasing path—also called a back-door path—that
X Y starts with an arrow pointing toward X and ends with
X Y an arrow pointing toward Y with UC as their latent com-
mon cause (i.e., Equation 6.3). A confounder does not
lie on a directed path between a cause and an outcome;
Unmeasured common cause that is, a confounder is not an intervening variable
(c) Implied (d) Explicit along an indirect effect—compare Figures 6.1(b) and
6.1(d). The biasing path in Figure 6.1(d) transmits non-
UC causal association between X and Y, which will be con-
founded with the total causal effect when Y is regressed
on X (i.e., the coefficient will estimate both together).
X Y X Y
The degree of bias depends on the magnitudes and
directions of the effects of unmeasured confounders
Identifying X Y on observed variables in the model, and that bias can
(e) Proxy (P) (f) Instrument (Z) be substantial, especially if there are ≥ 2 unmeasured
confounders that are uncorrelated (Fewell et al., 2007).
UC P Figures 6.1(c) and 6.1(d) can also be seen as imply-
ing a correlation between a causal variable X and the
Z X Y
error term of its outcome Y. The terms endogeneity and
X Y
simultaneity describe the same basic idea. A measured
causal variable that we presume overlaps with unmea-
FIGURE 6.1. Contracted chains and identification of sured causes of the outcome is called an endogenous
causal effects in the presence of unmeasured common regressor. VanderWeele (2019) described two basic
causes through covariate selection or instrumental variables. methods for dealing with unmeasured confounders: (1)
to control confounding bias in the analysis phase and
also (2) to identify the causal effect of interest.1 These
methods are covariate selection and the instrumental
The total effect of X on Y could also be estimated variable method, as described next.
without bias if there are ≥ 1 unmeasured intervening
variables that lie along a directed path between the two
variables and there are no unmeasured confounders COVARIATE SELECTION
of X and Y. This situation is depicted in Figure 6.1(b),
where UM is a latent intermediary variable along the In covariate selection, a proxy for a latent confounder is
path measured and specified as a covariate in the regression
of the outcome on both the cause and the proxy. This
X UM Y (6.4) analysis statistically controls for the proxy of an omit-
ted confounder when the coefficient for the presumed
In Figure 6.1(b), variable X has both direct and indirect cause is computed. Suppose that patient frailty is sus-
effects on Y, but the indirect effect through UM is not pected to be a latent confounder for both treatment and
observed. In the regression of Y on X, the coefficient outcome such that (1) physicians are less likely to use
will estimate both the direct and indirect effects but certain treatment options for frail patients, and (2) frail
with a single numerical value. This is not a problem in patients are also expected to have worse outcomes in
terms of the total effect of X on Y—the estimate is unbi-
ased—but the coefficient for X would not approximate 1 Thereare also design-based approaches to dealing with unmea-
just the direct effect of this variable, again assuming sured confounders, such as case-crossover designs for estimat-
correct specification of directionality and no omitted ing short-term effects of intermittent or time-varying exposures,
confounders for X and Y. among other variations described by Uddin et al. (2016a).

Pt2Kline5E.indd 81 3/22/2023 3:43:59 PM


82 Specification, Estimation, and Testing

general. If frailty increases with age, patient age could but, for reasons explained momentarily, a poor choice
be measured as a proxy for frailty, given no other of covariates can actually increase bias. But knowing
direct measure of this variable (Stürmer et al., 2007). something about directed acyclic graphs as nonpara-
Another example is the measurement of pretreatment metric causal models can give the researcher an edge
blood pressure as a proxy for smoking among patients in covariate selection. Exercise 1 asks you to (1) draw
among whom histories of tobacco use are unmeasured a new version of Figure 6.1(e) assuming that residual
or poorly measured (Uddin et al., 2016a). confounding is not zero (i.e., P does not perfectly mea-
Covariate selection as a method to identify a causal sure UC); (2) generate all paths between X and Y in the
effect is depicted in Figure 6.1(e), where P is the proxy modified DAG; and (3) comment on expected effects of
for UC, a latent confounder of both X and Y. The figure controlling for P in this likely more realistic scenario.
assumes an ideal proxy such that P captures essentially
all aspects of UC that confound X and Y. That is, the
figure assumes there is zero residual confounding INSTRUMENTAL VARIABLES
because analyzing P as a covariate entirely controls for
all effects of UC. This assumption is reflected in the fig- A second way to reduce unmeasured confounder bias
ure by the absence of arrows from UC to both observed in the analysis phase is the instrumental variable
variables, given P. method, which is represented in Figure 6.1(f). An
There are two paths between X and Y in Figure 6.1(e): instrumental variable or instrument Z is a measured
variable that
X←P→Y (6.5)
X→Y 1. Influences the causal variable X.
2. Has no direct effect on outcome Y except through X
The first path in Equation 6.5 is a back-door path that (i.e., Z does not affect Y, if X is held constant).
reflects measured confounding bias because P, the 3. Is independent of unmeasured confounding (Baioc-
proxy, is measured, not latent. Regressing Y on both X chi et al., 2014).
and P breaks the spurious association between X and
Y. Another way to describe the same analysis is that The first condition just listed refers to relevance, the
conditioning on P closes (deactivates, blocks) the bias- second condition is the exclusion restriction, and third
ing path, which renders X and Y statistically indepen- condition is called exogeneity, or the requirement that
dent except for their causal relation, or the second path Z shares no causes with Y (Swanson & Hernán, 2013).
in Equation 6.5. That directed path between X and Y Another way to describe all three conditions is that Z
remains open after controlling for P because the proxy is correlated with X but is uncorrelated with the error
is not part of this path. Thus, the goal of covariate term for Y (Bollen, 2012). Exercise 2 asks you to explain
selection in causal modeling can be succinctly stated: the difference between an instrument and a confounder.
Conditioning on the covariate(s) should close all non- Conceptually, instrumental variable estimation
causal paths that transmit spurious association while works by (1) using the instrument Z to extract variation
leaving causal paths open.2
from the causal variable X that is free of unmeasured
It is unlikely that any single proxy would perfectly
confounders. Next, (2) the outcome Y is regressed on
measure a latent confounder. For example, classify-
the adjusted (confounder-free) version of X, designated
ing age as simply “young” or “old” would be a crude
here as X Z , and the coefficient for X Z estimates the
proxy for chronological age, and residual confounding
causal relation (Baiocchi et al., 2014). The basic logic
might be substantial, if age as a continuous variable is
just outlined corresponds to the method of two-stage
a confounder. It is a common strategy to measure mul-
least squares (2SLS), where “problematic” causal
tiple covariates as proxies for ≥ 1 latent confounders
variables—those thought to covary with unmeasured
2 Otherways to condition on a variable include stratification, sub-
confounders—are replaced with versions corrected by
group analysis, or sampling from populations with certain values instruments and then specified as predictors of the out-
on key variables (e.g., a survey of employed mothers). Condition- come. Extending this conceptual definition of the 2SLS
ing can also happen inadvertently due to missing data (Elwert, method to Figure 6.1(f), variable X is regressed on
2013). instrument Z, and then outcome Y is regressed on the

Pt2Kline5E.indd 82 3/22/2023 3:43:59 PM


Nonparametric Causal Models 83

part of X that is explained by Z, or X Z . Assuming that of using typical antipsychotic drugs were comparable
Z satisfies the requirements for an instrument, the cor- with or slightly higher than the risk of death associated
rected predictor X Z is independent of the unmeasured with atypical antipsychotic medications (respectively,
confounder, and regressing Y on X Z should estimate 14.1% vs. 9.6%). Somewhat higher mortality was asso-
the causal effect of X with less bias than the coefficient ciated with haloperidol, the most frequently prescribed
obtained by regressing Y on X in a standard (one-stage) typical antipsychotic drug, compared with risperidone,
OLS regression analysis. the most frequently prescribed atypical medication.
Computer implementations of 2SLS are actually car- In a national cohort of over 97,000 patients 50 years
ried out in a single step, where the researcher specifies and older who suffered a hip fracture and were not pre-
causal variables, their instruments, the outcome, and viously treated with osteoporosis medication, Desai et
not over two separate analysis steps. Standard errors al. (2018) estimated the effect of initiating osteoporosis
for instrument-corrected predictors, such as for X Z in treatment or not treating on rates of subsequent non-
Figure 6.1(f), are derived using special methods that vertebral fractures over the following 10 years. The
simultaneously estimate sampling error in predicting instrumental variable with the strongest relation with
X from Z and also in predicting Y from XZ , including treatment was hospital-level rates of prescribing osteo-
methods that are also robust against heteroscedastic- porosis drugs controlling for patient age and gender.
ity (Baiocchi et al., 2014). The 2SLS method is strictly Other instruments included calendar year (i.e., chang-
a single-equation method, where each outcome vari- ing rates of prescriptions over time), access to special-
able has its own equation. There are additional types ists, and geographic region. All four instruments were
of instrumental variable methods, such as three-stage expected to directly affect treatment but should relate
least squares (3SLS), which allows for correlated to outcome only indirectly through treatment. Known
errors over multiple outcomes. This makes 3SLS more confounding variables, such as comorbid medical con-
like a simultaneous estimation method in that it takes ditions and history of falls, were also measured. That
account of the features in other outcome and predictor is, instruments and covariates were analyzed together
variables when estimating coefficients for each causal to address, respectively, unmeasured and measured
variable—see Angrist and Krueger (2001) and Bollen confounding. The results indicated that lower rates of
(2012) for more information. subsequent fractures by about 4.2 events per 100 per-
Because instrumental variable methods are not as son-years were associated with osteoporosis treatment
intuitive as covariate selection, two empirical examples initiation compared with nonuse of the treatment, an
are described next. In a cohort of about 37,000 people at effect size considered by Desai et al. (2018) to be clini-
least 65 years old, Schneeweiss et al. (2007) compared cally meaningful.
the 6-month all-cause mortality rate between patients An advantage of the instrumental variable method
given typical antipsychotic medications and those is that it does not require specific knowledge of unmea-
prescribed atypical antipsychotic medications.3 The sured confounders. This is because bias in estimates of
instrument was a doctor’s preference for prescribing causal effects will be reduced, if the requirements for
typical versus atypical medications. This was indicated instruments stated earlier are satisfied. This advantage
by how recently the doctor had prescribed a new anti- can be especially germane for researchers who ana-
psychotic drug. It was assumed that physician prefer- lyze large databases, such as from state, provincial, or
ence here is independent of patient characteristics that national governments, about population-level health-
could confound the association between drug type and care use and long-term outcomes. Such data are often
mortality and that doctor preference would affect mor- collected for reasons unrelated to research hypotheses
tality only through the type of antipsychotic medica- and thus may include relatively little information about
potentially important confounders (Stürmer et al., 2007).
tion actually prescribed to individual patients. Rates of
For example, smoking is an obvious possible confounder
death among elderly patients within the first 6 months
in health research, but accurate information about smok-
3 Typical (first-generation) medications (e.g., haloperidol) are ing history may not be available in archival data.
associated with potentially severe extrapyramidal symptoms, or A risk is selecting a weak instrument that has a low
drug-induced movement disorders. Atypical (second-generation) covariance with the presumed causal variable (i.e., rel-
antipsychotics (e.g., risperidone) are less likely to induce extra- evance is suspect). There is a no single definition of a
pyramidal side effects. “low” covariance, but as absolute covariances approach

Pt2Kline5E.indd 83 3/22/2023 3:43:59 PM


84 Specification, Estimation, and Testing

zero, results based on analyzing weak instruments can convenience. For example, the instrument in Figure
actually increase bias, especially if the other assump- 6.1(f) is exogenous, but instruments in larger graphs
tions for an instrument are also violated (Bound et al., can be endogenous, too, if they meet the essential
1995). Various rules of thumb based on significance requirements stated earlier. But it can be difficult in a
testing, such as partial F tests of the association between large graph to determine by eye whether one variable is
instruments and causal variables, and effect sizes have a proper instrument for another variable. So approach-
been proposed for detecting weak instruments (Davies ing the analysis with a complete, well-thought-out
et al., 2013). For example, in a multiple database study causal graph can facilitate both covariate selection
of the relation between use of antidepressant medica- and instrument selection.
tion and risk of hip fracture, Uddin et al. (2016b) used
a threshold of r < .15 to identify weak instruments,
but suitable minimum effect size thresholds may vary CONDITIONAL INDEPENDENCIES
over studies. In general, it is better to have a few strong AND OTHER TYPES OF BIAS
instruments that are relevant for the causal variable than
many weak instruments. Angrist and Krueger (2001) Presented in Figure 6.2(a) is a fork, or the smallest
described examples of instruments in both observa- model with two variables that have a common cause.
tional and randomized studies, and Uddin et al. (2016b) Variables X and Y have no causal relation, but the single
described challenges in finding good instruments for back-door path that connects them to A, their common
time-varying exposures, such as repeated cycles of drug cause, or
treatment and cessation over a period of time.
There are also ways to empirically check the require- X←A→Y
ments for instrument exclusion and exogeneity, but such
falsification tests are usually not conclusive because induces a spurious association. Conditioning on A
the full truth about omitted confounders is rarely would block the back-door path list and render X and
known. For example, observed covariances between Y unrelated. Both implications of the figure just men-
instruments and measured confounders (if any) should tioned can be summarized as follows:
be appreciably low; otherwise, the exogeneity assump-
tion is suspect. Estimates of direct effects of instruments X⊥Y (6.6)
on outcomes while holding causal variables constant
should also be relatively low in magnitude; otherwise, X⊥Y|A
the exclusion restriction may be untenable. Authors of
only about 20% of studies where instrumental vari- In other words, variables X and Y are dependent, if
able methods were used to estimate effects of medical we ignore their common cause A, but they are con-
interventions reviewed by Swanson and Hernán (2013) ditionally independent, if we control for A. Thus,
provided reports of falsification tests. There is also the (1) the conditioning set that renders X and Y indepen-
Hausman specification test of whether a particular dent is (A), and (2) the second expression in Equation
causal variable is characterized by endogeneity, that 6.6 is the implied conditional independence, given
is, whether that variable is an endogenous regressor ­Figure 6.2(a).
(Hausman, 1978). It compares unstandardized coeffi- Figure 6.2(b) is a chain, or the smallest model for
cients for the same predictor from standard (i.e., one- three variables with a single directed path, or
stage) regression analysis versus instrumental variables
regression, and the rejection of the null hypothesis of X→A→Y
coefficient equality indicates endogeneity (i.e., instru-
ments are needed). Unlike Figure 6.2(a), the relation between variables
Knowledge of nonparametric causal models depicted X and Y is causal; specifically, that causal relation is
as a DAG can also help the researcher to select proper entirely indirect through the intervening variable A.
instruments. This is because there are precise, graphi- For reasons explained in Topic Box 6.1, I would not use
cal methods to find instruments (if any) for putative either the term “mediation” to describe Figure 6.2(b) or
causal variables. These methods can be applied by the term “mediator” to describe variable A in the graph
computer programs for analyzing graphs, which is a without more information about the study context. This

Pt2Kline5E.indd 84 3/22/2023 3:44:00 PM


Nonparametric Causal Models 85

(a) Fork (b) Chain (c) Inverted fork


X┴Y X┴Y X┴Y
X┴Y|A X┴Y|A X┴Y|A

X X

A X A Y A

Y Y

FIGURE 6.2. Elementary structures in directed acyclic graphs with three observed variables and corresponding implied
dependencies and independencies. (a) Fork (common cause confounding), (b) chain (intervening variable), and (c) inverted
fork with a collider.

is because the hypothesis of mediation demands much also occur when conditioning on the descendant of
more than just a graph where at least three variables an intervening variable. Suppose that variable B were
are connected by a directed path. Thus, I will use more added to Figure 6.2(b) as a child of A (i.e., A → B).
neutral terms, such as “indirect effect” or “intervening Regressing Y on X and B (but not A) would partially
variable” when referring to chains like the one in the close the indirect pathway from X to Y through A
figure. by removing the overlap between A and its other
In Figure 6.2(b), only adjacent variables, such as X ­outcome, B.
and A, have direct causal dependencies. The nonadja- Figure 6.2(c) is an inverted fork, or the smallest
cent pair X and Y is dependent due to the directed path graph with an undirected path that includes a collider,
that runs through A, but this dependence can be bro- or
ken, if the intermediary A is deactivated. For example,
regressing Y on both X and A would render X and Y X→A←Y
statistically independent. This is because A is the sole
intervening variable between X and Y, and including where variable A is the collider. A collider is a com-
A as a covariate closes the directed path between this mon outcome that lies along an undirected path with
pair of variables. Both implications of Figure 6.2(b) just two arrows pointing into it. Note that a variable can be
mentioned are expressed by Equation 6.6. Because both a collider along one path but not a collider along a dif-
Figures 6.2(a) and Figure 6.2(b) imply the same con- ferent path in the same graph, but it is common to refer
ditional independence (i.e., X and Y are independent, to variables with at least two parents as colliders. The
given A), they are equivalent. Exercise 3 asks you to term collider suggests a pileup of causal forces. Any
find a third DAG for the same three variables that is path with a collider is blocked, closed, or inactive. This
equivalent to the two graphs just mentioned. is because a collider blocks any association (includ-
Inadvertently controlling for an intervening variable ing causal effects) between variables at either end of
along a directed path from cause to outcome can lead a path with a collider. In contrast, a path with no col-
to overcontrol (overadjustment) bias (Elwert & Win- lider is unblocked, open, or active, and thus potentially
ship, 2014), which discards some, or all, of the indirect conveys statistical association through the path. For
causal effect. For example, regressing Y on both X and example, the paths between X and Y in Figures 6.2(a)
A, the intervening variable, would completely sever the and 6.2(b) are open because they have no collider, but
association between X and Y, given Figure 6.2(b); that the path between X and Y in Figure 6.2(c) is blocked
is, cause X and outcome Y are rendered conditionally by the collider A (the path between X and Y is closed).4
independent after controlling for A, the intervening
variable. This is why the technique of multiple regres- 4 Figure 6.2(c) allows for the possibility that causes X and Y
sion assumes no causal effects among the predictors; interact in their effects on their common outcome. Nilsson et al.
that is, there is a single equation, that of the criterion. (2021) described explicit symbolism for representing the hypoth-
Less intuitive but still true is that overcontrol bias can esis of interaction in a causal DAG.

Pt2Kline5E.indd 85 3/22/2023 3:44:00 PM


86 Specification, Estimation, and Testing

TOPIC BOX 6.1

Mediation: Definition
Little (2013, p. 287) defined mediation as the “strict causal hypothesis . . . about the way in which . . .
one variable causes changes in another variable, which in turn causes changes in an outcome variable.”
Mediation as a causal hypothesis means that if the presumed model is not correct, results from data analy-
ses may have no meaningful interpretation; that is, mediation is not statistically defined (Kenny, 2021).
Instead, mediation is assumed (i.e., specified), and although analysis results can be used to evaluate that
assumption, mediation is not something that is “discovered” through analysis without a priori hypotheses.
This is why Pearl (2000, p. 136) reminded us that “causal assumptions are prerequisite for validating any
causal conclusion.”
The emphasis on changes in Little’s (2013) definition indicates the requirement for temporal pre‑
cedence, that is, causes precede their effects by some finite amount of time, however brief. Although
quantum causality allows for apparently simultaneous causation between particles that behave as a single
entity over great distances (entanglement, nonlocality; Carvacho et al., 2019), temporal precedence is part
of both classical mechanics at larger physical scales and every day, intuitive judgments about causation.
Without evidence for change, the only effects that can be supported are indirect effects, but not mediation.
In other words, mediation always involves indirect effects, but not all indirect effects automatically signal
mediation. This is especially true in cross-­sectional designs, where all variables are measured at the same
occasion. Stronger designs for establishing mediation that feature temporal precedence are described
later in the book.
Perhaps the most basic assumption is that of modularity, which means that an indirect causal
process consists of parts that are potentially isolatable, and thus can be understood as separate entities
(cause, mediator, outcome) instead of as an organic, holistic, and inseparable whole (Knight & Winship,
2013). The basic nonparametric model of mediation presented next

X Y

assumes that (1) all specifications about directions of causal effects are correct, for example, that X affects
Y and not the reverse or that X and Y mutually influence each other (a causal loop). Also, (2) there are no
unmeasured confounders for any pair of variables. Instruments or proxies could be added to the basic
nonparametric model just presented, if the assumption of no unmeasured confounders for a particular pair
of variables is doubtful. A nonparametric model allows for the possibility that the cause and mediator
interact, for example, that the effect of X on Y changes with the level of M, and also that the effect of M on
Y is conditional on X. In a parametric model, introduced in the next chapter, the hypothesis of interaction
should be explicitly represented in the model. Parametric models of mediation make additional assump-
tions about measurement error, but nonparametric models consist of theoretical, not actual, variables, so
measurement error is a not an issue for such models.

Pt2Kline5E.indd 86 3/22/2023 3:44:00 PM


Nonparametric Causal Models 87

Listed next are the patterns of independence versus tors for COVID-19 disease and severity. Munafò et
dependence implied by Figure 6.2(c): al. (2018) discussed how sampling from special pop-
ulations or case attrition in longitudinal studies can
X⊥Y (6.7) amount to conditioning on a collider and thus distort
X⊥Y|A understanding of genetic or environmental causes of
health. Additional examples in epidemiology of poten-
The first prediction of the graph just listed is that X and tial bias due to conditioning on a collider are described
Y are independent causes of A; that is, they are unre- by Cole et al. (2010).
lated without controlling for any other variables (i.e., Perhaps even less intuitive but just as profound is
the conditioning set is empty). This assumption is based the fact that conditioning on the descendant of a col-
on the absence of any path that connects X and Y in Fig- lider also induces spurious association. For example, if
ure 6.2(c) that does not include a collider. But it is the variable B were added to Figure 6.2(c) as a child of the
second prediction in Equation 6.7 that corresponds to collider A (i.e., A → B), the revised graph would also
a special insight about colliders, one with many rami- predict
fications: If we now condition on the collider, such as
when regressing Y on both X and A, then the variables X⊥Y|B
X and Y become be related. That is, controlling for a
common outcome of two independent causes induces a Although variable A in the revised DAG does not lie
spurious association between them. Even if two causes along a path between X and Y, controlling for B will
are correlated, controlling for their common child adds nevertheless open a path between them (induces a spu-
a spurious component to their observed association. It rious association), where B is a descendant of the col-
does so by unblocking (opening, activating) the path lider A.
between them previously closed by the collider. A related idea is that an outcome is always a collider
Here is an intuitive example: Suppose that students on a path from a cause to the error term of that out-
in a private school were selected because of musical come. (Recall that error terms are implied in a DAG.)
giftedness or athletic prowess, which we assume are Thus, if the researcher conditions on a variable that
unrelated. If we know that a student has no musical is the descendant of the outcome, the estimate for the
talent, then by default we can say that the student is causal effect can be distorted (Schneider, 2020). In Fig-
an athlete, or vice versa. That is, the refutation of one ure 6.2(b), for example, outcome variable A is a collider
cause of admission (musicality) confirms the action of along the path between its cause, X, and its error term,
the other cause (athleticism), which induces a negative UA, shown explicitly next:
dependence of the two causes. Likewise, given confir-
mation of one cause eliminates the need to invoke the X→A UA
other, which is described as the explaining away effect
in the artificial intelligence literature (Pearl, 2009) Because A is a collider in the path just listed and there
and as Berkson’s paradox in the statistical literature is no other path between X and UA in the graph, then
(Berkson, 1946). In this example, musical and athletic A and UA are unrelated. But if variable A is regressed
abilities would be negatively correlated among students on X and also on Y, the descendent of collider A, then
in the school. The composition of the sample is a col- (1) the path between X and UA is opened, and (2) a spu-
lider independently caused by these two abilities, and rious association is induced between X and unmeasured
conditioning on that collider induces a negative correla- causes of A.
tion between its independent causes. Controlling for a collider—or for the descendant of
That controlling for a collider imparts spurious a collider—along an undirected path between a puta-
association between its causes is not just a theoretical tive cause and its outcome can lead to collider bias,
curiosity, but it is a real phenomenon of concern in dif- also called endogenous selection bias, where spurious
ferent research areas. For example, Griffith et al. (2020) associations are induced that may be falsely interpreted
described how collider bias induced by sampling from as evidence for causation. This problem is especially
patients admitted to hospitals or who volunteered to critical when researchers control for what they believe
participate can complicate the estimation of risk fac- to be background (causal or confounder) variables that

Pt2Kline5E.indd 87 3/22/2023 3:44:00 PM


88 Specification, Estimation, and Testing

are really outcomes with two or more parents. Elwert the symbol Z (e.g., Figure 6.1(f)), this phenomenon
and Winship (2014) describe endogenous selection bias is referred to by some methodologists as Z-bias. It
as representing the common structure of biases usu- happens because controlling for an instrument can
ally referred to with different names, including ascer- actually increase variation in the causal variable
tainment bias, which occurs when some members of due to an unmeasured confounder, which amplifies
the target population are less likely than others to be bias.
included in the final results, and homophily bias in
social network analysis, where causal effects in rela- Presented in Figure 6.3 are three directed acyclic
tionships are confused with social ties between people graphs based on examples by Howards et al. (2012),
seen as similar to one another, among others. That is, where S1 is the occurrence of a spontaneous abortion
collider bias is a concept that unifies types of biases (miscarriage) during a first pregnancy, S2 refers to the
described in various disciplines but called different same outcome during a second pregnancy, E is expo-
names. Schneider (2020) describes other subtle varia- sure to a presumed risk factor, and A is a physiologi-
tions on collider bias in economic history research. But cal abnormality. The goal is to estimate the total causal
the three main types of bias described to this point— effect of exposure on a second miscarriage. From a
confounding (common-cause), overcontrol, and col- regression perspective, it might seem reasonable to
lider bias—are fundamental identification problems in specify S1 as a covariate, but doing so would lead to
nonparametric models. Exercise 4 asks you to differen- bias in all cases represented in the figure. For exam-
tiate the three kinds of bias just mentioned. ple, Figure 6.3(a) depicts the hypotheses that exposure
causes an abnormality that is a common cause of mis-
carriage in both pregnancies and that miscarriage in
PRINCIPLES FOR the first pregnancy affects the same in the second preg-
COVARIATE SELECTION nancy. All paths between E and S2 in Figure 6.3(a) are
directed:
Summarized next are general principles for selecting
covariates when regressing an outcome on a presumed E → A → S2 (6.8)
cause, given a nonparametric model (i.e., a DAG) for E → A → S1 → S2
the problem (Ding et al., 2017; Elwert & Winship, 2014;
Shrier & Platt, 2008): Regressing S2 on both E and S1 would result in overcon-
trol bias because S1 is an intervening variable along the
1. It is appropriate to control for the confounding
second directed path in Equation 6.8 (i.e., that directed
effects of a common cause; otherwise, measured path is closed after controlling for S1).
confounding bias may occur. This idea is generally The hypotheses represented in Figure 6.3(b) are
well known in the behavioral sciences.
that exposure and a physical abnormality are unrelated
2. Inadvertently controlling for an intervening vari- common causes of miscarriage over two pregnancies,
able (or its descendant) along a directed path from which implies that the association between miscarriage
cause to effect can lead to overcontrol bias. in the first and second pregnancies is spurious. Listed
3. Controlling for a collider (or its descendant) along a next are both paths between E and S2 in the figure:
path linking cause and effect leads to collider bias,
or inducement of a spurious association between E → S2 (6.9)
cause and effect. Likewise, avoid specifying the E → S1 ← A → S2
descendant of an outcome variable as a covariate
for the same reason. Because S1 is a collider in the second path just listed, no
4. Avoid controlling for a variable that is an instru- association between E and S2 is transmitted along this
ment for the causal variable. This is because con- closed path, but regressing S2 on both E and S1 opens
ditioning on a variable that directly affects the the path and induces a spurious association between E
cause, but has no impact on the outcome except and S2, or collider bias.
indirectly through the cause, can actually increase The example in Figure 6.3(c) is more complex. This
bias. Because instruments are often designated by graph represents the hypotheses that (1) exposure has

Pt2Kline5E.indd 88 3/22/2023 3:44:00 PM


Nonparametric Causal Models 89

(a) Overcontrol (b) Collider (c) Collider and


overcontrol
E A E E A S1
S1 S2
S1 S2 A S2

FIGURE 6.3. Expected types of bias in estimating the causal effect of E on S2 when S1 is specified as a covariate; overcontrol
bias (a), collider bias (b), and both overcontrol and collider bias (c). E, exposure; A, physical abnormality; S1 and S2, spontane-
ous abortion at, respectively, first and second pregnancies.

both direct and indirect effects on miscarriage in the 1. Specify as a covariate any variable that is a proxy
second pregnancy through a physical abnormality; (2) for unmeasured causes of both exposure and out-
the physical abnormality directly affects miscarriage come.
over both pregnancies; and (3) miscarriages in both 2. Exclude from the set of covariates any variable
pregnancies have a common unmeasured cause. The known to be an instrument for exposure.
paths between E and S2 in this graph are listed next:
3. Control for each variable that is a direct cause of the
exposure, or of the outcome, or of both.
E → S2 (6.10)
E → A → S2 We already discussed the first two suggestions just
E → A → S1 S2 listed (e.g., Figure 6.1(e)). The third heuristic rule is the
disjunctive cause criterion, and it involves control-
The second path just listed is a directed path from E to ling for measured common causes when the covariate
S2 through A, but S1 is the descendant of A, an interven- is presumed to affect both exposure and outcome (e.g.,
ing variable, so regressing S2 on both E and S1 would Figure 6.2(a)). Covariates that directly cause either
partially close this path, which results in overcontrol exposure or treatment—but not both—would not be
bias. Variable S1 is also a collider that blocks the third, intervening variables in directed paths between expo-
undirected path in Equation 6.10, but treating S1 as a sure and treatment. Controlling for such variables may
covariate opens this path, which induces a spurious close biasing paths due to unmeasured confounders,
association between E and S2 through the unmeasured among other possibilities. Exercise 7 asks you to gener-
confounder; that is, collider bias. Whether expected ate two directed acyclic graphs, one where conditioning
overcontrol bias and collider bias in this example would on a direct cause of exposure (but not outcome) reduces
offset each other or be cumulative is unknown, but is bias, and another where controlling for a direct cause of
it best not to hope that the two different types of bias outcome (but not exposure) reduces bias.
exactly cancel out. Exercise 5 asks you to generate the
conditional independencies implied by the graphs in
Figure 6.3. Exercise 6 asks you to make a DAG where D‑SEPARATION AND BASIS SETS
it would be both appropriate and necessary to specify
S1 as a covariate while still assuming that E is a cause Before we can consider the issues of confounding and
of S2. identification of causal effects in larger graphs, we need
Selecting covariates that properly control for com- a deep dive into the concept of d-separation. Pearl’s
mon cause confounding while avoiding overcontrol bias (2009) d-separation criterion (d is for “directional”)
and collider bias can be greatly assisted by a complete locates conditional independencies in the data; that is,
causal graph for the problem but, alas, such graphs are it tells us which pairs of variables are made independent
not always available. Summarized next are suggestions by controlling for other variables in the graph. Doing so
by VanderWeele (2019) for selecting covariates under blocks the flow of information between a focal pair of
conditions of partial causal knowledge: variables due to indirect causal effects or to common

Pt2Kline5E.indd 89 3/22/2023 3:44:00 PM


90 Specification, Estimation, and Testing

causes (e.g., Figures 6.2(a)–6.2(b)). The same criterion refuted with data. The analogous case for paramet-
also warns against inducing spurious association by ric models occurs when the model has no degrees of
conditioning on a collider or on their descendants (e.g., freedom (df M = 0), which means that the model is as
Figure 6.3(c)). complex as the data the model is supposed to explain.
The d-separation criterion relies on the assumptions (Model df for parametric models is defined in more
listed next (Glymour, 2006): detail in the next chapter.) In fact, a necessary condition
for df M = 0 is that the graph implies no d-separations.
1. Markov assumption, or the representation of Let’s try two examples. Listed in the first column of
every unmeasured common cause by the symbol for a Table 6.1 are the three pairs of nonadjacent variables in
bidirectional edge. Another statement of this assump- Figure 6.4(a) that can be d-separated:
tion is that after controlling for the parents of any vari-
able, that variable has nothing to do with all other vari- 1. The nonadjacent pair X and B are independent,
ables that do not descend from it. given A, the sole intermediary between them. Because
Y is the child of B and thus contributes nothing to their
2. Faithfulness, or the assumption that direct and
association, variables X and B are also independent,
indirect effects of one variable on another do not per- given both A and Y.
fectly cancel each other out (sum to zero); otherwise,
the two variables would be statistically independent 2. The nonadjacent pair A and Y are independent,
despite their causal connection in the DAG (i.e., their given B, their sole intermediary. The same pair is also
association would be unfaithful to their presumed independent of both B and X, which follows from the
causal relation). Markov assumption that controlling for the parent of
Y, or B, shields Y from any the influence of any other
3. Negligible randomness, or the assumption that ancestor, or X. Thus, variables A and Y are unrelated,
the presence versus absence of statistical associations given both B and X.
are not due to sampling error (i.e., large samples are
required). 3. The nonadjacent pair X and Y at the end of the
directed path are rendered independent after condition-
Under the assumptions just listed, the graph only ing on any combination of the intervening variables, A
implies conditional independencies that are generated or B, for a total of three conditional independencies for
under the d-separation criterion (Elwert, 2013). this pair—see Table 6.1.
The d-separation criterion is defined next for pairs of
variables (Glymour, 2006, p. 394), but it applies to sets Not all of the 7 conditional independences in Table
of variables, too: 6.1 for Figure 6.4(a) are themselves pairwise indepen-
dent; that is, they are overlapping as a set, so not all are
RULE 6.1 A pair of variables in a DAG is needed. The basis set for a DAG is the smallest num-
d-separated by a set of covariates C if either ber of conditional independences that imply all others
located by the d-separation criterion. The size of the
1. one of the noncolliders on the path is in C; or basis set equals the number of pairs of nonadjacent vari-
2. there is a collider on the path, but neither the ables that can be d-separated. Any conditional indepen-
collider nor any of its descendants is in C dencies beyond this number are predicted by the basis
set. For example, there are three pairs of nonadjacent
A pair of variables is d-connected (unblocked, open), variables in Figure 6.4(a) that can be d-separated (see
if not every path between them is d-separated; that is, Table 6.1), so the size of the basis set is 3. Thus, a basis
there is at least one unblocked path between them. If set of 3 for this graph can explain all the rest.
a DAG is faithful, d-connectedness implies statistical There is more than a single way to generate a basis
dependence. Pairs of variables in parent–child rela- set for a DAG (Pearl, 2009), but the union basis set
tions (e.g., Figure 6.1(a)) are inherently d-connected. consists of the smallest number of conditional indepen-
In graphs where every pair of variables is connected dencies that are mutually orthogonal and can predict
by an edge, there are no d-separated sets of variables. all possible conditional independencies represented in
Because such graphs imply no conditional independen- the graph (Shipley, 2000). A straightforward method to
cies, they have no statistical implications that can be derive a union basis set is outlined next:

Pt2Kline5E.indd 90 3/22/2023 3:44:00 PM


Nonparametric Causal Models 91

(a) (b)
B

X A B Y X A Y

FIGURE 6.4. Larger directed acyclic graphs.

RULE 6.2 To generate the union basis set, mentioned is considered next. Two paths connect vari-
ables X and B:
1. list each pair of nonadjacent variables in the
graph that can be d-separated; next
X→A→B (6.12)
2. condition on the parents of both variables in each
X→A→D←Y→B
pair
The first (directed) path is open, but the second (undi-
Let’s apply Rule 6.2 to Figure 6.4(a): Three pairs of rected) path is blocked by collider D. Controlling for A
variables can be d-separated, so the size of the union alone closes the directed path between X and B while
basis set is 3. The parents of variables A, B, and Y are, leaving the undirected path between them blocked.
respectively, X, A, and B (variable X is exogenous). Thus, variables X and B are independent, given A.
Thus, the union basis set for the graph consists of the Including the collider D in a conditioning set would
three conditional independencies so designated in open the second path in Equation 6.12, but controlling
Table 6.1. This basis set explains all 7 conditional pos- for A would block the path again. Controlling next for
sible independencies implied by the graph. both A and Y leaves the undirected path closed because
Now let’s consider the larger graph in Figure 6.4(b). Y is not a collider in that path. Finally, variable Y along
The five nonadjacent pairs of variables that can be with both A and D also d-separate the pair X and B. The
d-separated are listed in the first column of Table 6.2. four conditional independencies for the pair X and B
The pair X and Y is independent with no covariates. are listed in Table 6.2 along with the four conditional
This is because every path that connects X and Y, or independencies for the pair X and D.
The two paths between variables B and D in Figure
X→A→B←Y (6.11) 6.4(b) are listed next
X→A→D←Y

is blocked by a collider, B or D. The same pair remains TABLE 6.1. Conditional Independencies Located
by the d-Separation Criterion in Figure 6.4(a)
independent if A is the sole covariate. This is because
and the Union Basis Set
controlling for variable A does not open any path in
Equation 6.11 that is closed by a collider. Condition- Nonadjacent Part of union
ing on any combination of colliders B or D without pair Conditional independencies basis set
also controlling for A would open at least one path in X, B X⊥B|A Yes
Equation 6.11 and thus induce a spurious association. X ⊥ B | A, Y
But including A in a conditioning set that includes B
or D would close the path again. All five conditional A, Y A⊥Y|B
in­dependencies for the pair X and Y just described are A ⊥ Y | B, X Yes
listed in Table 6.2.
X, Y X⊥Y|A
The logic for generating conditional independencies Yes
X⊥Y|B
for the pair X and B and for the pair X and D in Figure
X ⊥ Y | A, B
6.4(b) is similar, so only the first pair of variables just

Pt2Kline5E.indd 91 3/22/2023 3:44:01 PM


92 Specification, Estimation, and Testing

B←A→D (6.13) union basis set for Figure 6.4(b) consists of the five
B←Y→D conditional independencies so designated in Table 6.2.
For example, variable X in the figure has no parents, the
Controlling for both of their common causes, A and Y, parents of B are variables A and Y, so the conditional
renders the pair B and D independent. The same pair independence in the table
will also be independent after conditioning on A, Y, and
X because controlling for the parent of B, or A, isolates X ⊥ B | A, Y
B from its only other ancestor, or X. Both conditional
independencies just described for this pair are listed in belongs to the union basis set for the graph. The five
Table 6.2. conditional independencies of the union basis set in
Finally, both paths that connect variables A and Y in Table 6.2 predict all 17 possible such independencies
Figure 6.4(b), or for Figure 6.4(b).

A→B←Y (6.14) Testable Implications


A→D←Y
Each d-separation statement in a causal DAG cor-
responds to a prediction that is potentially testable in
are blocked by colliders, B or D. Thus, the variables
sample data. If all variables are continuous in a lin-
A and Y are independent without covariates. The same
ear model, each conditional independence matches
pair is also independent given X, the parent of A. Both
up with a partial correlation that should equal zero. In
implied independencies, one unconditional and the
models with independent error terms, the whole set of
other conditional, are also listed in Table 6.2. The
vanishing partial correlations represents all testable
implications of the model. For example, the union basis
set for Figure 6.4(a) described in the previous section
TABLE 6.2. Conditional Independencies
implies for continuous variables the vanishing partial
Located by the d-Separation Criterion in Figure
6.4(b) and the Union Basis Set
correlations listed next (see also Table 6.1):

Nonadjacent Part of union rXB•A = rAY•BX = rXY•B = 0


pair Conditional independencies basis set
X, Y X⊥Y Yes If any of the predictions just listed is appreciably incon-
X⊥Y|A sistent with the data, the associated implied conditional
X ⊥ Y | A, B independence is not supported. This outcome may help
X ⊥ Y | A, D to diagnose misspecification in a particular part of the
X ⊥ Y | A, B, D graph. Suppose we observe in a sample that rXB•A =
.40, which contradicts the prediction of zero. In Figure
X, B X⊥B|A 6.4(a), there is an indirect causal pathway from vari-
X ⊥ B | A, D able X to B through the intermediary A, but perhaps
X ⊥ B | A, Y Yes the omission of a direct causal effect from X to B is a
X ⊥ B | A, D, Y mistake, among other possibilities for respecification.

X, D X⊥D|A
X ⊥ D | A, B GRAPHICAL IDENTIFICATION CRITERIA
X ⊥ D | A, Y Yes
X ⊥ D | A, B, Y Two basic strategies for identifying a total causal effect,
covariate selection and instrumental variable methods,
B, D B ⊥ D | A, Y Yes were discussed earlier for simple graphs with single out-
B ⊥ D | A, X, Y comes (e.g., Figures 6.1(e) and 6.1(f)). For larger graphs,
A, Y A Y
there are methods based on the concept of d-separation
for finding a sufficient (adjustment) set of covariates
A⊥Y|X Yes
that identify a particular causal effect, either total or

Pt2Kline5E.indd 92 3/22/2023 3:44:01 PM


Nonparametric Causal Models 93

direct. Controlling for a sufficient set removes spuri- ing paths just listed and thus identify the total effect
ous components by closing biasing (back-door) paths, of X on Y are (A) and (B, D). Because variable A is a
leaving just the causal relation. It also does not open common cause in both back-door paths, regressing Y
paths that are otherwise closed by colliders. If there is on both X and A will close all paths in Equation 6.15.
no sufficient set, the corresponding causal effect may Regressing Y on X, B, and D has a more complicated
be identified through instrumental variable methods. If effect: Conditioning on D will close the first path in
neither method can identify a causal effect, then adding Equation 6.15, where D is an intervening variable. But
variables, such as proxies for unmeasured confounders, variable D is a collider in the second path, so condition-
may yield a solution. ing on D alone would open that path, but including B as
The graphical rules considered next can be automati- other covariate will close the path again because B is a
cally applied by computer tools that analyze directed common cause.
acyclic graphs. This is a benefit for larger graphs, where Each conditioning set for the total effect of X on Y
locating by eye all possible paths between any pair of in Figure 6.5(a) just described, (A) and (B, D), is also
variables is prone to error. Computer assistance also a minimally sufficient (adjustment) set for which no
helps the researcher to avoid selecting inappropriate proper subset is itself a sufficient set. (A proper sub-
covariates that fail to remove common cause confound- set does not include the original set.) This means that
ing, block indirect causal pathways, or introduce new the covariates in each of the two minimally sufficient
biases such as those due to conditioning on a collider sets just listed are enough to block all biasing paths
or controlling for an instrument. Next, we consider the between X and Y. The larger adjustment set (A, B, D)
back-door criterion for identifying total causal effects is also sufficient, but it is not minimally sufficient.
(the sum of all direct and indirect effects) through This is because two of its proper subsets include the
covariate selection, the single-door criterion for iden- minimally sufficient sets (A) and (B, D). This example
tifying just direct effects also through covariate selec- shows that there can be multiple sets of covariates,
tion, and graphical rules to locate instruments for direct each of which identifies that same causal effect. This
effects. is a kind of overidentifying restriction in that esti-
mates based on different sufficient sets should all be
equal, if the graph is correct. Because the back-door
Back‑Door Criterion
criterion is sufficient, there is no need to identify any
An adjustment set of covariates (which may include of the individual direct or indirect effects that make up
none) is sufficient for identifying a total effect, if that the total effect of X on Y. There is a separate graphical
covariate set meets the back-door criterion (Pearl, method to identify direct effects through the single-
2009): door criterion, which is described shortly. Exercise 8
asks you to verify that (A, B) and (B, X) are each mini-
RULE 6.3 A set of covariates C is sufficient to mally sufficient sets that identify the total effect of D
identify the total causal effect of X on Y if on Y in Figure 6.5(a).
Exogenous variables have no ancestors; thus, there
1. no variable in C is a descendent of X; and
are no back-door paths when the causal variable is
2. C blocks all biasing paths between X and Y exogenous. In this case, the sufficient set has no covari-
ates. For example, variable A in Figure 6.5(a) is exoge-
Presented in Figure 6.5(a) is a DAG where variable nous and has two indirect effects on Y through variables
X has both a direct effect on Y and an indirect effect X, D, and E. There are no biasing paths between A and
on Y through the intervening variable E. Thus, the Y, so the sufficient set is empty. Thus, regressing Y on A
total effect of X on Y consists of the direct and indirect with no covariates would estimate the total effect of A
effects just described. In the figure, there are a total of on Y. Variable E in the figure has a direct effect on Y, but
two biasing paths between X and Y. They are there are no indirect effects, so the total effect of E on
Y is the same as the direct effect. In this case, applying
X←A→D→E→Y (6.15) either the back-door criterion for the total effect or the
X←A→D←B→Y single-door criterion for the direct effect would gener-
ate the same minimally sufficient adjustment sets for
Two sufficient sets of covariates that block all the bias- variables E and Y as explained next.

Pt2Kline5E.indd 93 3/22/2023 3:44:01 PM


94 Specification, Estimation, and Testing

(a) Original graph (b) Modified graph


X ┴ Y | B, E
X ┴ Y | A, D, E

X X

A A

D E Y D E Y

B B

FIGURE 6.5. An original directed acyclic graph (a). The graph modified by deleting the direct effect from X to Y in the origi-
nal graph and sets of variables that d-separate X and Y in the modified graph (b).

Single‑Door Criterion the original graph (Figure 6.5(a)). Exercise 9 asks you
to show that the minimally sufficient adjustment sets
In a linear model with unidirectional causal effects (B, X) and (D, X) each identify the direct effect of E on
among continuous variables, the single-door criterion Y in the original graph.
tells us whether the coefficient for a particular direct
effect is identified by covariate selection, and which
Instrumental Variables
variables should serve as the conditioning set (Pearl,
2009): Another way to identify direct effects in linear models
involves instruments (e.g., Figure 6.1(f)). It can be dif-
RULE 6.4 A set of covariates C is sufficient to ficult to determine proper instruments for a particular
identify the direct causal effect of X on Y if causal variable, especially in large models with many
1. no variable in C is a descendant of Y; and
variables and potential instruments that are endogenous
instead of exogenous. Summarized next is a rule based
2. C d-separates X and Y in the modified graph on d-separation for locating instruments in a graph (van
formed by deleting the direct effect between them der Zander et al., 2015):
in the original graph
RULE 6.5 The variable Z is an instrument relative to
Look back at Figure 6.5(a). We just demonstrated the direct effect of X on Y, if
that the total effect of X on Y is identified through
covariate selection. Is the direct effect of X on Y also 1. Z correlates with X (i.e., the two variables are
identified? Assuming a linear model for continuous d-connected); and
outcomes, we can apply the single-door criterion. First, 2. Z is d-separated from Y in the modified graph
we delete the directed edge from X to Y from the origi- formed by deleting the direct effect of X on Y from
nal model. Figure 6.5(b) is the graph with the modifica- the original graph
tion just described. There are no descendants of Y. Two
minimally sufficient sets d-separate variables X and Y In Figure 6.6(a), it is pretty easy to see that exogenous
in the modified graph, (B, E) and (A, D, E); that is, both variable A is an instrument for causal variable X even
without applying Rule 6.5. Is endogenous variable B
X ⊥ Y | B, E another potential instrument for X? Yes: Variable B is
X ⊥ Y | A, D, E both correlated with X in the original graph and also
d-separated from Y in Figure 6.6(b), the modified graph
are true in the modified graph. This means that the without the direct effect from X to Y. It is also true in
coefficient for the direct effect of X on Y is identified in the same modified graph that variable A is d-separated

Pt2Kline5E.indd 94 3/22/2023 3:44:01 PM


Nonparametric Causal Models 95

from Y, which is consistent with the intuition that A is which can also be expressed as
a proper instrument. Thus, two instruments, variables
A and B, for identifying the causal effect of X on Y in (A | B) ⊥ Y
the presence of unmeasured confounding are located
by Rule 6.5. Presented next is a formal definition for a conditional
Now let’s consider a more difficult graph. In Figure instrument (Pearl, 2009):
6.6(c), neither variable A nor B is a proper instrument:
Variable A affects Y through an intermediary other than RULE 6.6 The variable Z is a conditional instrument
X, and variable B has a direct effect on Y. But there is an relative to the direct effect of X on Y, if
alternative: A conditional instrument, which is formed
1. Z correlates with X controlling for W;
by partialling out one variable from another such that
the residual is a valid instrument. For example, Fig- 2. W d-separates Z from Y in the modified graph
ure 6.6(d) is the modified graph, where the direct effect formed by deleting the direct effect of X on Y from
from X to Y is deleted. Variable A is a proper instrument the original graph; and
but only after controlling for B; that is, the instrument is 3. W is the set of variables (excluding X) that are not
A | B, or the residuals after regressing A on B in a linear descended from Y
model. This conditional instrument A | B is (1) related
to X after controlling for B and is (2) d-separated from In Figure 6.6(d), the set of variables that are nondescen-
Y in the modified graph, or dants of Y is W = (B) for variable Z = A. After controlling
for B, variable A both correlates with X and is unrelated
A⊥Y|B to Y; thus, A | B is a valid instrument. See Chalak and

Instruments (A, B)

(a) Original graph (b) Modified graph


A┴Y
B┴Y
A B A B

X Y X Y

Conditional instrument (A|B)

(c) Original graph (d) Modified graph


A┴Y|B
A B A B

X Y X Y

FIGURE 6.6. An original directed acyclic graph (a). The graph modified by deleting the direct effect from X to Y in the origi-
nal graph and independencies of both instruments, A and B, with Y (b). An original directed acyclic graph, where neither A
nor B is a proper instrument. (d) The graphs modified by deleting the direct effect from X to Y in the original graph and implied
independence of the conditional instrument A | B with Y.

Pt2Kline5E.indd 95 3/22/2023 3:44:02 PM


96 Specification, Estimation, and Testing

White (2011) for more information about conditional Exercise Fitness


instruments and other types of extended instruments
constructed from original variables in a model. Illness
In actual data sets, instruments or partial instru-
Hardy Stress
ments located by, respectively, Rule 6.5 or 6.6, should
have appropriate statistical properties. For example,
the sample covariance between a potential instrument FIGURE 6.7. A nonparametric path model of illness
and a causal variable should be appreciably high, and expressed as a casual directed acyclic graph.
results of falsification tests—although not absolutely
conclusive—should not indicate an obvious problem.
That is, selection of an instrument by a graphical rule
does not guarantee adequate statistical properties of 3. There are no unmeasured common causes of fit-
that instrument in actual samples. ness, stress, and illness.
Pearl (1995) described an equation-based alterna- 4. Unmeasured causes of the three outcomes just men-
tive to graphical identification rules called do-calcu- tioned are unrelated to both exercise and hardy (and
lus, which predicts the effects of interventions, given also unrelated to their latent common cause).
a causal graph, and evaluates whether a particular
causal effect is identified. Briefly, it simulates physical There are a total of 4 direct effects in Figure 6.7. The
interventions in a causal graph through the mathemat- graph has “room” for 5 additional direct effects (e.g.,
ical operator do(x), which sets the value of a causal from exercise to illness) such that there are no causal
variable X to equal a constant x. Next, the original loops, but all such “missing” direct effects were hypoth-
graph is modified by replacing certain relations in the esized by Roth et al. (1989) to be zero. There are also
model with the constant X = x while the rest of the two indirect pathways in the figure, such as
model is not altered. The joint probability distribution
associated with the modified graph reflects the pos- Exercise → Fitness → Illness
tintervention distribution of the variables, given do(x).
Expressions for postintervention distributions are Because direct effects of exercise and hardy on illness
manipulated to determine whether causal effects can are hypothesized to equal zero, both indirect effects
be estimated, given the data defined by the preinter- in the figure represent the total effects of exercise and
vention joint probability distributions associated with hardy on illness.
the original graph. There are R packages that imple- Listed in Table 6.3 are the annotated script files
ment do-calculus algorithms, such as causaleffect and R packages used to analyze Figure 6.7. All syn-
(Tikka, 2022; Tikka & Karvanen, 2017), but I believe tax and output files can be downloaded from this
graphical identification rules are more accessible for book’s website. In the first analysis, packages dagitty
applied researchers. (Textor et al., 2021), ggm (Marchetti et al., 2020),
and C­ auseAndCorrelation (Shipley, 2017) are used to
generate all possible implied conditional independen-
cies and the smaller basis set that predicts all the rest.
DETAILED EXAMPLE
That basis set has 5 elements, which equals the num-
ber of “missing” direct effects (predicted to be zero)
Presented in Figure 6.7 is the causal DAG of a man-
in the figure. That df M = 5 for the parametric version
ifest-variable path model based on Roth et al. (1989).
of Figure 6.7 is not a coincidence as is explained in the
Hypotheses implied by the graph are listed next:
next chapter. The dagitty package is used in the second
analysis to locate covariates (adjustment sets), instru-
1. Exercise and hardy (i.e., hardiness) have a common
ments, or partial instruments that identify each direct
unmeasured cause.
or total effect.5 The results of both analyses generate
2. Both exogenous variables just mentioned affect ill-
ness through a single intermediary, fitness for exer- 5A graphical, browser-based version, DAGitty, is freely available
cise and stress for hardiness. at http://dagitty.net/

Pt2Kline5E.indd 96 3/22/2023 3:44:02 PM


Nonparametric Causal Models 97

Listed in Table 6.4 are all 15 possible conditional independencies implied by Figure 6.7. For example, the variables exercise and stress are d-separated by a total of three different sets of variables, among which the conditioning set that consists of hardy as the sole variable belongs to the union basis set. The remaining four pairs of nonadjacent variables in the table each have three conditioning sets that d-separate them, only one of which for each pair belongs to the union basis set of five implied conditional independencies for the whole graph. Because all variables in the parametric version of Figure 6.7 are continuous and the model is linear, each element of the union basis set corresponds to a vanishing partial correlation that can be compared with the sample coefficient. One of these is

rFS•EH = 0

which predicts that the partial correlation between fitness and stress equals 0 when controlling for both exercise and hardy (see Table 6.4). If the sample correlation rFS•EH is appreciably different from zero, then a specification error for this pair of variables is indicated.

Reported in the second and third columns of Table 6.5 are, respectively, minimally sufficient sets of covariates in standard (OLS) regression analysis or instruments in instrumental variable regression analysis that identify each direct or indirect effect in Figure 6.7. For example, no covariates are needed to estimate the direct effect of exercise on fitness because there are no biasing paths between this pair of variables. Although no unmeasured common cause of exercise and fitness is assumed in the figure, the same direct effect can be estimated by specifying hardy or stress as an instrument for exercise. (You should verify this statement using Rule 6.5 for instruments.) Altogether, there are three different ways to identify the direct effect of exercise on fitness. Through similar logic, there are also three estimators for the effect of hardy on stress (see the table).

In Table 6.5, the direct effect of fitness on illness is identified by three different minimally sufficient sets of covariates, each composed of a single variable: stress, exercise, or hardy. Conditioning on any of the individual variables just listed satisfies Rule 6.4 for the single-door criterion. The same direct effect is also identified through the specification of a conditional instrument for fitness, either

Exercise | Stress    or    Hardy | Stress

both of which satisfy Rule 6.6 for conditional instruments. Altogether there are five estimators for the direct effect of fitness on illness, three based on covariate selection and two based on conditional instruments. Estimators for the direct effect of stress on illness (5 in total) are generated with similar reasoning (see the table).

There are a total of three different estimators in Table 6.5 for each indirect pathway in Figure 6.7. For example, regressing illness on exercise and hardy closes all back-door paths between exercise and illness, which satisfies Rule 6.3 for the total effect of exercise on illness. That total effect in the figure is the indirect effect of exercise on illness through fitness. A different covariate, stress instead of hardy, also meets the back-door criterion, and the coefficient for exercise in the analysis just described is the second estimator.

TABLE 6.3. Script Files and R Packages for Analyses of a Nonparametric Path Model of Illness

Analyses                                              Script file         R packages
1. Locate all implied conditional independencies      roth-basis-set.r    dagitty, ggm,
   and generate the union basis set                                       CauseAndCorrelation
2. Identify direct and total effects through          roth-identify.r     dagitty
   covariates, instruments, or partial instruments

Note. Output files have the same name except the extension is ".out."
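The first analysis in Table 6.3 can be approximated with the illustrative graph object defined earlier. The sketch listed next is not the book's roth-basis-set.r script, and the basis set reported by dagitty is not guaranteed to match the union basis set produced by CauseAndCorrelation:

# One testable implication per "missing" edge in Figure 6.7
impliedConditionalIndependencies(fig67)
# A smaller basis set that implies the remaining independencies
impliedConditionalIndependencies(fig67, type = "basis.set")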

The third estimator of the same indirect effect features specification of the conditional instrument

Hardy | Stress

for the causal variable exercise (Rule 6.6). Through analogous logic, the indirect effect of hardy on illness through stress also has a total of three estimators—see Table 6.5.

For correctly specified parametric models analyzed in large, representative samples, values of multiple estimators for the same causal effect should be reasonably similar; otherwise, a possible specification error is indicated. What is considered "reasonably similar" should be guided by the researcher's domain knowledge, including about the psychometric characteristics of measures for causal or outcome variables and about the magnitudes of "typical" effect sizes in a particular research area. For example, whether estimates of 5.50 versus 6.25 based on different methods for the same unstandardized direct effect are appreciably different or ignorably similar depends on context.

TABLE 6.5. Adjustment Sets (Covariates) or Instruments that Identify Causal Effects in a Nonparametric Path Model of Illness

Effect                          Adjustment set    Instruments
Exercise → Fitness              —                 Hardy
                                                  Stress
Hardy → Stress                  —                 Exercise
                                                  Fitness
Fitness → Illness               Stress            Exercise | Stress
                                Exercise          Hardy | Stress
                                Hardy
Stress → Illness                Fitness           Exercise | Fitness
                                Exercise          Hardy | Fitness
                                Hardy
Exercise → Fitness → Illness    Hardy             Hardy | Stress
                                Stress
Hardy → Stress → Illness        Exercise          Exercise | Fitness
                                Fitness

Note. Adjustment sets are minimally sufficient.
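Entries like those in Table 6.5 can be generated automatically by dagitty. Listed next is an illustrative sketch for two of the effects in the table, again using the fig67 graph object defined earlier:

# Minimally sufficient adjustment sets for the direct effect of fitness
adjustmentSets(fig67, exposure = "fitness", outcome = "illness",
    effect = "direct")
# Instruments, including conditional instruments, for the same effect
instrumentalVariables(fig67, exposure = "fitness", outcome = "illness")
# Adjustment sets for the total effect of exercise on illness
adjustmentSets(fig67, exposure = "exercise", outcome = "illness")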

TABLE 6.4. All Implied Conditional Independencies and the Union Basis Set for a Nonparametric Path Model of Illness

Implied independence    Conditioning sets            Part of union basis set
Exercise ⊥ Stress       Hardy                        Yes
                        Hardy, Fitness
                        Hardy, Fitness, Illness
Exercise ⊥ Illness      Fitness, Stress              Yes
                        Hardy, Fitness
                        Hardy, Fitness, Stress
Hardy ⊥ Fitness         Exercise                     Yes
                        Exercise, Stress
                        Exercise, Stress, Illness
Hardy ⊥ Illness         Fitness, Stress              Yes
                        Exercise, Stress
                        Exercise, Fitness, Stress
Fitness ⊥ Stress        Exercise, Hardy              Yes
                        Exercise
                        Hardy
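Each implied independence in Table 6.4 corresponds to a vanishing partial correlation that can be checked against the data. A minimal sketch with dagitty's localTests() function follows, assuming that S is an R matrix that holds the sample covariances for the Table 4.3 data (the object name S is a placeholder):

# Test every implied vanishing partial correlation; the row for
# fitness _||_ stress | exercise, hardy corresponds to rFS•EH = 0
localTests(fig67, sample.cov = S, sample.nobs = 373, type = "cis")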
SUMMARY

According to Swanson and Hernán (2013), "Causal inference relies on transparency of assumptions and on triangulation of results from methods that depend on different sets of assumptions" [emphasis added] (p. 373). In nonparametric SEM, graphical rules can be applied to locate covariates or instruments that identify causal effects in the model. Covariates address measured confounding bias by deactivating or blocking back-door (biasing) paths, and instruments adjust causal variables for their associations with unmeasured confounders. Thus, the researcher is alerted to the possibility that the same causal effect has multiple estimators, some of which could be based on different methods and assumptions (i.e., covariates vs. instruments). If there are no covariates or instruments that identify a particular causal effect, the need to measure additional variables is indicated. Still other graphical rules generate testable implications in the form of conditional independencies between certain pairs of variables predicted by the graph. All the graphical rules just mentioned can be automatically applied by computer tools,


which helps to avoid error in larger graphs. Additional contributions of Pearl's (2009) nonparametric approach to SEM, including novel perspectives on mediation, are considered in Chapter 20. Parametric structural equation models are introduced in the next chapter.

LEARN MORE

Angrist and Krueger (2001) describe the history and applications of instrumental variable methods and address the challenges of finding good instruments, Elwert and Winship (2014) elaborate on several forms of collider bias, and VanderWeele (2019) discusses covariate selection when a complete causal graph is unavailable.

Angrist, J. D., & Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4), 69–85.

Elwert, F., & Winship, C. (2014). Endogenous selection bias: The problem of conditioning on a collider variable. Annual Review of Sociology, 40(1), 31–53.

VanderWeele, T. J. (2019). Principles of confounder selection. European Journal of Epidemiology, 34(3), 211–219.

EXERCISES

1. Modify Figure 6.1(e) to show that P does not measure all aspects of the confounder UC. Generate all paths between X and Y in the modified graph and describe the anticipated effect of conditioning on P.

2. Explain the difference between an instrument and a confounder.

3. Draw a DAG that implies the same conditional independencies as Figures 6.2(a) and 6.2(b).

4. Succinctly define confounding bias, overcontrol bias, and collider bias.

5. Generate the conditional independencies implied by all graphs in Figure 6.3.

6. Draw a DAG for the variables in Figure 6.3 where S1 is a proper covariate when regressing S2 on E.

7. For the variables X (cause), Y (outcome), and C (covariate), draw a DAG where controlling for C as a cause of X, but not Y, reduces bias. Draw a second DAG where controlling for C as a cause of Y, but not X, reduces bias.

8. Show that (A, B) and (B, X) are minimally sufficient sets for the total effect of D on Y in Figure 6.5(a).

9. Find minimally sufficient sets of covariates that identify the coefficient for the direct effect of E on Y in Figure 6.5(a).



7

Parametric Causal Models

A parametric structural equation model represents a commitment to operational definitions of theoretical
variables and specification of expected functional forms for causal effects. Along with the estimator, or the
statistical method used to generate estimates from sample data, a parametric model also reflects assumptions
about specific probability distributions, especially for outcome (endogenous) variables. In path analysis, it
helps when planning a study to evaluate a nonparametric model to determine whether each causal effect
is identified. If not, then the problem should be addressed, such as by adding covariates or instruments to
identify target causal effects. In most types of SEM analyses, it is a parametric model that is ultimately tested
(fitted to sample data). Thus, researchers must specify a parametric model that both reflects their hypotheses
and is consistent with analysis requirements, including identification. Accordingly, the main goals of this
chapter are to (1) describe graphical symbolism for parametric models, (2) explain the correspondence
between model diagrams and analysis options, and (3) articulate assumptions represented by model dia-
grams. In this chapter we assume linear path models with continuous variables unless otherwise stated.

MODEL DIAGRAM SYMBOLISM

Recall that error terms for outcome variables are usually not represented in diagrams for nonparametric models (e.g., Figure 6.1). In contrast, symbols for error terms are typically included in diagrams for parametric models. For both types of causal models, though, the hypothesis of correlated errors is explicitly represented in the diagram; otherwise, error terms are assumed to be independent. The graphical symbolism used for parametric models is introduced next.

McArdle and McDonald (1984) described the reticular action model (RAM), a system based on matrix algebra for representing parametric structural equation models with just three matrices: S (Symmetric) for variances and covariances among exogenous variables including the error terms; A (Asymmetric) for direct causal effects; and F (Filter) for designating the observed variables from among all variables represented in the model. A benefit of a notational system with so few matrices is that computation times in maximum likelihood estimation, a widely used simultaneous method described in Chapter 9, can be reduced, especially for large models (von Oertzen & Brick, 2014). But given the speed and large memory capacities of modern personal computers, any reduction in processing time due to the notational system per se could be slight.

The McArdle–McDonald RAM system also includes a set of graphical symbols for model diagrams. A special feature of full RAM graphical symbolism is that each and every model parameter is represented with its own symbol. This property has pedagogical value because it helps beginners to avoid mistakes when translating a diagram to the syntax of a particular computer tool. A drawback to RAM symbolism is that it takes up more space (diagrams are bigger) compared with more compact symbolism that is probably seen


in most published SEM studies. But one of the costs of saving space in the diagram is that more compact symbolism can "hide" (does not explicitly represent) certain parameters that the computer must estimate. In contrast, model diagrams in RAM symbolism are both transparent and complete; that is, what you see is what you get (all model parameters are depicted).

The issues just considered explain why model diagrams for the detailed examples in Parts II and III of this book are rendered in full RAM graphical symbolism; otherwise, more compact notation is used in model diagrams to save space. This practice reflects the reality that researchers with more experience in SEM probably do not need full RAM symbolism, but transitioning from full RAM to more compact graphical symbolism can benefit novices.

Part of RAM graphical symbolism is nearly universal because one sees these symbols in most model diagrams even when they are based on more compact symbolism. This includes the graphical representation of

1. Observed variables (indicators) with squares or rectangles.
2. Proxies for latent variables, such as common factors with multiple indicators, with circles or ellipses.
3. Direct causal effects by a line with a single arrowhead (e.g., →).
4. Covariances (in the unstandardized solution) or correlations (in the standardized solution) between exogenous variables with a curved line with two arrowheads.

The symbol described in item 4 designates an unanalyzed association (i.e., covariance, correlation). Although such associations are estimated by the computer, they are unanalyzed in the sense that no prediction is put forward about why the two causal variables are related. The counterpart of the symbol (see item 4) for nonparametric models (a bidirected arc) has the same general meaning, including the designation of correlated error terms.

Full RAM graphical symbolism includes two special symbols that may be omitted in more compact model diagrams. They are [two symbols not reproduced here]. The first symbol just listed designates the analysis of means along with covariances, and its use is explained in Chapter 9. The second symbol, an arc with arrowheads at each end that exit and reenter the same variable, depicts the variance of an exogenous variable. Because causes of exogenous variables are not included in model diagrams, exogenous variables are generally considered free to vary and perhaps covary, too (if they are not assumed to be independent). In contrast, endogenous variables are not free to vary because they are represented as determined by their presumed direct causes and by their unmeasured causes (i.e., error terms), and the model, as a whole, explains why endogenous variables covary with each other or with the exogenous variables. To summarize, model parameters in full RAM symbolism are represented with just three symbols when means are not analyzed:

[variance arc]  [covariance arc]  [directed path]    (7.1)

Model Parameters

The RAM graphical symbols in Equation 7.1 correspond to the types of parameters defined next:

RULE 7.1 Parameters of structural equation models when means are not analyzed include
1. Variances and covariances of the exogenous variables
2. Direct effects on endogenous variables from other variables

That's it. The simple rule just stated applies to all parametric structural equation models described in this book (including path models) with continuous outcomes and just a covariance structure. Thus, we will refer to Rule 7.1 many times in later chapters.

A model parameter can be free, fixed, or constrained depending on its specification. A free parameter is estimated by the computer with the data. A fixed parameter is specified to equal a constant. This means that the computer "accepts" the constant as the estimate regardless of the data. For example, the hypothesis that variable X has no direct effect on Y corresponds to the specification that the coefficient for the path X → Y is fixed to equal zero. It is common in SEM to test hypotheses by specifying that a previously fixed-to-zero parameter becomes a free parameter, or vice versa. Results of such analyses may indicate whether to respecify a model by making it more complex (an effect is added—a fixed parameter becomes a free parameter)


or more simple (an effect is dropped—a free parameter becomes a fixed parameter).

A constrained parameter is estimated by the computer within some restriction, but it is not fixed to equal a constant, and the restriction typically involves the relative values of other parameters. For example, an equality constraint means that the estimates of two or more parameters are forced to be equal. Suppose that an equality constraint is imposed on the two direct effects in the same model listed next:

X → Y    and    W → Y

The equality constraint just described simplifies the analysis because only one coefficient is needed rather than two. In a multiple-group analysis, where the model is simultaneously fitted to the data from ≥ 2 samples, a cross-group equality constraint forces the computer to derive equal estimates of the same parameter across all groups. This specification corresponds to the null hypothesis that the parameter is equal in all populations from which the samples are drawn.

A proportionality constraint forces one parameter estimate to be some fraction of the other. For instance, the coefficient for the direct effect of X on Y may be forced to equal 3 times the value of the direct effect of W on the same outcome. An inequality constraint forces an estimate to be either less than or greater than the value of a specified constant. The requirement that the value of an unstandardized coefficient must be at least 5.0 is an example of an inequality constraint. The imposition of proportionality or inequality constraints generally requires knowledge about the relative magnitude of effects, but such knowledge is relatively rare. A nonlinear constraint imposes a nonlinear relation between two or more parameter estimates. For example, the value of one estimate may be forced to equal the square of another. Nonlinear constraints are part of some advanced SEM methods or analyses that are described later in the book.
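In lavaan syntax, the constraints just described can be imposed through parameter labels, as in the hypothetical sketch listed next (X, W, Y, and the data frame dat are placeholder names):

library(lavaan)
m.fixed <- ' Y ~ 0*X + W '   # path X → Y fixed to zero
m.equal <- ' Y ~ b*X + b*W ' # equality constraint: one label, one estimate
m.other <- '
  Y ~ b1*X + b2*W
  b2 == 3*b1                 # proportionality constraint
  b1 > 5                     # inequality constraint
'
# fit <- sem(m.other, data = dat)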
Model Degrees of Freedom

The potential size of a parametric model in terms of its free parameters is limited by the number of observations, which is not the sample size (N). Instead, it is literally the number of entries in the sample covariance matrix in lower diagonal form. This number can be calculated with a simple rule:

RULE 7.2 If v is the number of observed variables in the model, the number of observations equals v(v + 1)/2 when means are not analyzed

If v = 5 observed variables are in the model, the number of observations is 5(6)/2, or 15. This count (15) equals the total number of variances (5) and unique covariances below the main diagonal (10) in the data matrix (e.g., Table 4.3). With v = 5, the greatest number of free parameters that could be estimated by the computer is 15. Fewer parameters can be estimated in a simpler model, but not more than 15. The number of observations has nothing to do with sample size. If five variables are measured for 100 or 1,000 cases, the number of observations is still 15. Adding cases does not increase the number of observations; only adding observed variables can do so.1

The difference between the number of observations and the number of free parameters is the model degrees of freedom, or

df M = p – q    (7.2)

where p is the number of observations (Rule 7.2) and q is the number of free parameters (Rule 7.1). A general requirement for identification in SEM is that df M ≥ 0. This is because a model with more free parameters than available observations (df M < 0) is not amenable to empirical analysis because there are infinite sets of estimates (e.g., Equation 3.1). If you tried to estimate a model with negative degrees of freedom, an SEM computer program would likely terminate its run with error messages. Otherwise, a parametric model with df M < 0 must be respecified. One option is to decrease the number of free parameters by imposing constraints, such as equality constraints or fixing a previously free parameter to a constant that could include being equal to zero. Adding observed variables to the model would increase the number of observations, but new free parameters for those variables would be needed, too, so the net effect should be to increase df M. The option just mentioned comes too late, if the data are already collected. Any respecification of a model with negative degrees of freedom should respect theory.

Most identified structural equation models with no degrees of freedom (df M = 0) will not only perfectly fit

1 "Number of observations" is confusingly used in some SEM computer programs to refer to N, the sample size.


the data in a particular sample (e.g., Equation 3.3), but they will also perfectly fit any arbitrary sample data matrix for the same variables. In contrast, models with positive degrees of freedom do not generally have perfect fit. This is because df M > 0 allows for the possibility of model–data discrepancies. Raykov and Marcoulides (2006) described each degree of freedom as a dimension along which a model can potentially be rejected. Thus, retained models with greater degrees of freedom have withstood a greater likelihood for rejection. This idea underlies the parsimony principle: Given two models with similar fit to the data, the simpler model is preferred, assuming that the simpler model is theoretically plausible.

DIAGRAMS FOR CONTRACTED CHAINS AND ASSUMPTIONS

Presented in Figure 7.1(a) is the parametric version of a contracted chain with two measured variables rendered in full RAM graphical symbolism. The presumed total effect of X on Y is represented as a directed path, just as in a nonparametric model. But the total effect is simply the linear causal effect of X on Y for continuous variables in a parametric model. The variance of exogenous variable X is a free parameter, so it is designated in the figure with the RAM symbol for a variance. The second variance parameter is not for Y—which is endogenous and, thus, not free to vary—but instead for its disturbance, or error term, designated as D, which represents variation in Y not explained by X.

FIGURE 7.1. Diagrams for a contracted chain in full McArdle–McDonald reticular action model (RAM) graphical symbolism (a) versus more compact symbolism (b). [Diagram not reproduced; panels: (a) RAM, (b) Compact.]

There are four general sources of disturbance variance (Bollen, 2002):

1. Systematic variation due to ≥ 1 unmeasured causes of the corresponding outcome.
2. Inherent random variation in just about any system or individual variable.
3. Random measurement error of the kind estimated in reliability analyses, such as time sampling error, content sampling error, or interrater error.
4. Misspecification of the functional form of the causal effect, such as linear when the true relation is curvilinear.

That disturbances reflect, in part, omitted causes is consistent with their representation in RAM symbolism as unmeasured exogenous variables—see Figure 7.1(a). Disturbance variances must be estimated by the computer, so they count as free model parameters, too (Rule 7.1).

The path D → Y in Figure 7.1(a) represents the direct effect of all omitted causes (and error, too) on the corresponding endogenous variable Y. The numerical value (1) that appears in the figure next to the path is a scaling constant that assigns a metric to the disturbance. This specification is necessary because disturbance variance is latent, and latent variables need scales before the computer can estimate anything about them. The scaling constant is also called the unstandardized residual path coefficient or unit loading identification (ULI) constraint (1 is the "unit"). Any positive constant could be specified as a scaling constant, such as 2.2, but the value of "1" tells the computer to exactly partition the total (observed) variance of Y into two nonoverlapping (orthogonal) parts, variance explained by X and unexplained variance, or the disturbance variance.

Some, but not all, SEM computer tools automatically specify and scale error terms in structural equation models. In lavaan syntax, for example, the command

Y ~ X

instructs the computer to regress variable Y on X and to automatically scale the disturbance as depicted in Figure 7.1(a) (Rosseel et al., 2023). The same command also defines the variances of X and the disturbance variance of Y as free parameters. But in syntax for Amos Basic, the user must both explicitly name the error term and include the constant as part of that name, such as

Y = X + DY(1)
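The lavaan specification just shown can also be written with every free variance parameter stated explicitly, which mirrors the full RAM diagram. A minimal sketch follows, assuming a hypothetical data frame dat with variables X and Y; fixed.x = FALSE tells lavaan to treat the variance of X as a free parameter, consistent with the description above:

library(lavaan)
fit1 <- sem(' Y ~ X ', data = dat, fixed.x = FALSE)  # disturbance scaled automatically
# Equivalent specification with the variance parameters written out
m2 <- '
  Y ~ X
  X ~~ X    # variance of exogenous X (free)
  Y ~~ Y    # disturbance variance of Y (free)
'
fit2 <- sem(m2, data = dat, fixed.x = FALSE)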


Amos Graphics likewise requires the user to manually enter the constant 1 in a dialog box after drawing onscreen the symbol for a disturbance to fix the unstandardized residual path coefficient to 1 (Arbuckle, 2021). Check the documentation for your SEM computer tool to see how metrics for error terms are specified (i.e., program defaults vs. user specification).

Two general requirements for identification can now be stated:

RULE 7.3 The necessary but insufficient requirements for identification of parametric models are
1. df M ≥ 0
2. Every unmeasured variable (including error terms) must be assigned a scale (metric)

The first requirement in Rule 7.3 is the counting rule (see Rules 7.1–7.2), and the second requirement is the scaling rule. Use of full RAM symbolism helps the researcher to check both requirements just listed; specifically, counting free parameters is part of deriving df M (Equation 7.2), and explicit scaling constants in the diagram remind the researcher about the need to scale unmeasured variables. Various types of parametric structural equation models have additional sufficient requirements that guarantee identification, but only if Rule 7.3 is satisfied.

Figure 7.1(b) is the diagram for a contracted chain shown using more compact symbolism that omits the symbols for variance parameters (for X, D), the scaling constant (1), and the circle that represents the disturbance as a latent variable compared with full RAM symbolism in Figure 7.1(a). What remains is a less informative model diagram that shows just the basics—X causes Y, and Y has a disturbance—but it also saves space compared with RAM symbolism. McDonald and Ho (2002) described even more conventions for representing error terms in structural equation models. They emphasized there is no single "right" way to draw a model diagram, and that is true. It is just as important to know that the use of more complete versus more compact graphical symbolism should not matter for researchers who know Rules 7.1–7.3.

Assumptions of Figure 7.1 are listed next and discussed afterward:

1. Scores on exogenous variable X are perfectly reliable (i.e., rXX = 1.0).
2. Causal directionality is correctly specified (i.e., X → Y), and the functional form of that relation is strictly linear.
3. All unmeasured causes of Y are independent of the measured cause X; that is, there are no unmeasured common causes of both variables.

The assumption that measured exogenous variables in path models are unaffected by random measurement error is analogous to the requirement in standard multiple regression that predictors have no measurement error. The reason is the same, too: Such variables do not have error terms (see Figure 7.1), so random error has no place to go. In contrast, endogenous variables in path models—and criterion variables in regression analysis—have error terms that can "absorb" measurement error. In bivariate regression, measurement error in just the criterion Y does not affect the unstandardized regression coefficient, but (1) its standard error increases, (2) the value of R² decreases, and (3) the absolute value of the standardized coefficient decreases as measurement error in Y increases (Williams et al., 2013). Measurement error in just the predictor X (but not in Y) negatively biases regression coefficients.

But the effect of measurement error in both X and Y is more difficult to anticipate. For example, if measurement error in both variables is independent, then the general effect is negative bias; that is, absolute population regression coefficients are underestimated. But if measurement error is shared between X and Y, then regression results can overestimate population coefficients; that is, the bias is positive. How the consequence just mentioned (positive bias) comes about is demonstrated in Topic Box 15.1 if you would like to look ahead, but it is a myth that measurement error always results in negative bias. For all these reasons, unmodeled measurement error in exogenous variables can severely distort the results (Cole & Preacher, 2014; Williams et al., 2013), especially if multiple exogenous variables are highly correlated (Kenny, 1979). The good news is that the assumption of zero measurement error in exogenous variables can be evaluated in the data by estimating the reliability of scores in the researcher's sample, a best practice (Chapter 4). There is also a method in SEM that controls for measurement error in individual observed exogenous or endogenous variables that is described in Chapter 15, but we need to cover more statistical modeling basics in the meantime.

The assumption that the relation between X and Y in Figure 7.1 is linear can also be evaluated in the data
Pt2Kline5E.indd 104 3/22/2023 3:44:05 PM


Parametric Causal Models 105

(Chapter 4). If the observed relation is appreciably curvilinear, then the analysis can be adjusted to relax the linearity requirement. Two options for doing so—polynomial regression and nonparametric regression—are described as advanced topics in Appendix 7.A. But the remaining assumptions of Figure 7.1 cannot be directly addressed through analysis alone. For example, directionalities of causal effects are generally assumed, not actually tested in SEM. That is, there is little, if anything, from analysis results that could either verify or refute hypotheses about directionality represented in the researcher's model. The main reason is because equivalent structural equation models feature (1) the same variables and value of df M as the researcher's original model but (2) the directions of some presumed causal effects are reversed or replaced over the two models. Also, (3) the two models, original and equivalent, will have identical—not just similar but exactly the same—fit to the data. We will revisit the issue of equivalent models several times in later chapters, but they are a major but often unacknowledged validity threat in many, and perhaps most, SEM studies.

Summarized next are additional reasons why causal directionality is assumed but not directly tested in SEM:

1. Most SEM studies are based on cross-sectional designs where all variables are measured at the same occasion (i.e., no temporal precedence). In such designs, the only thing that supports directionality specification is argument, that is, the quality of the ideas behind the hypothesis that X causes Y instead of the reverse or that the two variables mutually influence each other in a feedback relation, or reciprocal causation.

2. Measurement of variables at different times in longitudinal designs provides temporal precedence: The hypothesis that X causes Y would be bolstered if X were actually measured before Y. But temporal precedence is no guarantee. This is because the covariance between X and Y could still be relatively large even if Y causes X and the effect (X) is measured before the cause (Y)—see Bollen (1989, pp. 61–65).

3. Whether the coefficient for a presumed causal effect is significant or not significant at some arbitrary level, such as p < .05, neither confirms nor disconfirms the corresponding directionality hypothesis. In a large sample, the coefficient could be significant yet trivial in magnitude, but in a small sample, a coefficient that indicates a meaningful effect size could fail to be significant.

A bottom-line summary is that SEM is not generally a technique for causal discovery. This means that if given a true model, then SEM could be applied to estimate the directions, magnitudes, and precisions of causal effects. But this is not how researchers generally use SEM. Instead, a causal model is hypothesized, and then the model is fitted to the data under the assumptions just outlined. If the assumptions are invalid, then so are analysis results. This is why Pearl (2000) reminded us that "causal assumptions are prerequisite for validating any causal conclusion" (p. 136). See Kline (2023) for more information about assumptions in SEM.

CONFOUNDING IN PARAMETRIC MODELS

Endogeneity in the diagram for a contracted chain is represented in Figure 7.2(a) by the symbol for a covariance that connects measured cause X with the disturbance for Y. For two reasons, Figure 7.2(a) is not identified. First, df M = –1, so the model violates Rule 7.3. Exercise 1 asks you to prove the fact just stated. Second, the biasing (back-door) path between X and D in the figure cannot be closed because D is treated as a latent variable (e.g., Figure 6.1(d)). Endogeneity in parametric models can be induced by the conditions listed next and discussed afterward (Antonakis, 2017; Bollen, 2012):

1. An unmeasured common cause of X and Y (i.e., a confounder).
2. Random measurement error in X (i.e., rXX < 1.0).
3. Reciprocal causation, or X and Y mutually cause each other (i.e., they are both endogenous variables) in a feedback loop.
4. Autoregressive errors, or where X is a lagged version of Y and errors persist over the two variables.
5. Spatial autoregression, which occurs when scores of each case are influenced by those from nearby, or spatially adjacent, cases.

Unmeasured confounders can be addressed through covariate selection or through instrumental variables methods. For example, represented in Figure 7.2(b) is the specification of a proxy (P) for an unmeasured


confounder of variables X and Y. The causal variable X is endogenous in this model because P is assumed to cause X (and also Y). Exercise 2 asks you to specify an equivalent version of Figure 7.2(b) where the proxy P is not assumed to cause variable X but P is still a cause of Y. Instrumental variable methods, depicted in Figure 7.2(c), address both unmeasured confounders and measurement error in the exogenous variable: When variable X with its implied correlation with the disturbance is replaced by X̂Z in 2SLS regression, random measurement error in X is also removed from X̂Z under the standard assumptions for instruments (Chapter 6). Note that variable X is represented as endogenous in Figure 7.2(c), but not all researchers include instruments in model diagrams. How to model hypothesized reciprocal causation is the subject of Chapter 19.

FIGURE 7.2. Endogeneity in a contracted chain (a). Identification of the model by controlling for a proxy (P) of a common unmeasured cause (b) and through instrumental variable (Z) methods, which also addresses measurement error in variable X (c). All diagrams shown in compact symbolism. [Diagram not reproduced; panels: (a) Confounding, (b) Proxy, (c) Instrument.]

A statistical model is autoregressive if future values are predicted by past values for the same variable. In this case, variables X and Y in Figure 7.2(a) would measure the same thing but at different times. That is, these variables are repeated measures, where scores on Y at a later time are predicted by scores on X from an earlier time. Errors of repeated measures may not be independent, especially if the measurement occasions are close together in time. It is relatively straightforward in SEM to estimate autoregressive error for repeated measures; indeed, it is one of the great strengths of the technique for analyzing longitudinal data, as you will see in Chapter 21 about latent growth models. Spatial autoregression concerns the possibility that measurements of an outcome variable in different physical locations are not independent, especially if those locations are in close proximity. The idea of a disease hotspot or cluster as a geographic region with elevated incidence, prevalence, or transmission rate of an infectious illness exemplifies the idea of spatial autoregression. It is beyond the scope of this book to cover this topic further, but see Guliyev (2020), who analyzed spatial panel models of COVID-19 spread in mainland China for an example.

MODELS WITH CORRELATED CAUSES OR INDIRECT EFFECTS

The parametric model in Figure 7.3(a) represents the hypothesis that Y is a common outcome of two correlated exogenous variables, X and W. The diagram offers no causal hypotheses about why X and W covary, so their association is unanalyzed (estimated but not explained). In computer analysis, estimates for effects of X and W would be adjusted for their sample covariance. Some SEM computer tools assume by default that measured exogenous causes of the same outcome are all pairwise correlated. For example, the command in lavaan listed next

Y ~ X + W

specifies Figure 7.3(a) because the computer will automatically estimate effects of X and W while controlling for their observed covariation. The same command also tells the computer that variances of X, W, and the disturbance for Y are all free parameters. If exogenous variables X and W are hypothesized to be independent—that is, there is no covariance path between this pair of variables in the diagram—then the additional command in lavaan

X ~~ 0*W

fixes the covariance between X and W to zero while leaving both of their variances as free parameters.
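Putting the two commands together, a complete lavaan sketch for Figure 7.3(a) might read as follows (dat is a hypothetical data frame; drop the X ~~ W line, or fix it to zero, for independent causes):

m.corr <- '
  Y ~ X + W
  X ~~ W    # unanalyzed association between the correlated causes
'
fit <- sem(m.corr, data = dat, fixed.x = FALSE)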


FIGURE 7.3. Models with correlated causes (a) and both direct and indirect effects (b). All diagrams shown in compact symbolism. [Diagram not reproduced; panels: (a) Correlated, (b) Indirect.]

Although nonparametric causal models allow for the possibility of interaction between multiple causes of the same outcome (e.g., Figure 6.2(c)), this is not true for parametric models. In Figure 7.3(a), for example, any interaction between causes X and W is assumed to be zero; that is, the linear effect of X on Y does not change across the levels of W, and vice versa. The hypothesis of interaction between causes of the same outcome is the prediction for conditional causality, for example, that the linear effect of X on Y changes across the levels of W, and vice versa. But Figure 7.3(a) with no specification for interaction predicts unconditional causality such that the linear effect of X on Y remains constant over the levels of W, and vice versa. How to represent the hypothesis of interaction in parametric path models is described in Appendix 7.A, and methods to estimate conditional causal effects in SEM are outlined in Chapters 12 and 20. Exercise 3 asks you to list the assumptions for Figure 7.3(a).

You should know that the consequences of measurement error in models with correlated causes are complex and difficult to anticipate (e.g., Bollen, 1989, chap. 5). This is because bias can be negative or positive depending on whether measurement error is correlated over multiple predictors or shared between predictors and outcome and also on sample covariances among all variables. Although there is a relatively simple technique to disattenuate (increase in absolute value) individual regression coefficients for measurement error (Osborne & Waters, 2002), it assumes that error is independent over all variables; otherwise, disattenuated coefficients can be inaccurate (Williams et al., 2013). Thus, it is no small potential advantage that measurement error can be explicitly modeled in SEM.

In Figure 7.3(b), variable X is represented as having direct and indirect effects on outcome Y. The indirect pathway is

X → M → Y

where M is presumed to be an intervening or intermediate variable through which effects from cause X are transmitted to outcome Y. Intervening variables are endogenous (i.e., X → M) while at the same time they are also causal variables (i.e., M → Y). Variable M also has a dual role concerning reliability: As an outcome of X, variable M has a disturbance, which allows for measurement error in M. But as a cause of Y along with X, the standard regression assumption for observed variables is that both X and M are unaffected by measurement error. There is no problem if reliabilities for scores on intervening variables are very high (i.e., there is no contradiction in requirements about measurement error). Figure 7.3(b) also assumes that (1) variable X directly affects Y above and beyond its indirect effect; (2) variables X and M do not interact in their strictly linear effects on Y; and (3) there are no omitted confounders for any pair of variables among X, M, and Y.
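In lavaan, the model in Figure 7.3(b) can be specified with labeled coefficients so that the product estimator of the indirect pathway is defined explicitly with the ":=" operator. The sketch below uses hypothetical names:

m.ind <- '
  M ~ a*X           # X → M
  Y ~ c*X + b*M     # direct effect of X, plus M → Y
  ind := a*b        # indirect effect of X on Y through M
  total := c + a*b  # total effect of X on Y
'
fit <- sem(m.ind, data = dat)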
Effects of measurement error and omitted confounders in models with indirect causal effects are also complex and tough to fully anticipate. For example, with no measurement error in causal variable X in Figure 7.3(b), measurement error in intermediary variable M results in negative bias for estimates of the indirect causal effect, while violation of the requirement for no confounding between M and outcome Y leads to positive bias. Violation of both assumptions at once can result in overestimation, underestimation, or in rare cases no bias in estimates for the indirect effect (Fritz et al., 2016). Results of computer simulations by Fritz et al. (2016) suggested that correcting for only one source of bias, such as for measurement error only, when both sources of bias are present can lead to even greater bias than in uncorrected results.

Perhaps you noticed that I did not use the term "mediational model" for Figure 7.3(b) nor "mediator" for variable M in the figure. As explained in Topic Box 6.1 and elaborated in Topic Box 7.1 about a timing criterion for putative cause, mediator, and outcome variables, the hypothesis of mediation depends on strong assumptions with persuasive rationales, not just on a model diagram with an indirect causal pathway. For


TOPIC BOX 7.1
Mediation: Timing Criterion

The strong causal hypothesis of mediation depends on a time-ordered relation between variables, where the cause must precede the mediator, which in turn must affect the outcome at some later time (Little, 2013). Research designs with temporal precedence where cause, mediator, and outcome are measured in this order but at different times, such as longitudinal designs, are consistent with estimating time-ordered effects. But most mediational studies are based on cross-sectional designs with no temporal precedence, so how does the issue of mediation as a time-ordered phenomenon align with the use of a cross-sectional design?

Tate (2015) described the Hyman–Tate Criterion (see also Hyman, 1955), which requires a conceptual time-ordering of cause, mediator, and outcome. This means that regardless of study design (even if cross-sectional), the cause should precede the mediator just as the mediator should precede the outcome in theoretical time. For example, does the cause conceptually exist before the mediator? Does the mediator conceptually exist before the outcome? Unless the answer to both questions just posed is affirmative, there is no interventional sequence among the three variables (Tate, 2015). The same rationale about conceptual timing should also rule out as implausible any reverse causal pattern among presumed cause, mediator, and outcome. For example, specification of a demographic variable as a cause is incompatible with the possibility that the mediator affects the cause (e.g., attitude as a mediator cannot affect age as a cause). It should likewise be argued that the mediator sets into motion necessary elements for observing change in the outcome.

The same conceptual timing criterion also implies that the presumed mediator must be theoretically amenable to influence by the cause. For example, variables conceptualized as traits, or relatively stable characteristics like general cognitive ability or anxiety as a consistent personality attribute, cannot mediate between a cause and outcome. It is only variables conceptualized as states, or potentially changeable characteristics like anxiety as a momentary experience or motivation to perform well in a particular context, that could potentially mediate anything. Likewise, group-level characteristics, such as norms, cohesiveness, or stereotypes, cannot function as mediators because they are basically constants that apply over all people in particular groups.

Addressing the conceptual timing criterion is a matter of argument or logic, not analysis. This is because the results of statistical mediation analysis cannot "prove" that a particular variable is actually a mediator or distinguish between alternative causal models, some of which do not involve mediation at all. An example is spurious mediation, where a presumed mediator is merely correlated with another unmeasured variable that is the actual mediator. Another possibility occurs when the presumed mediator is actually a consequence of the outcome. Both scenarios just mentioned are typically indistinguishable by statistical analysis from mediation that happens just as hypothesized by the researcher (Fiedler et al., 2011). MacKinnon et al. (2000) made the related point that mediation and both confounding and suppression effects are statistically identical and can be distinguished only by rationale, not analysis. The challenges just mentioned overlap with the problem of equivalent models, which fit the data just as well as the researcher's preferred model while making opposing causal claims. In Chapter 20 we will revisit the issue of equivalent models in mediation analysis.


the same reason, Kenny (2018) noted that mediation is not statistically defined; instead, estimators of indirect effects can be used to evaluate presumed mediation, but analysis results for models with indirect causal effects are not automatically interpreted as evidence for mediation. Altogether the statistical and conceptual requirements for estimating mediation are demanding but rarely acknowledged in empirical studies (Kline, 2015; Pek & Hoyle, 2016).

RECURSIVE, NONRECURSIVE, AND PARTIALLY RECURSIVE MODELS

All more complex parametric path models can be "assembled" from the elemental models in Figures 7.1–7.3. There are two basic kinds, recursive and nonrecursive. Recursive models are the most straightforward and have two essential features: all causal effects are unidirectional, and their disturbances are independent. (All models considered to this point are recursive.) The model in Figure 7.4(a) is recursive as just defined. Exercise 4 asks you to calculate df M for this model and describe the hypothesized relations between Y1 and Y2.

Nonrecursive models have causal (feedback) loops where ≥ 2 endogenous variables are specified as causes and effects of one another, directly or indirectly. In nonparametric form, such models correspond to directed cyclic graphs. Figure 7.4(b) is an example of a nonrecursive parametric model with reciprocal causation depicted as

Y1 ⇄ Y2

which specifies that variables Y1 and Y2 have simultaneous effects on each other. Examples of studies in which hypotheses about causal loops were tested include Schmitt and Bedeian (1982), who hypothesized that life satisfaction and job satisfaction reciprocally cause each other such that satisfaction with work boosts satisfaction with life, and vice versa. Another is Stanovich (1986), who described expected reciprocal causation between reading and vocabulary: Reading contributes to vocabulary growth, and knowledge of more words benefits reading comprehension.

FIGURE 7.4. Examples of recursive, nonrecursive, and partially recursive models with two different patterns of error correlation. All diagrams shown in compact symbolism. [Diagram not reproduced; panels: (a) Recursive, (b) Nonrecursive, and partially recursive with (c) bow-free pattern and (d) bow pattern; each panel includes X1, X2, Y1, and Y2.]
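A lavaan sketch of a nonrecursive model like Figure 7.4(b) is listed next. Whether such a model is identified depends on special requirements taken up in Chapter 19; here X1 and X2 play the role of instruments, and all names are hypothetical:

m.loop <- '
  Y1 ~ X1 + Y2    # Y2 → Y1
  Y2 ~ X2 + Y1    # Y1 → Y2 (completes the feedback loop)
  Y1 ~~ Y2        # disturbance covariance (correlated errors)
'
fit <- sem(m.loop, data = dat)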


Models with causal loops may or may not have disturbance covariances (correlated errors). Recall that correlated errors in nonparametric models represent hypotheses about unmeasured common causes (e.g., Figure 6.3(c)), and the same is true in parametric models. For example, variables Y1 and Y2 in Figure 7.4(b) are specified as causes of each other, and the additional specification

D1 ↔ D2

says that Y1 and Y2 are expected to share ≥ 1 unmeasured common causes. Though it makes sense that variables involved in reciprocal causation might also be affected by overlapping unmeasured confounders, the inclusion of a disturbance covariance requires explicit justification, just as for any other model specification. It is also possible to specify reciprocal causation where the disturbances are independent (Rigdon, 1995); thus, correlated errors are an option in nonrecursive models, but they are not mandatory.

There is another type of path model, one that has unidirectional effects and disturbance covariances; two examples of this type are presented in Figures 7.4(c) and 7.4(d). Some authors call these models nonrecursive (Bollen, 1989), whereas others use the term partially recursive (Tsai et al., 2006). But more important than the label for these models is the distinction made in the figure and also by Brito and Pearl (2003): Partially recursive models with a bow-free pattern of disturbance correlations can be treated in the analysis just like recursive models. A bow-free pattern means that correlated errors are restricted to pairs of endogenous variables without direct effects between them—see Y1 and Y2 in Figure 7.4(c).

In contrast, partially recursive models with a bow pattern of disturbance correlations must be treated in the analysis as nonrecursive models. A bow pattern means that a disturbance covariance occurs between a pair of endogenous variables with a direct effect between them, such as Y1 and Y2 in Figure 7.4(d). The problem is that the combination of a direct effect and correlated disturbances implies a back-door path that cannot be closed through covariate selection in this model. For example, the two paths between Y1 and Y2 in the figure are

Y1 → Y2
Y1 ← D1 ↔ D2 → Y2

The second path just listed is a biasing path that involves two latent confounders (D1, D2), so it cannot be closed by controlling for covariates. But the causal path between Y1 and Y2 can be identified in the instrumental variable method, a point elaborated next.

Recursive models and partially recursive models with no causal loops can be represented with directed acyclic graphs, which means that all the graphical identification rules described in the previous chapter can be applied to path models like those in Figures 7.4(a) and 7.4(c)–7.4(d). For example, computer tools that analyze directed acyclic graphs can be used to determine whether a particular causal effect in a recursive or partially recursive path model can be identified through covariate selection or the use of instruments. Nonrecursive models with causal loops like Figure 7.4(b) correspond to directed cyclic graphs, and graphical identification rules for such graphs are not as well developed. Some methods convert a directed cyclic graph with causal loops to a directed acyclic graph with no causal loops before applying a method for the latter type of graph (Spirtes, 1995). But there is relatively little computer support for analyzing directed cyclic graphs compared with directed acyclic graphs. Models with causal loops also have special assumptions that generally cannot be verified in the data. Chapter 19 deals with how to manage special challenges when analyzing models with causal loops.

We can now state a general rule for parametric path models:

RULE 7.4 Recursive models or partially recursive models with bow-free patterns of disturbance covariances that satisfy Rule 7.3 are identified

Recall that Rule 7.3 concerns necessary but insufficient requirements (i.e., df M ≥ 0, each unmeasured variable is scaled). Rule 7.4 is sufficient, so path models that also meet it are in fact identified.

Identified path models—or any type of identified structural equation model—with just as many observations as free parameters (df M = 0) are just-identified (just-determined), and identified models with more observations than free parameters (df M > 0) are overidentified (overdetermined). A structural equation model can be underidentified in two ways: (1) It fails Rule 7.3 because df M < 0, or (2) although df M ≥ 0, some free parameters are underidentified because there is not enough information to estimate them but others are identified. In the second case, the whole model is


considered nonidentified, even though df M ≥ 0. Unless nonrecursive models are specified in very particular ways, they can wind up not identified as just described. Thus, a more general definition is that an underidentified model is one for which it is not possible to uniquely estimate all of its free parameters (Kenny, 2011).

DETAILED EXAMPLE

Represented in Figure 7.5 with full McArdle–McDonald RAM graphical symbolism is a parametric path model based on Roth et al. (1989). The data collected in a sample of N = 373 university students are summarized in Table 4.3. The hypotheses were summarized in the previous chapter for the nonparametric version of the same path model (Figure 6.7). With v = 5 observed variables in the figure, there are 5(6)/2, or 15 observations available for the analysis. Free parameters include

1. A total of 5 variances (2 measured exogenous variables, exercise and hardy, and 3 error terms for endogenous variables, fitness, stress, and illness).
2. A single covariance (1) between exercise and hardy.
3. A total of 4 coefficients for direct effects on endogenous variables.

The total number of free parameters is thus 10, so df M = 5, which exactly matches the number of implied conditional independencies in the union basis set for the same model (Table 6.4). Detailed analysis of this model is described in Chapters 8–10.

FIGURE 7.5. A parametric path model of illness. [Diagram not reproduced: exercise → fitness → illness and hardy → stress → illness, with the covariance exercise ↔ hardy and disturbances DF, DS, and DI, each scaled by the constant 1.]
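A lavaan specification consistent with Figure 7.5 is sketched next as a preview; the annotated scripts for this example are described in later chapters, so the fragment here is illustrative only (roth.data is a placeholder name):

m.illness <- '
  fitness ~ exercise
  stress  ~ hardy
  illness ~ fitness + stress
  exercise ~~ hardy   # covariance between the exogenous variables
'
# With fixed.x = FALSE, the q = 10 free parameters are 4 path
# coefficients, 1 covariance, 2 exogenous variances, and 3
# disturbance variances, so df M = 15 - 10 = 5.
# fit <- sem(m.illness, data = roth.data, fixed.x = FALSE)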

SUMMARY

Parametric structural equation models are closer to the analysis than nonparametric models in at least two ways: Observed variables in parametric models should directly correspond to measured variables in an extant or planned data set, and specific forms of functional relations between causal and outcome variables, such as linear versus curvilinear effects for continuous variables, should also be represented in the model. The diagram of a parametric model is also a communication medium in that a complete diagram is basically a set of visual instructions about how to specify the model in computer syntax. Every model parameter, free or fixed (e.g., scaling constants), is represented in diagrams based on McArdle–McDonald RAM graphical symbolism, which can also help researchers who are learning about SEM to better understand the analysis. There are other, more compact graphical symbolisms that save diagram space, but some model parameters or analysis requirements may not be explicitly shown compared with full RAM symbolism. This should not pose a problem for more experienced researchers who understand the connections between model parameters, the data, and analysis details. In the next chapter we begin the analysis of the ongoing detailed example for the recursive path model of illness in Figure 7.5. Let the good times roll.


LEARN MORE

Jaccard and Jacoby (2020) address how to bridge the gap between theoretical concepts and predictions and their representation in causal or mathematical models that can be tested, McDonald and Ho (2002) describe options for graphical symbolism in parametric structural equation models, and Tate (2015) explains the conceptual timing criterion in studies of mediation.

Jaccard, J., & Jacoby, J. (2020). Theory construction and model-building skills: A practical guide for social scientists (2nd ed.). Guilford Press.

McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7(1), 64–82.

Tate, C. U. (2015). On the overuse and misuse of mediation analysis: It may be a matter of timing. Basic and Applied Social Psychology, 37(4), 235–246.

EXERCISES

1. Show that df M = –1 for Figure 7.2(a).

2. Specify a variation of Figure 7.2(b) such that proxy P is not assumed to cause X but where the effect of X on Y would be estimated controlling for P.

3. State assumptions for Figure 7.3(a).

4. Calculate df M for Figure 7.4(a) and describe hypotheses represented in the model about the relation between Y1 and Y2.

5. Listed next as (X, Y) are scores for N = 8 cases:

(–9, 14), (–6, 11), (–3, 9), (–1, 9)
(1, 5), (3, 8), (7, 11), (9, 14)

Compute rXY. Next, regress Y on both X and X². Comment on the new results. Plot the regression line with both linear and quadratic trends.2

6. Listed next as (X, W, Y) are scores for N = 8 cases:

(2, 10, 5), (6, 12, 9), (8, 13, 11), (11, 10, 11)
(4, 24, 11), (7, 19, 10), (8, 18, 7), (11, 25, 5)

Regress Y on X and W and report the multiple correlation. Next, regress Y on X, W, and XW without centering the scores. Report the new multiple correlation. From the unstandardized regression equation that includes the product term, generate the simple linear regressions for

MW – SDW, MW, and MW + SDW

and describe the pattern. Plot the simple regression lines.

2 An online polynomial regression plotter is available at https://mycurvefit.com/




Appendix 7.A

Advanced Topics in Parametric Models

Three extensions of parametric path models are described next. They all involve regression concepts about how to analyze categorical predictors and estimate curvilinear or interactive effects of continuous predictors—see Cohen et al. (2003, chaps. 6–8) for more information.

CATEGORICAL EXOGENOUS VARIABLES

The levels of a categorical (nominal) variable represent membership in two or more mutually exclusive and exhaustive groups (g ≥ 2) that are not necessarily ordered from lowest to highest along a single dimension. Group membership is represented both in the analysis and model diagrams with g – 1 code variables that (1) uniquely associate each case with a particular group and (2) can be specified as predictor variables in regression analysis or as exogenous causal variables in SEM. Categorical variables with g = 2 groups can be represented with a single code variable, such as 0 for control and 1 for treatment, but multiple code variables are needed to represent membership in g ≥ 3 groups. An example follows.

Suppose the levels of a categorical exogenous variable are represented in a data file as

1 = Diagnosis A, 2 = Diagnosis B, 3 = Control

The numerical values just listed are arbitrary because any set of three different numbers, such as (3, –2, 17), could be used to associate cases with groups. Thus, results of mathematical operations on arbitrary values for group membership would be meaningless. But code variables are not arbitrary in that (1) they represent all information contained in the whole categorical variable about group membership, and (2) their specification provides a distinct pattern of contrasts between groups. Among the four basic types of code variables—dummy codes, unweighted effect codes, weighted effect codes, and contrast codes—only dummy codes are described next.

TABLE 7.1. Dummy Codes for an Exogenous Categorical Variable with Three Levels

Group          C1   C2
Diagnosis A     1    0
Diagnosis B     0    1
Control         0    0

Presented in Table 7.1 is a set of two dummy codes, C1 and C2, that represent membership in the three groups listed in the table. Note that each group has a unique pattern of values over C1 and C2. The group coded as (0, 0) over the two dummy codes, Control, is specified as the reference group against which each of the two remaining groups is contrasted. For example, code C1 in the table specifies the comparison of the Diagnosis A group with Control, and code C2 specifies the contrast between Diagnosis B and Control. Specification of a different reference group, such as Diagnosis A, is possible by respecifying the dummy codes so that this group is coded as (0, 0) over C1 and C2 instead of Control.

(a) Dummy codes. (b) Linear and quadratic.
FIGURE 7.6. Models for representing membership in one of three groups with dummy codes (a), and both linear and quadratic effects of a quantitative cause (b). All diagrams shown in compact symbolism.

Figure 7.6(a) represents membership in one of the three groups from Table 7.1 as two correlated causes of a continuous outcome Y. The causes in the figure are C1 and C2, the two dummy codes in Table 7.1. The coefficient for the direct effect of C1 estimates the mean contrast on Y between Diagnosis A and Control, and the coefficient for C2 is the mean difference between Diagnosis B and Control, each adjusted for overlap with the other comparison. The specification of a reference group in dummy coding and the choice of dummy codes over other kinds of code variables for categorical exogenous variables should be guided by substantive considerations—see Daly et al. (2016) for an example.
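To make the coding scheme concrete, here is a minimal R sketch with a hypothetical three-group factor (all variable names and scores are illustrative, not from any data set in this book):

# Hypothetical grouping factor and outcome
group <- factor(rep(c("DiagnosisA", "DiagnosisB", "Control"), each = 10))
set.seed(7)
y <- rnorm(30, mean = rep(c(14, 12, 10), each = 10), sd = 2)

# Make Control the reference group coded (0, 0) over both dummies
group <- relevel(group, ref = "Control")
contrasts(group)   # R's default treatment contrasts are the dummy codes C1, C2

# Each dummy coefficient estimates a mean contrast with Control (cf. Table 7.1)
summary(lm(y ~ group))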

CURVILINEAR EFFECTS OF CONTINUOUS EXOGENOUS VARIABLES

There are two basic options to relax the assumption that the effect of a continuous exogenous variable is strictly linear (e.g., Figure 7.1): Polynomial regression and nonparametric regression. In polynomial regression, the researcher creates power terms that represent orders of curvilinear effects. For example, regressing Y on both X and the power term X2—literally, the squared scores on X—adds one bend to the regression line, which corresponds to a quadratic trend. Now any combination of linear or quadratic effects of X on Y in the data can be estimated. Regressing Y on X, X2, and X3 specifies a regression line with two bends, or a cubic trend in addition to linear and quadratic trends, and so on. Curvilinear effects beyond cubic are relatively rare in the behavioral sciences, but cubic relations describe certain learning curves and dose-response curves, where the increase in Y over lower levels of X is slow at first but then increases more rapidly over intermediate levels before further increases in Y decelerate over higher (asymptotic) levels of X.

The unstandardized equation for regressing Y on X and X2 is as follows:

Ŷ = B1X + B2X2 + A    (7.4)

where B1 estimates the linear slope, B2 represents the quadratic slope (i.e., degree of departure from linearity), and A is the intercept. Equation 7.4 is represented as a parametric causal model in Figure 7.6(b) without a mean structure (i.e., there is no intercept in the diagram). The figure represents both the linear and quadratic causal effects of X on Y. Figure 7.7 is a scatterplot for a relation between X and Y that has both a positive linear trend and a negative (decelerating) quadratic trend. The specific equation for the regression line in the figure is

Ŷ = 2.20X – .20X2 + 2.00    (7.5)

The first derivative of Equation 7.5, or

dŶ/dX = 2.20 – .40X    (7.6)

generates the slope of the tangent line for Equation 7.5 at a given value of X. In the figure, the slope of the tangent line at X = 1 is 2.20 – .40(1), or 1.60. The slope of the tangent line at X = 4 is still positive but not as steep, or 2.20 – .40(4) = .60, which is consistent with a negative quadratic trend in the regression line. Exercise 5 asks you to estimate a curvilinear relation in a small data set using polynomial regression. See Loehlin et al. (1990), who analyzed linear and quadratic effects of age in models of extroversion, socialization, and emotional lability among adopted and biological children of adoptive parents.

FIGURE 7.7. Hypothetical scatterplot and regression line (solid) with linear and quadratic trends. The tangent lines (dashed) are shown for the points X = 1, 4.
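As an R illustration of polynomial regression, the sketch below simulates data with a positive linear and a negative quadratic trend, roughly in the spirit of Equation 7.5, and then fits Equation 7.4 (all values are hypothetical):

set.seed(123)
X <- runif(100, 0, 5)
Y <- 2.00 + 2.20 * X - .20 * X^2 + rnorm(100, sd = .5)

linear    <- lm(Y ~ X)            # straight line only
quadratic <- lm(Y ~ X + I(X^2))   # adds the power term, one bend in the line

summary(quadratic)          # estimates B1, B2, and A in Equation 7.4
anova(linear, quadratic)    # does the quadratic trend improve fit?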




A second option is nonparametric regression, which estimates the form of the functional relation between continuous variables exclusively from the data and with no prior assumptions about its shape. Instead, methods like the loess procedure rely on smoothing techniques analogous to how moving averages are computed in time series data. These methods generate regression curves with multiple bends that locally fit the data at each point throughout the length of the scatterplot. In this case, the specification X → Y represents any functional forms of the relation between these two variables, just as in a nonparametric model. Nonparametric regression techniques generally require larger samples compared with parametric methods that assume linearity, and their use is better known in economics than in the behavioral sciences—see Takezawa (2006) for more information.

INTERACTIVE EFFECTS OF CONTINUOUS EXOGENOUS VARIABLES

In Figure 7.3(a), two continuous predictors, X and W, are assumed not to interact in their linear effects on Y. This assumption is relaxed in Figure 7.8(a), shown with no intercept, where Y is regressed on X, W, and the product term XW, which is just the product of scores on X and W for all cases. In moderated multiple regression (MMR), the prediction equation just described is

Ŷ = B1X + B2W + B3XW + A    (7.7)

where coefficients B1 and B2, respectively, estimate unconditional linear effects of X and W and where B3 estimates their conditional linear effect, or a linear × linear interaction. If B3 ≠ 0, then (1) the linear effect of X on Y changes in a gradual (linear) way over the levels of W, and because interaction is symmetrical, it is also true that (2) the linear effect of W on Y changes in a linear way over the levels of X. Note that although the product term in Figure 7.8(a) has a direct effect on Y, it has no causal agency by itself. Instead, the coefficient for the path XW → Y estimates the joint effect of variables X and W (i.e., their interaction).

(a) Regression style. (b) Mplus style. (c) Focal (X) vs. moderator (W). (d) Skeletal.
FIGURE 7.8. Alternative diagrams of parametric models for representing linear interactive effects of two continuous causes. All diagrams shown in compact symbolism.

There are at least three other ways to represent interactive effects of continuous exogenous variables shown with no intercepts in diagrams for parametric path models. They differ in their symbolism, if any, for the product term in Equation 7.7. In Figure 7.8(b), for example, the product term is represented with a closed circle, and covariances between XW and its constituent variables, X and W, are implied, but not explicitly




shown. The coefficient for the direct path from the closed circle to Y in the figure estimates the interactive effect (i.e., B3 in Equation 7.7). This particular graphical symbolism for interaction is associated with Mplus (Muthén & Muthén, 1998–2017).

The product term XW is not explicitly represented in Figure 7.8(c), but its analysis along with variables X and W is assumed. The hypothesis that the linear effect of focal variable X on Y changes over the levels of moderator W is represented in the figure by the path from W that bisects the X-to-Y path at a right angle, and the coefficient for this path corresponds to the interactive effect (B3 in Equation 7.7). Because interaction is symmetrical, an equivalent version of Figure 7.8(c) features switching the labels for X and W with no other change in the diagram. In the equivalent version just described, variable W is represented as focal with X depicted as the moderator. Figure 7.8(d) is a minimalist version of Figure 7.8(c) that also omits the covariance between X and W and the direct effect of W on Y. This very skeletal representation is best viewed as a conceptual diagram that quickly conveys the hypothesis of moderation, but it hides many details compared with the full regression-style model in Figure 7.8(a).

Exercise 6 asks you to conduct a multiple regression analysis of Equation 7.7 in a small data set with scores on X, W, and Y. After you create the product term, next regress Y on X, W, and XW. Although centering scores on the predictors—that is, subtracting the mean from predictor raw scores—is an option in MMR analyses, doing so is not required—see Edwards (2009) for an explanation—so do not center the scores in this analysis. With the full regression equation, you are asked to generate equations for the simple linear regressions of Y on X at three different levels for W, at its mean and at ±1 standard deviation above or below the mean. Your plot should resemble Figure 7.9, which depicts the three simple regression lines just described for the data in Exercise 6. In the figure, the linear relation between Y and X is positive when the level of W is 1 standard deviation below its mean. But there is basically zero linear relation at the mean of W, and the association between Y and X is negative when W is 1 standard deviation above its mean. Thus, the linear relation between Y and X changes in a linear way (from positive to negative) as W increases in value. Aguinis and Gottfredson (2010) describe best practices for estimating interactive effects of continuous variables in moderated multiple regression.

FIGURE 7.9. Simple linear regressions of Y on X at three different levels on W for the data in Exercise 6.
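A compact R sketch of an MMR analysis in the style of Equation 7.7 follows (hypothetical data; the last lines compute simple slopes of Y on X at the mean of W and at ±1 standard deviation, analogous to the lines in Figure 7.9):

set.seed(1)
X <- rnorm(200); W <- rnorm(200)
Y <- .4 * X + .3 * W - .5 * X * W + rnorm(200)

fit <- lm(Y ~ X + W + X:W)   # Equation 7.7; the scores are not centered
B <- coef(fit)

# Simple slopes of Y on X at three levels of the moderator W: B1 + B3 * W
w_levels <- mean(W) + c(-1, 0, 1) * sd(W)
B["X"] + B["X:W"] * w_levels

With a nonzero B3, the three simple slopes differ, and they change in a linear way over the levels of W.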



8

Local Estimation and Piecewise SEM

There are two broad families of estimation methods in SEM, local and global. In local estimation—
also called limited-information methods, partial-information methods, or single-equation
methods—equations for endogenous variables are analyzed one at a time. Conceptually, (1) the full
model is decomposed into a series of submodels, one for each outcome; and (2) presumed causal effects
for each outcome are estimated in separate regression analyses. In global estimation, the whole model
is analyzed all at once; that is, equations for all outcomes and their presumed causes are simultaneously
estimated. Until recently, (1) local estimation in SEM was mainly restricted to manifest-variable path analysis
models, and (2) global estimation was the sole practical choice for models with common factors as proxies
for latent variables. But the availability of relatively new computer tools for local estimation of path models
has expanded analysis options for both types of models just mentioned. This chapter covers local estimation
for manifest-variable path models and the related method of piecewise SEM. Options for local estimation or
global estimation of models with common factors are covered in later chapters.

RATIONALE OF LOCAL ESTIMATION

In local estimation, the researcher conducts a series of regression analyses, one for each outcome variable in the model. A suitable regression technique should be used for a particular outcome. This means that

1. The link function, which associates a linear combination of predictors to a parameter for the distribution of the outcome, is appropriate, given the data type of the outcome (e.g., count, ordinal, or binary data).

2. Distributional assumptions of the technique, if any, should be plausible.

3. Functional forms of statistical associations between predictors and outcomes are properly specified in parametric methods, or nonparametric methods are used that do not assume a particular functional form (Appendix 7.A).

For example, ordinary least squares (OLS), or standard multiple regression (identity link), is for continuous outcomes with linear relations to all predictors. Curvilinear relations can also be estimated if the computer is instructed to include in the equation polynomial terms, such as X2 for quadratic effects of X, along with variable X itself (Appendix 7.A). Normal distributions for observed scores are not required in the OLS method, but distributions of residuals for cases should be normal. Dichotomous outcomes could be analyzed in logistic regression (logit link) or probit regression (probit link), among other options for binary regression, and outcomes that are count variables could be analyzed in Poisson regression (log link), and so on. The point is that there should be a good match between the distributional assumptions and types of functional relations estimated in a particular regression technique and the outcome analyzed with its presumed causes.
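To illustrate the idea of matching the link function to the outcome, here is a minimal R sketch using only base regression functions and simulated (hypothetical) data:

set.seed(8)
dat0 <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat0$y_cont  <- .5 * dat0$x1 - .3 * dat0$x2 + rnorm(200)   # continuous
dat0$y_bin   <- rbinom(200, 1, plogis(dat0$x1))            # binary
dat0$y_count <- rpois(200, exp(.2 * dat0$x1))              # count

m_ols    <- lm(y_cont ~ x1 + x2, data = dat0)                           # identity link
m_logit  <- glm(y_bin ~ x1 + x2, binomial(link = "logit"),  data = dat0)
m_probit <- glm(y_bin ~ x1 + x2, binomial(link = "probit"), data = dat0)
m_pois   <- glm(y_count ~ x1 + x2, poisson(link = "log"),   data = dat0)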





Potential advantages of local estimation are listed next (Bollen, 2019; Lefcheck, 2016; Shipley, 2000):

1. No specialized software is needed. This is because local estimation can be conducted using software for general statistical analyses, such as SPSS, SAS/STAT, or native regression procedures in R, such as function "lm( )" (linear regression models) for continuous outcomes.

2. Local estimation accommodates a wide range of variable types and distributions, and it also allows the use of parametric or nonparametric regression methods. Thus, no universal requirements or assumptions, such as multivariate normality, necessarily apply over the analyses for all outcome variables.

3. Local estimation may be less susceptible to propagation of specification error compared with global estimation. Because each outcome is separately analyzed in local estimation, effects of specification error for one outcome may not spread to different outcomes with correctly specified equations.

4. Global estimation methods are typically asymptotic; that is, they require large samples for precise estimation. They also require statistically identified models; otherwise, estimation may fail. In contrast, local estimation is generally less demanding about sample size, and it may be possible in local estimation to generate estimates for individual outcomes even though the whole model is not identified.

5. The availability of significance tests of overall model fit was once the near-exclusive domain of global estimation, but there are now computer tools that conduct global fit testing in the context of local estimation, too. For observed-variable path models, these global fit tests are generally based on the concept of d-separation and simultaneously test all model-implied conditional independencies for a union basis set (Chapter 6). Such tests can be conducted without estimating a single model parameter; that is, d-separation tests can be conducted prior to local estimation. Doing so is part of the rationale for piecewise SEM, which is described next.

PIECEWISE SEM

Shipley (2000) described the basic logic of piecewise SEM, also called confirmatory path analysis, for recursive path models with no causal loops and no correlated errors. The method was later expanded to include path models with correlated errors, multilevel analysis of path models, comparison of alternative models fitted to the same data, evaluation of path models over multiple samples, and models with proxies for latent variables (Shipley, 2003, 2009; Shipley & Douma, 2020, 2021). Lefcheck (2016) described the freely available piecewiseSEM package for the R computing environment.

The basic steps of piecewise SEM for recursive path models with continuous outcomes with linear relations (Shipley, 2000) are summarized next:

1. The path model is expressed as a directed acyclic graph (DAG).

2. The union basis set of implied conditional independencies is derived. Recall that the union basis set controls for all parents of each nonadjacent pair of variables, or those not directly connected by a path in the graph. It consists of the smallest number of nonoverlapping conditional independence claims that generate all such hypotheses encoded by the DAG.

3. In the data, calculate the value of the Pearson correlation or partial correlation that corresponds to each implied conditional independency in the union basis set. Each of these coefficients is also a correlation residual, or the difference between the observed (sample) correlation and the predicted value, which is zero (i.e., conditional independence). Correlation residuals are measures of local fit because they involve a single pair of variables, not all variables in the model considered at once.

4. Next, for each observed correlation test the null hypothesis that the corresponding parameter equals zero against a nondirectional alternative hypothesis. For example, if rXY•W is the sample coefficient for the implied conditional independence X ⊥ Y | W, then the null and alternative hypotheses are, respectively,

H0: rXY•W = 0 and H1: rXY•W ≠ 0

Depending on the computer tool, the test statistic could be t (N – 2 – c), where c is the number of variables for which we are controlling (c = 1 in this example), or it could be the normal deviate z based on the Fisher transformation for correlation coefficients.



5. Conduct the d-separation (d-sep) test, which is a multivariate significance test of all implied conditional independencies in the union basis set. The test statistic is C (Fisher, 1954), and its formula is

C = –2 Σ ln(pi)    (8.1)

where the sum is taken over the i = 1, . . . , k individual tests, ln is the natural log transformation to base e (approximately 2.7182), and pi is the p value from each of the individual significance tests described in Step 4. The C statistic is distributed over random samples as central chi-square with df = 2k, where k is the number of independence claims in the union basis set. The null hypothesis tested by C is

H0: pk×1 = 0k×1

where pk×1 is the population vector of p values from the tests of all implied conditional independences and 0k×1 is the zero vector of the same dimension where all elements equal zero.

6. If the model fails the d-sep test (e.g., C is statistically significant at p < .05), then the researcher may decide to respecify it. Model respecification is considered in more detail in a later chapter, but for now we will treat a failed d-sep test as indicating a potential problem with the original model. The d-sep test should also be conducted for any respecified model.

7. If the original model or any respecified version is eventually retained, the last step is to locally estimate the equation for each outcome. Path coefficients for presumed causal effects are generally identified through the specification of adjustment (conditioning) sets of covariates in the OLS method or through the specification of instruments in instrumental variables regression, such as the two-stage least squares (2SLS) method.

Two elaborations are needed. First, the question of what is the minimally acceptable absolute correlation residual before concluding that an independence claim is deficient is not clearly specified in the works on piecewise SEM cited to this point. Statistical significance as the sole decision criterion (e.g., reject the model if p < .05 for the C statistic) is problematic because it ignores effect size and power. For instance, in a large sample, the test for a sample partial correlation that is close to zero, such as r = .002, could be statistically significant, but this degree of departure from zero may be seen as negligible.1 A d-sep test based on partial correlations that all differ trivially from zero could likewise be significant in a large sample. But in a small sample due to low power, the d-sep test may fail to be significant even though some sample partial correlations are much larger, such as r = .20, which indicates a 100-fold greater departure from zero than r = .002.

Suggested next is a rule of thumb for interpreting correlation residuals based more on an effect size perspective than on outcomes of significance testing: Absolute discrepancies between observed and predicted correlations that exceed .10 may signal appreciable model–data disagreement. This standard has been suggested for exploratory factor analysis (Pett et al., 2003; Tabachnick & Fidell, 2013) and, in my judgment, it seems like a reasonable guideline when continuous variables are analyzed in SEM, too. Although it is difficult to say how many absolute correlation residuals > .10 is "too many," the more there are, the worse the correspondence between model and data concerning implied conditional independencies. An example of considering effect size when conducting the d-sep test follows.

Suppose in a large sample that p = .001 for the C statistic, so the model "fails" the d-sep test at a conventional level of statistical significance. The researcher inspects the absolute values for the whole set of partial correlations and finds that none exceeds .01. If these degrees of departure from zero are all considered unimportant, then the researcher could decide to ignore the failed d-sep test for the model. That is, the model is not rejected, given the relatively low magnitudes of partial correlations even though the global significance test was failed. Now suppose in a small sample that a model "passes" the d-sep test (i.e., C is not significant) even though some, and perhaps most, absolute partial correlations exceed .10. Low power of the d-sep test could explain this pattern of results. Accordingly, the researcher could decide in this case to ignore the passed d-sep test (i.e., the model is not retained), given the magnitudes of the correlation residuals.

1 An alternative is to test correlation residuals for significance against nonzero values, such as .05 in absolute value or some other reasonably small value—see Thoemmes and Rosseel (2018) for more information and examples of R code that implement this type of test.
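To make Steps 3–5 concrete, here is a minimal R sketch that tests a single implied conditional independence and then combines several p values into the C statistic of Equation 8.1. The data and p values are hypothetical, not those of the detailed example that follows:

# Simulated data in which X and Y are independent given W (hypothetical)
set.seed(42)
N <- 373
W <- rnorm(N)
X <- .5 * W + rnorm(N)
Y <- .5 * W + rnorm(N)

# Step 3: partial correlation of X and Y controlling for W via residuals
r_xy.w <- cor(resid(lm(X ~ W)), resid(lm(Y ~ W)))

# Step 4: t test with df = N - 2 - nc, where nc = 1 control variable
nc <- 1
t_stat <- r_xy.w * sqrt((N - 2 - nc) / (1 - r_xy.w^2))
p_one <- 2 * pt(-abs(t_stat), df = N - 2 - nc)

# Step 5: Fisher's C over k p values (Equation 8.1); df = 2k
p_all <- c(p_one, .80, .35, .60, .15)   # four additional hypothetical p values
C <- -2 * sum(log(p_all))
pchisq(C, df = 2 * length(p_all), lower.tail = FALSE)   # p value for C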




A second issue in piecewise SEM concerns local estimation. The version of piecewiseSEM that I used in the upcoming detailed example permitted the specification of a single regression equation for each outcome, generally the one that includes all parents for a specific outcome (Lefcheck, 2020). There may be other equations for the same outcome that feature different covariates or the inclusion of instruments for particular causal variables, but multiple equations for the same outcome cannot be analyzed in a single execution (run) of the function "psem( )," which is used in the piecewiseSEM package to specify the equations and fit them to the data. But it is not problematic to specify and analyze additional equations for the same outcome in regression analyses conducted outside of the piecewiseSEM package.

DETAILED EXAMPLE

Let's recap the ongoing example to this point: In Chapter 6, we specified as a directed acyclic graph (DAG) the nonparametric version of the Roth et al. (1989) recursive path model of illness in Figure 6.7. We analyzed the graph with a computer tool (analysis 1, Table 6.3) that generated the union basis set, or the smallest number of conditional independencies (5 in total) located by the d-separation criterion that (1) are mutually independent; (2) imply all other conditional independencies; and (3) include the parents of both variables in the conditioning (adjustment) set (Rules 6.1–6.2). The union basis set for the Roth et al. (1989) model is listed in Table 6.4. For example, the graph predicts that the fitness and stress outcomes are independent after controlling for both of their parents, exercise and hardy.

For the same DAG in Chapter 6, we applied graphical identification criteria (analysis 2, Table 6.3) to generate minimally sufficient adjustment sets of covariates to estimate causal effects in ordinary least squares (OLS) regression (Rules 6.3–6.4) or in two-stage least squares (2SLS) regression with instruments or partial instruments (Rules 6.5–6.6). The results of these analyses—see Table 6.5—provide a "roadmap" or analysis plan for local estimation in this chapter.

In Chapter 7, we specified the parametric version of the Roth et al. (1989) path model depicted using full McArdle–McDonald RAM graphical symbolism in Figure 7.5. We determined that the model is identified (Rules 7.3–7.4) and that the degrees of freedom are df M = 5 (Rules 7.1–7.2), which exactly equals the size of the union basis set for this model (5). For continuous variables in linear recursive models, each element of the union basis set corresponds to a vanishing partial correlation that can be compared with a sample partial correlation coefficient. If the model is correctly specified, then the two values—predicted (zero) and observed—should be similar within the bounds of sampling error or effect size (i.e., any discrepancy is considered trivial); otherwise, the model should not be retained.

Listed in Table 8.1 are the analyses, annotated script files, and R packages used in the piecewise SEM analysis of the Roth et al. (1989) parametric path model of illness in Figure 7.5. All files can be freely downloaded from this book's website. The version of the piecewiseSEM package used in this example (Lefcheck, 2020) could not read summary data (i.e., raw data input is required). So, I generated in analysis 1 a raw data file in comma separated values (.csv) format based on the summary statistics in Table 4.3 for the Roth et al. (1989) data set in a sample of N = 373. Specifically, I used the "kd( )" function in semTools (Jorgensen et al., 2022) for the Kaiser–Dickman algorithm (Kaiser & Dickman, 1962) to create raw scores for 373 cases where variable descriptive statistics (covariances, means) exactly match those in Table 4.3 for the actual data. These generated raw scores were specified as the input data for analyses 2–4 in Table 8.1.2

2 Note that "kd( )" generates a different set of raw scores each time it is run, but score descriptive statistics always exactly match those of target covariances and means. Thus, all analysis results described in this chapter are identical in any raw data so generated for this example.

Partial Correlations and the d-Separation Test

For analysis 2 in Table 8.1, I used the psych package (Revelle, 2022) and the piecewiseSEM package to calculate the sample partial correlation and p value for each of the five conditional independence claims in the union basis set and also to conduct the multivariate d-sep test for the whole model. Recall that sample (observed) partial correlations in these analyses are also correlation residuals because they all correspond to predicted values that are zero. The results of analysis 2 are summarized in Table 8.2. There is one absolute correlation that is just .10 or more. This result, –.103 (shown in boldface in the table), is for the pair fitness and stress. The model implies that fitness and stress are independent, given exercise and hardy, but their observed residual differs from zero by what I would consider to be a worrisome amount.
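In outline, analyses 1 and 2 reduce to a few calls like those below. This is only a sketch: the covariance matrix shown is an identity-matrix placeholder rather than the actual Table 4.3 values, and argument details follow the package versions cited in the text:

library(semTools)       # for kd( ), the Kaiser-Dickman algorithm
library(piecewiseSEM)   # for psem( ) and the d-sep test

# Placeholder covariance matrix; the real analysis uses the Table 4.3 values
vars <- c("exercise", "hardy", "fitness", "stress", "illness")
covmat <- diag(5)
dimnames(covmat) <- list(vars, vars)

# Generate raw scores whose covariances exactly match the input matrix
dat <- data.frame(kd(covmat, n = 373))

# One regression equation per outcome, fitted as a piecewise model
model <- psem(
  lm(fitness ~ exercise, data = dat),
  lm(stress  ~ hardy,    data = dat),
  lm(illness ~ fitness + stress, data = dat)
)
summary(model)   # reports the tests of directed separation and Fisher's C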




TABLE 8.1. Analyses, Script Files, and Packages in R for Piecewise SEM Analyses of a Recursive Path Model of Illness

Analysis                                         Script file              R packages
1. Generate unstandardized scores that           roth-generate-scores.r   semTools, lavaan
   exactly match sample covariances, means
2. Estimate and test implied conditional         roth-d-sep-test.r        piecewiseSEM, psych
   independencies
3. Local estimation of causal effects
   a. Covariate adjustment (OLS)                 roth-effects-ols.r       piecewiseSEM
   b. Instruments (2SLS)                         roth-effects-2sls.r      systemfit
4. Bootstrapped standard errors and              roth-bootstrap-ci.r      bmem, sem
   confidence intervals for indirect effects

Note. The external raw data file created in analysis 1 is roth.csv. Output files have the same names except the extension is ".out." Packages semTools and lavaan are also used in analyses 2–4. OLS, ordinary least squares; 2SLS, two-stage least squares.

In Figure 7.5, there is a single backdoor path between fitness and stress:

Fitness ← Exercise ↔ Hardy → Stress

A possible specification error is that fitness and stress are related through paths omitted from the original model. For instance, perhaps fitness affects stress (Fitness → Stress), or vice versa (Stress → Fitness). We will deal with respecification later, but we have already detected a local fit problem.

The value of the C statistic for the d-sep test calculated for these data in the piecewiseSEM package is 19.521 (see the output file for analysis 2, Table 8.1). With a total of 5 conditional independence claims in the union basis set, the degrees of freedom are 5 × 2, or 10. For χ2(10) = 19.521, p = .034, so the model fails the d-sep test at the .05 level. Thus, there is covariance evidence against the model from the perspective of significance testing. The sample size here is not large (N = 373), one absolute correlation exceeds .10 (for fitness and stress), and other absolute correlations are nearly as large (e.g., .089 for hardy and fitness)—see Table 8.2—so local fit problems are indicated from an effect size perspective, too. Exercise 1 asks you to calculate C for this analysis, given the p values for the partial correlations in Table 8.2.

TABLE 8.2. Sample Partial Correlations and p Values for a Union Basis Set of Implied Conditional Independencies for a Recursive Path Model of Illness

Conditional independence   Adjustment set     Partial correlation      p
Exercise ⊥ Stress          Hardy                   –.058             .260
Exercise ⊥ Illness         Fitness, Stress          .039             .455
Hardy ⊥ Fitness            Exercise                 .089             .087
Hardy ⊥ Illness            Fitness, Stress         –.081             .118
Fitness ⊥ Stress           Exercise, Hardy         –.103             .048

Note. The p values are for two-tailed tests that the population correlation is zero.

Given the results to this point, I would reject the model as inconsistent with the data and thus begin the




respecification phase. But in this pedagogical example, we continue next to local estimation using two different methods, covariate selection with estimation in OLS regression and estimation with instruments in 2SLS regression. Doing so gives us the opportunity to appreciate that multiple estimators for the same causal effect may be available in local estimation of path models.

Estimates of Direct Causal Effects

I used the piecewiseSEM package for analysis 3a in Table 8.1 to generate the OLS estimators of unstandardized direct effects in the Roth et al. (1989) path model that are listed in the second and third columns of Table 8.3 and shown in boldface. These results control for the parents of each outcome. Because there are no causes of fitness other than exercise and also no backdoor paths between these two variables—see Figure 7.5—the adjustment set is empty (i.e., no covariates). This means that the bivariate regression of fitness on exercise is the sole OLS estimator for this effect. The unstandardized coefficient is .108 (see the table), which indicates that fitness is expected to increase by .108 points in its raw score metric, given a 1-point increase in the raw score metric of exercise. Exercise 2 asks you to interpret the unstandardized OLS coefficient for the direct effect of hardy on stress.

There are a total of three OLS estimators for the unstandardized direct effect of fitness on illness, each with a different adjustment set (see Table 8.3). Their values are generally consistent and range from –1.036 when controlling for exercise to –.849 when controlling for stress, the other parent of illness in the original model (Figure 7.5). The result just mentioned says that for every 1-point increase in fitness, there is an expected decline in illness of .849 points while holding stress constant. There are also a total of three estimators for the unstandardized direct effect of stress on illness, each with just one of the variables fitness, exercise, or hardy as the adjustment set (Table 8.3). Values of these alternative estimators are all positive and generally consistent. Exercise 3 asks you to interpret the OLS coefficient for the direct effect of stress on illness while controlling for fitness, the other parent of illness.

Reported in the fourth and fifth columns in Table 8.3 are the 2SLS estimates of unstandardized direct effects in the Roth et al. (1989) path model. These results are inconsistent or plainly anomalous for some effects, and thus problematic. For example, the estimate for the direct effect of exercise on fitness is negative, or –.646 (i.e., more exercise, less fitness) when the instrument is hardy, but the coefficient for the same direct effect is positive, or .719 (i.e., more exercise, more fitness) when the instrument is stress.
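In R, the alternative OLS estimators reported in Table 8.3 each amount to a single lm( ) call on the generated data (the data frame dat from the earlier sketch is assumed):

# Sole OLS estimator of Exercise -> Fitness (empty adjustment set)
coef(lm(fitness ~ exercise, data = dat))["exercise"]

# Three OLS estimators of Fitness -> Illness, one per adjustment set
coef(lm(illness ~ fitness + stress,   data = dat))["fitness"]
coef(lm(illness ~ fitness + exercise, data = dat))["fitness"]
coef(lm(illness ~ fitness + hardy,    data = dat))["fitness"]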

TABLE 8.3. Unstandardized Local Estimates for Direct Effects in a Recursive Path Model of Illness

                      OLS                                2SLS
Effect                Estimate         Adjustment set    Estimate              Instrument
Exercise → Fitness    .108 (.013)      —                 –.646 (1.377)         Hardy
                                                          .719 (.687)          Stress
Hardy → Stress        –.203 (.045)     —                 1.469 (3.252)         Exercise
                                                         –1.637 (1.240)        Fitness
Fitness → Illness     –.849 (.162)     Stress            –.558 (.443)          Exercise | Stress
                      –1.036 (.183)    Exercise          –6.927 (8.533)        Hardy | Stress
                      –.951 (.168)     Hardy
Stress → Illness      .574 (.089)      Fitness           88.191 (5,980.901)    Exercise | Fitness
                      .628 (.091)      Exercise          1.180 (.431)          Hardy | Fitness
                      .597 (.093)      Hardy

Note. OLS, ordinary least squares; 2SLS, two-stage least squares. Adjustment sets are minimally sufficient. Standard errors are reported in parentheses. Values in boldface for OLS estimates control for parents of each outcome, and values in italic boldface for 2SLS estimates are contradictory in sign for the same effect or out-of-bounds (invalid).
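For analysis 3b, a 2SLS estimate can be requested along the following lines with the systemfit package (a sketch under assumed data and variable names; the actual analysis is in the script roth-effects-2sls.r):

library(systemfit)

# 2SLS estimate of Exercise -> Fitness with hardy as the instrument
fit_2sls <- systemfit(fitness ~ exercise, method = "2SLS",
                      inst = ~ hardy, data = dat)
summary(fit_2sls)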




There is a similar inconsistent pattern of 2SLS estimates for the direct effect of hardy on stress depending on the instrument, exercise or fitness (see the table). Both 2SLS estimators for the direct effect of fitness on illness are negative, but the magnitude of the result when the conditional instrument is Hardy | Stress exceeds by more than 10-fold the magnitude of the estimate when the conditional instrument is Exercise | Stress, or, respectively, –6.927 versus –.558 (Table 8.3). Finally, the standard error for the 2SLS estimate of the direct effect of stress on illness, or 5,980.901, is so large compared with the observed standard deviation of the illness variable (62.48; Table 4.3) that no meaningful interpretation seems possible (i.e., it is out-of-bounds, and thus invalid).

There are features of the Roth et al. (1989) data set that handicap estimation with instrumental variables. For instance, the sample correlation between exercise and hardy is practically zero (r = –.03; Table 4.3), so these variables would be weak instruments for one another. Another example is that the conditional instrument Exercise | Stress is essentially unrelated to stress (r = –.001) and, thus, it is a weak instrument when estimating the coefficient for the direct effect of stress on illness (the estimate was invalid; see Table 8.3). Given these problems, estimation with the 2SLS method is not pursued further in this example.

Disturbance Variances

The second column of Table 8.4 lists the observed variances (s2) for the outcome variables fitness, stress, and illness, and the third column lists the values of R2 where the predictor variables are the parents of each outcome. Proportions of explained variation range from .053 for stress to .177 for illness. The standardized disturbance variances are calculated as 1 – R2, or the proportion of variance not explained, for each outcome. For example, R2 = .152 for fitness and 1 – .152 = .848, so exercise does not explain .848 of the total variation in fitness.3 Unstandardized disturbance variances are calculated as (1 – R2)s2. For fitness, the unstandardized disturbance variance is calculated as .848 (338.56), or 287.099. Exercise 4 asks you to interpret the results in Table 8.4 for illness.

3 Values of R2 adjusted for shrinkage could be substituted for unadjusted R2 in these calculations.

Parametric Model Diagram with Estimates

Now we have unstandardized OLS estimates for all direct effects and disturbance variances in the Roth et al. (1989) path model. They are shown in their proper places in Figure 8.1(a) depicted in full McArdle–McDonald RAM symbolism. Estimates for direct effects on illness control for both of its parents, fitness and stress (Table 8.3). Because the variables exercise and hardy are exogenous, their variances and covariances are also model parameters, but these values are just the corresponding descriptive statistics (Table 4.3).
this example. unadjusted R2 in these calculations.

TABLE 8.4. Unstandardized Ordinary Least Squares Estimates of Disturbance Variances for a Recursive Path Model of Illness

Outcome     s2          R2      1 – R2    Unstandardized estimate
Fitness     338.56      .152    .848      287.099
Stress      1,122.25    .053    .947      1,062.771
Illness     3,903.75    .177    .823      3,212.786

Note. The parent(s) of fitness, stress, and illness are, respectively, exercise, hardy, and both fitness and stress. The 1 – R2 values are the standardized estimates of disturbance variances.
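The entries in Table 8.4 can be reproduced directly from regression output, as in this R sketch for the fitness equation (again assuming the generated data frame dat):

# Disturbance variance for fitness: standardized is 1 - R2,
# unstandardized is (1 - R2) times the observed variance s2
fit <- lm(fitness ~ exercise, data = dat)
R2 <- summary(fit)$r.squared
s2 <- var(dat$fitness)
c(standardized = 1 - R2, unstandardized = (1 - R2) * s2)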




(a) Unstandardized estimates. (b) Standardized estimates.
FIGURE 8.1. A recursive path model of illness with ordinary least squares estimates. Results for illness are based on both fitness and stress as parents.

Because not all measured variables in Figure 8.1(a) have the same raw score metric, values of unstandardized path coefficients, such as for direct effects on illness from fitness (–.849) and stress (.574), cannot be directly compared. This is not a problem for the standardized coefficients, which are presented in Figure 8.1(b). For example, the standardized direct effect of fitness on illness is –.250, which says that for every increase in fitness of 1 standard deviation, the level of illness is expected to decline by .25 standard deviations while controlling for stress. The standardized direct effect of stress on illness is .308, so the level of illness is predicted to increase by about .30 standard deviations for every increase in stress of 1 standard deviation while controlling for fitness. Because both results just mentioned are expressed in a common metric (standard deviation units), they can be directly compared: The absolute magnitude of the standardized direct effect of stress on illness exceeds that of fitness by about 23% (.308/.250 = 1.23). Exercise 5 asks you to interpret the standardized direct effects of exercise and hardy on their respective outcomes.

Figure 8.1(a) for the unstandardized estimates does not include regression intercepts, or values of predicted scores when scores on all predictors equal zero, for fitness, stress, and illness, the outcome variables. (All intercepts are zero in the standardized solution.) Intercepts are usually reported in output from regression computer procedures—see the output files for analysis 3a for this example (Table 8.1)—and their values could be reported for each outcome along with those for unstandardized regression coefficients. In contrast,




some SEM computer tools do not generate or print intercepts unless specifically instructed to do so, but we will consider this issue in the next chapter.

Indirect Effects

The fact that indirect effects do not automatically warrant interpretation as "mediation" in cross-sectional designs with no temporal precedence or a clear conceptual time-ordering of cause, mediator, and outcome was discussed in Topic Boxes 6.1 and 7.1. For historical completeness, the four steps by Baron and Kenny (1986) for testing mediation are described in Topic Box 8.1. Note that simply following the four steps does not by itself "prove" mediation. That is, analysis is insufficient to establish mediation without strong theory.

The direct effects of exercise and hardy on illness in the Roth et al. (1989) path model are both fixed to zero. Each presumed causal variable just mentioned has a single indirect effect on illness, exercise through fitness, and hardy through stress (Figure 8.1). Both of these indirect effects are also total effects, so they can be estimated in two different ways: (1) as products of coefficients from the direct effects that comprise each part of the whole indirect pathway, and (2) through covariate adjustment. Both types of estimates just mentioned are described next.

The second column of Table 8.5 gives the values of the product estimators for both indirect effects in the Roth et al. (1989) path model. The unstandardized estimate for the effect of exercise on illness through fitness is –.092, which equals the product of the two unstandardized coefficients for the two direct effects that make up the indirect pathway, or .108 (–.849) (see Figure 8.1(a)). Note that the second term of the product for the effect of fitness on illness, –.849, controls for stress, the other parent of illness. In words, the unstandardized estimate for the indirect effect means that .092 is the expected decrease in illness in its raw score metric while holding exercise constant and increasing fitness to whatever value it would attain under a one-point increase in the raw score metric of exercise (Pearl, 2009, pp. 355–358). This definition is actually counterfactual because it expresses what could happen (a decrease in illness), if a previous condition had been different (increasing fitness to the level it would be after an increase in exercise).

The standardized product estimator for the indirect effect of exercise on illness through fitness is –.099 (Table 8.5). It is calculated as .390 (–.250), which is the product of the standardized coefficients for the direct effects that compose the indirect pathway (Figure 8.1(b)). Thus, illness is predicted to decrease by .099 standard deviations while keeping exercise constant and increasing fitness to the level it would be under an increase in exercise of a full standard deviation. Exercise 6 asks you to reproduce the calculations for product estimators of the unstandardized and standardized indirect effect of hardy on illness through stress reported in Table 8.5 and interpret both path coefficients. Note that indirect effects in the table are based on the direct effect of fitness on illness controlling for stress or on the direct effect of stress on illness controlling for fitness. There are other potential product estimators for direct effects of fitness and stress on illness that control for different variables—see the OLS estimates in Table 8.3. Thus, multiple product estimators for each indirect effect through fitness and stress are available in this example.

Because product estimators of indirect effects have complex distributions over random samples, it can be challenging to estimate their standard errors. The best-known example of a method amenable to hand calculation for unstandardized indirect effects that involve just three variables is the Sobel approximate standard error (Sobel, 1982). Suppose that a is the unstandardized coefficient for the direct effect X → W and that SEa is its standard error. Let b and SEb, respectively, stand for the same things for the direct effect W → Y. The product ab estimates the unstandardized indirect effect of X on Y through W, and its standard error is approximated as

SEab = √(b2 SEa2 + a2 SEb2)    (8.2)

Values of the Sobel standard errors for both unstandardized indirect effects in the Roth et al. (1989) path model are reported in Table 8.5. Exercise 7 asks you to reproduce the calculations for the standard error of the unstandardized indirect effect of exercise on illness through fitness.

In large samples, the ratio z = ab/SEab is the Sobel test for the unstandardized indirect effect. A web page by K. Preacher automatically calculates the Sobel test.4 The same calculator also gives results for the Aroian test and the Goodman test, each of which is based on somewhat different approximations of the standard error compared with the Sobel test.

4 http://www.quantpsy.org/sobel/sobel.htm
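Equation 8.2 and the Sobel test take only a few lines of R. The coefficients and standard errors below are illustrative values, not the ones from this example:

# Hypothetical unstandardized coefficients and their standard errors
a <- .50; SEa <- .10   # for X -> W
b <- .30; SEb <- .08   # for W -> Y, controlling for X

ab   <- a * b                              # product estimator of the indirect effect
SEab <- sqrt(b^2 * SEa^2 + a^2 * SEb^2)    # Sobel standard error (Equation 8.2)
z    <- ab / SEab                          # Sobel test statistic
round(c(ab = ab, SE = SEab, z = z, p = 2 * pnorm(-abs(z))), 3)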




TOPIC BOX 8.1

Mediation: The Four Steps and Assumptions


Baron and Kenny (1986) described the application of multiple regression over four steps to estimate
indirect effects among continuous variables. The steps were originally phrased in terms of statistical signifi-
cance, but that language was later changed to refer to zero versus nonzero coefficients. This is because
coefficients that are trivially small can be significant in large samples while very large coefficients can fail
to be significant in small samples (Kenny, 2021). This means that statistical significance is not a decision
criterion when estimating indirect effects (or any other kinds of effects). The four steps listed next refer to
Figure 8.2, where X, M, and Y designate, respectively, the hypothesized cause, mediator, and outcome,
and where a, b, c′, and c represent coefficients for direct effects between these variables:

1. The cause affects the outcome ignoring the mediator, that is, coefficient c in Figure 8.2(a) is not
zero.
2. The cause affects the mediator, that is, coefficient a in Figure 8.2(b) is not zero.
3. The mediator affects the outcome controlling for the cause, that is, coefficient b in Figure 8.2(b) is not zero.
4. To claim that the mediator is completely responsible for the relation between cause and outcome,
coefficient c′ should be zero.

In Figure 8.2(b), the product ab estimates the indirect effect of X on Y through M. The quantity ab + c′ estimates the total effect of X on Y, or the sum of the direct and indirect effects of X. It also equals coefficient c in Figure 8.2(a) where X is the sole cause of Y. For continuous variables,

c – c′ = ab    (8.3)

That is, the difference between the total effect of X in Figure 8.2(a) ignoring M and the direct effect of X in Figure 8.2(b) controlling for M equals the product estimator for the indirect effect.

(a) No indirect effect. (b) Full model.
FIGURE 8.2. Models for putative cause, mediator, and outcome variables, respectively, X, M, and Y. Partial model with no indirect effect (a). Full model with direct and indirect effects (b).

Equation 8.3 does not hold when Y is a binary outcome variable analyzed using logistic regression or probit regression. This is




because the variance of the outcome variable is not fixed across the models in Figures 8.2(a) and 8.2(b)
analyzed in logistic or probit regression. In contrast, the scale in standard regression analyses for continu-
ous outcomes is constant over equations—­see MacKinnon (2008, chap. 11) for more information and
examples.
The requirement in Step 1 just listed that coefficient c should not be zero is problematic because it
does not allow for inconsistent mediation—also called competitive mediation—where the signs
of ab and c′ in Figure 8.2(b) are different. In this case, the total effect c in Figure 8.2(a) could be zero even
though the size of the product estimator ab is appreciable. Suppose that

a = –.50, b = .30, and c′ = .15

for Figure 8.2(b). The indirect effect of X on Y through M is ab = –.50(.30) = –.15, which exactly cancels
out the direct effect of X on Y, or .15, when the total effect is computed, or

c = ab + c′ = –.15 + .15 = 0

The situation where the coefficients for the two constituent direct effects of an indirect effect for three variables (e.g., a and b in Figure 8.2(b)) have the same sign is called consistent mediation or complementary mediation (Zhao et al., 2010).
James and Brett (1984) argued that Step 3 just listed should be modified by not controlling for the
causal variable X in Figure 8.2(b) when estimating coefficient b for the direct effect of the mediator, if the
hypothesis involves complete mediation. This means that the cause is unrelated to the outcome when
the mediator is held constant. If so, then including the cause adds nothing to the prediction of the outcome
over what is already explained by the mediator (i.e., c′ = 0). Step 4 refers to the expectation for complete
mediation. In contrast, partial mediation is indicated when c′ ≠ 0; that is, the mediator is not solely
responsible for the observed association between cause and outcome. Because complete mediation is not
always expected, though, there may be little harm in routinely controlling for the cause in Step 3 (Tate,
2015).
In consistent mediation, controlling for the mediator weakens the association between the cause and
outcome variables, or c′ < c in absolute value for Figure 8.2. Suppression can be described as a special case of inconsistent mediation where controlling for the mediator strengthens the association between cause and outcome, or c′ > c in absolute value. That is, the relation between cause and outcome is strengthened by the mediator's omission. In general, suppression is indicated when the indirect effect and the total effect have opposite signs (Rucker et al., 2011). Another indication of suppression is when the
magnitudes of the direct and indirect effects exceed that of the total effect (Lachowicz et al., 2018)—see
MacKinnon et al. (2000) and Zhao et al. (2010) for more information.
Here is an example of suppression in an actual mediation study of demoralization in breast cancer
patients after primary therapy (Peng et al., 2021): The variables are X = stress, M = demoralization, and Y = psychological well-being, all estimated as common factors, and

a = .48, b = –.85, and c′ = .37

The indirect effect is ab = .48(–.85) = –.41, the total effect is –.41 + .37 = –.04, which is also the model-­
implied correlation between X and Y. Thus, the relation between stress and well-being increases from –.04





ignoring demoralization to .37 after controlling for it. Without including demoralization in the analysis, a researcher could falsely conclude that stress and well-being are basically unrelated among these patients.

In too many mediation studies, researchers have uncritically followed the four-step method while applying statistical significance as basically the sole criterion for interpretation. The "logic" works like this: If the product estimator ab is "significant," then variable M mediates at least part of the effect of cause X on outcome Y. But without also addressing assumptions, effect size, research design, and theory, the conclusion just stated is unwarranted; Zhao et al. (2010) described additional myths in mediation analysis. New developments in mediation analysis, outlined in Chapter 20, are making the four-step method ever more obsolete.
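As an illustration only, the four steps of the topic box translate into three regressions in R (simulated, hypothetical data; recall that following the steps does not by itself establish mediation):

# Simulated cause X, mediator M, and outcome Y (hypothetical data)
set.seed(2)
X <- rnorm(300)
M <- .5 * X + rnorm(300)
Y <- .4 * M + .2 * X + rnorm(300)

c_total <- coef(lm(Y ~ X))["X"]        # Step 1: total effect c
a       <- coef(lm(M ~ X))["X"]        # Step 2: path a
fit     <- lm(Y ~ X + M)
b       <- coef(fit)["M"]              # Step 3: path b, controlling for X
c_prime <- coef(fit)["X"]              # Step 4: direct effect c'

# For continuous variables, c - c' equals the product ab (Equation 8.3)
round(c(ab = a * b, c_minus_c_prime = c_total - c_prime), 3)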

Specifically, the value of the Sobel standard error is smaller than the Aroian standard error but larger than the Goodman standard error (e.g., MacKinnon et al., 2002, p. 85). Thus, it can happen that the same indirect effect is "significant" in the Goodman test but "not significant" in the Sobel test or in the Aroian test. It can be difficult to know which outcome is correct in this case because all results are approximate. Thus, p values from the Sobel and related tests should not be overinterpreted (i.e., avoid dichotomania).

The Sobel test assumes normality, but distributions of product estimators are not generally normal; instead, such distributions are often asymmetrical with high kurtosis (MacKinnon et al., 2002). The test requires large samples, and p values in small samples can be very inaccurate. The test is restricted to unstandardized indirect effects composed of just three variables. An alternative method is nonparametric bootstrapping, which does not assume normality. Nonparametric bootstrapping can be applied to direct or indirect effects, and indirect effects can be composed of ≥ 3 variables. The method generates a bootstrapped confidence interval for a particular effect. If the value of zero is not included within the bounds of a bootstrapped 95% confidence interval, then the corresponding effect could be considered as "significant" at the .05 level for a nondirectional test. But if the confidence interval includes zero, then the effect could be considered as "not significant."

Of course, there is no requirement to interpret a confidence interval as a significance test. This is because from the perspective of interval estimation, all values within a confidence interval are considered as basically equivalent within the limits of sampling error at a particular level of confidence (i.e., 1 – α). For example, if zero falls within the bounds of a confidence interval, it has no more special status than any numerical value contained by the interval.

TABLE 8.5. Unstandardized and Standardized Estimates of Indirect Effects in a Recursive Path Model of Illness

Effect                          Product estimator      Estimated as total effect   Adjustment set
Exercise → Fitness → Illness    –.092a (.021) –.099    –.080 (.048) –.085          Hardy
                                                       –.059 (.046) –.063          Stress
Hardy → Stress → Illness        –.116b (.031) –.071    –.267 (.084) –.163          Exercise
                                                       –.231 (.081) –.140          Fitness

Note. The estimator is ordinary least squares for all results. The adjustment set is minimally sufficient when estimating each indirect effect as a total effect. Standard errors for product estimators are Sobel standard errors. Estimates are reported as unstandardized (standard error) standardized.
a Bootstrapped 95% confidence interval is [–.131, –.053].
b Bootstrapped 95% confidence interval is [–.195, –.065].




Preacher and Hayes (2004) described macros for SPSS and SAS/STAT that generate bootstrapped confidence intervals for unstandardized indirect effects that involve just three variables. Preacher and Hayes (2008) described revised SPSS and SAS/STAT macros and also syntax for Mplus and LISREL that extend nonparametric bootstrapping methods to models with multiple intervening variables or indirect pathways composed of ≥ 3 variables. Hayes (2022) described PROCESS, a macro for R, SPSS, and SAS/STAT for analyzing a wide range of models with indirect effects based on nonparametric bootstrapping. There are also R packages that can generate bootstrapped confidence intervals for direct or indirect effects in path models. Examples include MBESS (Kelley, 2022) and bmem (Zhang & Wang, 2022), which also has extensive capabilities for handling missing data in mediation studies (Zhang & Wang, 2013).

Confidence intervals or significance tests based on nonparametric bootstrapping are not magic. For example, bootstrapped estimates can be severely biased in small samples, especially if sample distributions do not reflect population distributions. There are various corrections for small-sample bias (Dwivedi et al., 2017), but whether corrected estimates in a particular analysis are trustworthy is generally unknown. Also, the values of the lower and upper bounds for a bootstrapped confidence interval are potentially not unique unless the researcher specifies a seed, or the initial value of the random number generator used by the computer to select cases. Suppose for a particular seed that the value zero falls just inside the bounds of a 95% bootstrapped confidence interval, so the corresponding effect is "not significant" at the .05 level. The analysis is rerun except for a different seed, and the value zero now falls just outside the bounds of the second confidence interval. Now the same effect is "significant," again at the .05 level. This "disagreement" is not surprising because statistical results based on simulated random sampling are typically indeterminate (i.e., not unique).

Presented in the third and fourth columns of Table 8.5 are results for the indirect effects of exercise and hardy on illness estimated as total causal effects through covariate adjustment (see also Table 6.5). For example, two different minimally sufficient sets identify the total effect of exercise on illness. The unstandardized and standardized estimates with hardy as the covariate are, respectively, –.080 and –.085, and the corresponding estimates derived with stress as the covariate are, respectively, –.059 and –.063. Results across the three different estimators (including the product estimators) of the same indirect effect are generally similar: The unstandardized coefficients for the indirect effect of exercise on illness range from –.092 to –.059, and the standardized coefficients range from –.099 to –.063 (see Table 8.5). Exercise 8 asks you to verify that outcomes of significance testing are not consistent over different estimators for the same indirect effect.

For analysis 4 in Table 8.1, I used the bmem package to generate bootstrapped 95% confidence intervals for estimates of all model parameters, but next we consider only results for the indirect effects. An advantage of bmem in this analysis is that it allowed the specification that the direct effects of both exercise and hardy on illness are zero; that is, the effects of these causal variables are solely indirect—see Figure 8.1. The method specified was the bias-corrected bootstrap, which adjusts for possible asymmetry in the empirical sampling distribution by determining the proportion of bootstrapped estimates that fall below the observed result (Efron, 1987). The default total of 1,000 generated samples was not changed in these analyses. The path model was specified for analysis in bmem using syntax from the sem package (Fox et al., 2022).

The bootstrapped estimate of the standard error for the indirect effect of exercise on illness through fitness is .020, or very similar to the Sobel standard error for this effect (.021; see Table 8.5). The bootstrapped 95% confidence interval is [–.131, –.053], and the corresponding point estimate is –.092—see Table 8.5. For the indirect effect of hardy on illness through stress, the bootstrapped standard error of .032 is just slightly larger than the Sobel standard error of .031 for this effect, and the bootstrapped 95% confidence interval is [–.195, –.065] for the point estimate of –.116. Neither bootstrapped confidence interval includes zero, so both point estimates are statistically significant at the .05 level, but keep in mind the limitations of significance tests based on nonparametric bootstrapping considered earlier.

Whether any effect, indirect or otherwise, is significant or not significant at some arbitrary level of α may be irrelevant, especially if the researcher emphasizes interval estimation and also considers whether observed effect sizes are large and precise enough to matter in a particular research area. In Chapter 20, which covers enhanced mediation analysis, you will learn about additional ways to describe the magnitudes of indirect effects. The next chapter concerns global estimation, and we will consider analysis results for the same model and data as covered in this detailed example of local estimation.
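
As a hedged sketch of this kind of analysis in lavaan rather than bmem, the syntax below defines the two product estimators and requests bias-corrected bootstrapped confidence intervals. The data frame name (roth) is an assumption, the model is a simplified version of Figure 8.1, and the seed and number of bootstrap samples are arbitrary:

    library(lavaan)
    model <- '
      fitness ~ a1*exercise
      stress  ~ a2*hardy
      illness ~ b1*fitness + b2*stress
      # product estimators for the two indirect effects
      indExercise := a1*b1   # exercise -> fitness -> illness
      indHardy    := a2*b2   # hardy -> stress -> illness
    '
    set.seed(1234)  # fix the seed so the bootstrapped bounds are reproducible
    fit <- sem(model, data = roth, se = "bootstrap", bootstrap = 1000)
    parameterEstimates(fit, boot.ci.type = "bca.simple", level = .95)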




SUMMARY

In local estimation, the equation for just one outcome at a time is analyzed using a suitable regression method. It may be less susceptible to the propagation of specification error than global estimation. This means that if the equation for a particular outcome is wrong, then the error need not inevitably contaminate estimates for other outcomes. Local estimation is also the last step in the method of piecewise SEM. The initial steps are to (1) express a path model as a DAG; (2) derive the union basis set of conditional independencies implied by the graph; (3) test each of these hypothesized independence relations against sample data; and (4) conduct the multivariate d-separation test over all implied conditional independencies. A failed (i.e., significant) d-separation test signals covariance evidence against the model, but effect size, or the empirical magnitudes of departures from implied conditional independence, should be considered, too. If either the original model or a respecified version is retained, its parameters are locally estimated. There may be multiple estimators for some parameters, but their results should generally be consistent; otherwise, a problem is indicated. Global estimation is introduced in the next chapter.

LEARN MORE

Shipley (2000) and Lefcheck (2016) outline the logic of piecewise SEM, and Zhao et al. (2010) classify types of mediated effects and caution against blindly applying the classical four-step method for testing mediation.

Lefcheck, J. S. (2016). piecewiseSEM: Piecewise structural equation modelling in R for ecology, evolution, and systematics. Methods in Ecology and Evolution, 7(5), 573–579.

Shipley, B. (2000). A new inferential test for path models based on directed acyclic graphs. Structural Equation Modeling, 7(2), 206–218.

Zhao, X., Lynch, J. G., Jr., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of Consumer Research, 37(2), 197–206.

EXERCISES

1. Calculate C for the d-sep test based on the results in Table 8.2.

2. Interpret the OLS result in Table 8.3 for the direct effect of hardy on stress.

3. Interpret the OLS result in Table 8.3 for the direct effect of stress on illness, controlling for fitness.

4. Interpret the results in Table 8.4 for illness.

5. Interpret the standardized coefficients in Figure 8.1(b) for the direct effects of exercise and hardy on their respective outcomes.

6. Calculate and interpret the product estimators for the unstandardized and standardized indirect effects of hardy on illness through stress, given the coefficients in Figure 8.1.

7. In Table 8.5, the Sobel approximate standard error for the unstandardized indirect effect of exercise on illness through fitness is .021. Calculate this result based on the information in Table 8.3.

8. Verify in Table 8.5 that outcomes of significance tests for estimators of the unstandardized indirect effect of exercise on illness through fitness are not consistent.



9

Global Estimation and Mean Structures

Unlike local estimation, where equations for each outcome are separately analyzed, in global estimation—
also called simultaneous methods or full-information methods—all free model parameters are
estimated at once. Under conditions that may not hold in many actual studies, global estimation is generally
more efficient than local estimation. An efficient estimator has lower variation (sampling error) among
estimates of the same parameter than a less efficient estimator, when analyzing a correctly specified model
over random samples. This property of simultaneous methods is realized because they take greater advan-
tage of information in the data than local estimation. But this potential benefit may be more theoretical than
practical because researchers do not often analyze models known to be true over representative samples.
Thus, global estimation is not always a better choice than local estimation. There are various methods for
global estimation, but maximum likelihood (ML) estimation is probably the most widely used technique in
SEM. Types of ML methods for continuous outcomes are described. Also introduced in this chapter is the
analysis of path models with a mean structure, which implies that (1) both sample covariances and means
are analyzed, and (2) intercepts for endogenous variables and means for exogenous variables are also
estimated. A tutorial on full-information ML (i.e., FIML) methods for analyzing incomplete data sets as an
alternative to multiple imputation is also offered in this chapter.

SIMULTANEOUS METHODS AND ERROR PROPAGATION

Because simultaneous methods estimate all free model parameters at once, the overriding assumption is that the model is correctly specified. This assumption is critical due to propagation of specification error. Simultaneous methods tend to spread such errors throughout the whole model, which means that a specification error in one parameter can affect results for other parameters elsewhere in the model. Suppose that a common cause for a pair of outcomes is not included in the model (unmeasured confounding bias), but their disturbances are specified as independent. This specification error may propagate to estimation of the direct effects or disturbance variances for this pair of variables. It can be difficult to predict the direction or magnitude of this contamination, but the more serious the specification error, the more serious the resulting bias in other parts of the model may be.

When misspecification occurs, local estimation may outperform global estimation. This is because single-equation methods may better isolate the effects of error to misspecified parts of the model instead of allowing them to spread. In a computer simulation study, Bollen et al. (2007) found that bias in the ML method (global estimation) and various 2SLS methods (local estimation) was generally negligible when analyzing a correctly specified, three-factor measurement model. But for misspecified models, there was greater bias in ML results compared with those generated in 2SLS, even in large samples. More recent simulation results also support the relative advantage of single-equation estimators based on instrumental variables over ML estimation for misspecified measurement models (Nestler, 2013), but valid instruments are required; otherwise, local estimation results can be worse than those from global estimation (Jin et al., 2016).





Another requirement of global estimation is that the whole model is identified, which means that the computer can generate a unique estimate for each and every free parameter; otherwise, the analysis may fail, such as terminating with a warning or error message. Or the solution may be improper. This is not generally a problem for recursive path models, which are identified, but the requirement for model identification is more challenging for nonrecursive path models with causal loops or for models where common factors are analyzed as proxies for conceptual variables.

MAXIMUM LIKELIHOOD ESTIMATION

The ML method can be applied to a wide range of models, from classical path models with observed variables only to various types of latent variable models. The term maximum likelihood describes the principle that estimates of a parameter maximize the chance that the data were drawn from a target population. For a model with multiple free parameters, the whole set of ML estimators maximizes the joint probability distribution of the data; that is, the estimates make the sample data most probable.

For continuous outcomes, the statistical criterion minimized in ML estimation, or the fit function—also called the fitting function—is related to the discrepancy between sample covariances and those implied by the model for the same variables. The final set of parameter estimates minimizes squared differences between the respective elements of the two matrices just mentioned. The ML fit function is analogous to the least squares criterion in standard regression analysis, except that the ML fit function (1) directly concerns variables, not scores from individual cases (although data from cases are analyzed); and (2) is more complex because there are multiple predictors and outcomes—some observed, but others may be proxies for latent variables, such as common factors—in structural equation models compared with simple regression models.

Simultaneous methods in SEM, including ML, generally assume the analysis of unstandardized variables; that is, either a covariance matrix or a raw data file where the scores are not standardized (i.e., converted to normal deviates, z) is submitted. Otherwise, certain estimates, such as for standard errors, may be incorrect. You should also know that (1) equality constraints on parameter estimates are generally imposed in the unstandardized solution only; and (2) estimates constrained to equality in the unstandardized solution are typically unequal in the standardized solution. Failure to appreciate these points about equality constraints is a common source of confusion in SEM analyses. There are ways to constrain standardized estimates to be equal, but special estimation methods for standardized variables, described later in this chapter, are needed.

Parameters in the ML method are estimated iteratively in nonlinear optimization algorithms that minimize the fit function. The mathematics of ML estimation are complex, and it is beyond the scope of this section to describe them in detail—see Enders (2010, chap. 3) for a gentle introduction or Mulaik (2009b, chap. 7) for a more quantitative presentation. There are points of contact between ML estimation and single-equation OLS estimation. For example, estimates of unstandardized path coefficients for recursive path models are basically identical. The two methods may differ slightly in their estimates of unstandardized disturbance variances for reasons explained in the next section.

Somewhat larger differences between OLS and ML estimation can be observed in values of standardized path coefficients or disturbance variances (i.e., 1 – R2), especially for more "downstream" outcomes in the model specified as caused by other endogenous variables (and perhaps also by exogenous variables). This is because R2 is calculated in OLS regression as the proportion of observed (sample) variance explained, but the ML method in SEM computer tools generally estimates R2 as the proportion of predicted variance explained, given the model and its parameter estimates. Comparison of predicted variances, covariances, or correlations with their observed counterparts informs the assessment of model–data correspondence.

Variance Estimates

Population variance (σ2) is estimated in the ML method as S2 = SS/N, where the numerator is the total sum of squared deviations from the mean. In OLS estimation, σ2 is estimated as s2 = SS/df, where df = N – 1. In small samples, S2 estimates σ2 with negative bias. In large samples, values of S2 and s2 are similar, and they are asymptotic in very large samples. Variances calculated as s2 in OLS estimation may not exactly equal those calculated in ML estimation as S2 for the same variables.
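
A minimal R illustration of the two estimators just described; the data values are arbitrary:

    x <- c(2, 4, 6, 8)
    N <- length(x)
    s2 <- var(x)              # OLS-style estimator: SS/df, with df = N - 1
    S2 <- s2 * (N - 1) / N    # ML-style estimator: SS/N
    c(s2 = s2, S2 = S2)       # 6.67 versus 5.00 in this toy example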




Some computer tools for SEM, such as the lavaan package for R (Rosseel et al., 2023), automatically rescale sample variances in matrix summaries from s2-units to S2-units through the conversion S2 = (df/N)s2, unless otherwise instructed by the user. Check the documentation of your computer tool to avoid potential confusion about this issue.

Iterative Estimation and Starting Values

Implementation of ML estimation is typically iterative, which means that the computer derives an initial approximate solution and then attempts to improve these estimates through subsequent cycles of calculations. The computer therefore "auditions" somewhat different values until it finds the set of parameter estimates that is most likely to have generated the data. "Improvement" means that the overall fit of the model to the data gradually gets better. For most just-identified models with no degrees of freedom (dfM = 0), model fit will eventually be perfect. For overidentified models with dfM > 0, the fit of the model may be imperfect, but iterative estimation will continue until improvements in fit fall below a predefined value. When this happens, the estimation has converged.

The free Ωnyx computer tool for SEM (von Oertzen et al., 2015) uses a multi-agent estimation algorithm (Pinter, 1996) as it attempts to fit the model to the data. For example, after a converged solution is found and displayed onscreen, the computer continues to refine the estimates in the background. If better estimates are found later, the user is notified. The algorithm also alerts the user to multiple optima, or the existence of multiple solutions that satisfy the same fit function nearly to the same degree. This concept is analogous to the fungible weights method in regression, which generates alternative solutions that are just slightly less optimal than the least squares solution (Waller, 2008). In contrast, most SEM computer tools display just the best solution regardless of whether other solutions are nearly as good. If two solutions with quite different parameter estimates generate about the same degree of overall fit between model and data, then little confidence may be warranted in either solution.

Iterative estimation may converge faster if the computer is given reasonably accurate starting values, or initial estimates of some parameters. If these initial estimates are grossly inaccurate—for instance, the starting value for a path coefficient is positive when the actual effect is negative—then iterative estimation may fail to converge, which means that a stable solution has not been reached. Computer programs typically issue a warning if iterative estimation fails. When such failure occurs, whatever estimates were derived by the computer may warrant little confidence.

Most SEM computer tools automatically generate their own starting values, and some programs also feature options about how starting values are calculated. For example, a starting values option in lavaan is a special form of the 2SLS method for models with common factors (Bollen, 1996). A different option mimics the default starting values in Mplus, which vary according to the type of model analyzed (Muthén & Muthén, 1998–2017). Computer default starting values normally work fine, but they do not always lead to converged solutions. Fortunately, computer tools for SEM generally allow the user to specify starting values for free model parameters—see Topic Box 9.1 for suggestions and the code sketch that follows it. Another tactic is to increase the computer tool's default limit on the number of iterations to a higher value, such as from 100 to 1,000. Allowing the computer more "tries" may lead to a converged solution.




TOPIC BOX 9.1

Suggestions for User-Specified Starting Values

These recommendations are for path models with continuous outcomes. Remember that starting values are generally specified for unstandardized variables:

1. Direct effects. Think first about expected standardized direct effects. Suppose that a researcher predicts that variable Y will increase by .33 standard deviation, given a change of a full standard deviation in variable X while controlling for all other causes. Then .30 is a reasonable guess for the standardized coefficient, and the starting value for the unstandardized coefficient would be .30(SDY/SDX).

2. Disturbance variances. Now think about standardized effect sizes in terms of the proportion of explained variance (i.e., R2). Suppose that a researcher expects that all direct causes of Y will explain 15% of its variance (R2 = .15). This corresponds to a proportion of unexplained variance of 1 – .15, or .85. Thus, the starting value for the unstandardized disturbance variance would be .85(sY2).

3. Disturbance covariances. The starting value for a disturbance covariance is the product of the square roots of the disturbance variances from the two corresponding outcomes and the expected Pearson correlation between their error terms. A positive correlation (r > 0) indicates that a common unmeasured cause affects both outcomes in the same direction, but a negative correlation (r < 0) says just the opposite (one variable increases, the other decreases, given a change in the omitted cause). Suppose that Y1 and Y2 are two endogenous variables and that D1 and D2 are their disturbances. Starting values for the variances of D1 and D2 are, respectively, 9.0 and 16.0, and their expected disturbance correlation is .40. Given these values, the starting value for the unstandardized disturbance covariance would be .40(9.0 × 16.0)1/2, or 4.80.
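
In lavaan, user-specified starting values can be attached to individual parameters with the start( ) modifier in model syntax. The sketch below is illustrative only; the variable names and numeric guesses are assumptions, derived in the way Topic Box 9.1 recommends:

    library(lavaan)
    model <- '
      # start() sets only the initial estimate, not the final one
      fitness ~ start(.35)*exercise     # guess built from .30(SDY/SDX)
      illness ~ start(-.60)*fitness
      illness ~~ start(120)*illness     # unstandardized disturbance variance
    '
    fit <- sem(model, data = dat)       # dat is an assumed raw data frame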

Information Matrix and Standard Errors

Part of ML estimation involves generation of the Fisher information matrix, also called just the information matrix or Fisher matrix. It represents the amount of information that observed variables carry in the estimation of a set of unknown free parameters in the researcher's model. Intuitively, it measures the variability in the curvature or "peakedness" of the log-likelihood function that is maximized in ML estimation. The greater the variability in curvature around the point that corresponds to the converged parameter estimates, the more precise the solution (i.e., there is more information). The inverse of the information matrix is the asymptotic covariance matrix of the ML parameter estimates (Abt & Welch, 1998), a concept explained next.

Although parameters in classical frequentist statistical methods (i.e., not Bayesian) are considered as fixed values in the population, their estimates vary over random samples. Thus, the square roots of the diagonal elements in the inverted information matrix are parameter standard errors, and the off-diagonal elements are covariances between pairs of parameter estimates. Standardizing these covariances yields correlations, and a problem is indicated if any of these absolute correlations is close to 1.0. Such a result would indicate extreme linear dependency for a pair of parameter estimates that could be due to problems with the data, misspecification of the model in computer syntax such that the target model is not actually analyzed, or an attempt to analyze a model that is not really identified (e.g., Equations 3.1–3.2). Some empirical tests for identification based on the information matrix are described later in the book, but they are generally more useful when analyzing nonrecursive path models with causal loops or complex models with common factors as proxies for latent variables. Basic kinds of information matrices and computer defaults or options for selecting among them are described in Appendix 9.A.
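
As a concrete sketch, in lavaan the inverted information matrix for a fitted model (here named fit) can be inspected with generic extractor functions:

    acov <- vcov(fit)        # asymptotic covariance matrix of the estimates
    sqrt(diag(acov))         # diagonal square roots = parameter standard errors
    pcor <- cov2cor(acov)    # correlations between pairs of parameter estimates
    max(abs(pcor[lower.tri(pcor)]))  # values near 1.0 signal a possible problem
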
Inadmissible Solutions and Heywood Cases

Although usually not a problem when analyzing recursive path models, a converged solution may be inadmissible in ML estimation or other simultaneous methods. This problem is most evident by a parameter estimate with an illogical value, such as a Heywood case, named after the statistician H. B. Heywood. These cases include negative variance estimates (e.g., an unstandardized disturbance variance is –13.50) or an estimated absolute correlation > 1.0 (e.g., the correlation between a pair of common factors is 1.08). Another example of a problem occurs when the standard error is so large that no explanation seems reasonable (e.g., Table 7.3, 2SLS estimator for Stress → Illness). Causes of Heywood cases include (Chen et al., 2001; Gagne & Hancock, 2006):

1. Model misspecification (crucial paths or variables are omitted).

2. The model is not identified.

3. An error in computer syntax that specifies a different model than the one the researcher intended to analyze.




4. The presence of outliers, violation of regression assumptions (e.g., heteroscedasticity), or extreme collinearity (e.g., sample correlations close to 1.0).

5. The phenomenon of empirical underidentification occurs when the data provide too little information to estimate one or more parameters in the model (Kenny, 1979). An example is extremely low or high correlations. For example, if rXY = .98 for two continuous variables, then, practically speaking, they are the same variable. This extreme bivariate collinearity reduces by one the effective number of observations below the value generated by Rule 7.2, which, in turn, decreases the effective value of dfM, perhaps to < 0.

6. A combination of small sample sizes and only two indicators per factor in latent variable models.

7. Analysis of scores that are very imprecise (low reliability) or invalid (the target construct is not measured).

8. Bad starting values.

An analogy may help to provide context for Heywood cases: ML estimation (and related simultaneous methods) is like a religious or political zealot in that it so strongly believes the model's specification that it will do anything, no matter how crazy, to force the model on the data. Some SEM computer tools do not permit certain Heywood cases. For example, the computer may impose a lower bound—an inequality constraint—that precludes variance estimates < 0. But solutions where ≥ 1 estimates have been constrained by the computer to prevent a Heywood case should not be trusted. It is better to try to determine the source of the problem instead of constraining an error variance to be ≥ 0 and then rerunning the analysis.

Always carefully inspect the solution, unstandardized and standardized, for any sign that it is inadmissible. In written reports, comment on the admissibility of the solution and corrective actions taken (if any) (Chapter 3). Computer tools for SEM generally issue warning messages about Heywood cases. For example, an option in the lavaan function "lavInspect( )" instructs the computer to print an error message if any error variances are < 0. For a latent variable model, an error message would also be generated in lavaan if the estimated covariance matrix for the latent variables is nonpositive definite. But computer checks for admissibility are not foolproof. This means that it is possible for a solution to be inadmissible but no warning or error message was issued. Thus, it is the researcher, not the computer, who provides the ultimate quality control check for solution admissibility.

Scale Freeness and Scale Invariance

The ML method is generally both scale free and scale invariant. Scale free means that if a variable's scale is linearly transformed, a parameter estimated for the transformed variable can be algebraically converted back to the original metric. Scale invariant means that the value of the fit function in a particular sample remains the same regardless of the metrics of the original variables (Kaplan, 2009). These properties generally assume the analysis of unstandardized variables. Major forms of the ML method for continuous outcomes are described next.

DEFAULT ML

The default estimation method in many, if not most, computer tools for covariance-based SEM is a basic form of ML estimation and is referred to here as default ML. Either a raw data file or a matrix summary of the data, such as the sample covariance matrix, can generally be submitted as the input data in default ML estimation. Graham and Coffman (2012) noted that an historical name for ML estimation that required only summary statistics is maximum Wishart likelihood (MWL), named after the statistician John Wishart. Analysis of summary statistics was computationally more efficient in that fewer computer resources, such as memory or processing speed, were required. This is in part because a single likelihood function for the whole sample was analyzed in MWL estimation.

But modern computers with fast processors and large memory capacities can easily analyze raw data files in full-information ML (FIML), which estimates a likelihood function for each individual case. These casewise likelihoods (1) are eventually summed to form the overall likelihood for the whole sample, and (2) allow for a different number of variables or items for each case (e.g., Graham & Coffman, 2012, p. 282). This means that subsets of cases with different patterns of missing data (including none; i.e., complete cases) are each represented with a unique likelihood function. This property of FIML estimation for analyzing incomplete raw data files is elaborated on in the next section, so we assume complete data sets for now.
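
As a sketch of the two kinds of input just described, the calls below fit the same illustrative model in lavaan; the variable names, covariance values, and sample size are all assumptions:

    library(lavaan)
    # raw data file as input
    fit1 <- sem('y ~ x', data = dat)
    # matrix summary of the data as input; by default lavaan rescales
    # sample.cov from s2-units to S2-units, as noted earlier
    covmat <- matrix(c(100, 40,
                        40, 90),
                     nrow = 2, dimnames = list(c("x", "y"), c("x", "y")))
    fit2 <- sem('y ~ x', sample.cov = covmat, sample.nobs = 300)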




Default ML is a normal theory method that assumes multivariate normality (Chapter 4). Variations on this basic requirement—all of which assume independence of the exogenous variables and disturbances—are listed next (Bollen, 1989, pp. 126–128):

1. If all observed variables, both exogenous and endogenous, are continuous, then their joint probability distribution should be multinormal.

2. Exogenous variables are not always continuous. Examples include categorical variables, such as dummy codes, that represent group membership and power or product terms of continuous exogenous variables that represent curvilinear or interactive effects (Appendix 7.A). Distributions for such variables are nonnormal. But the consistency of default ML estimation is preserved by assuming conditional multivariate normality for the endogenous variables, or that their distributions are multinormal at every level of the exogenous variables.

3. Sometimes exogenous variables are specified as fixed, which says that (a) their values are intentionally selected by the researcher from the population of all possible levels of these variables, and (b) the endogenous variables are randomly sampled at each level of the exogenous variables (Cohen et al., 2003). Another view is that their variances, covariances, and means do not vary over samples (Bollen, 1989). This means there is no need to assume a population distribution for fixed exogenous variables; thus, unconditional multivariate normality is assumed for the endogenous variables (i.e., irrespective of the exogenous variables). See Topic Box 9.2 on the specification of measured exogenous variables as fixed versus random.

TOPIC BOX 9.2

Specification of Measured Exogenous Variables as Fixed

Some SEM computer tools offer a fixed-X option, where "X" means "exogenous." This option tells the computer that the variances, covariances, and means of exogenous variables are not free parameters. Instead, (1) their parameters in the model are fixed to their respective sample values and, thus, (2) they have no standard errors. Suppose that dummy codes represent membership in ≥ 2 groups for a categorical exogenous variable (e.g., Table 7.1, Figure 7.6). Group membership is determined without error (the assignment mechanism is known), and base rates of cases in each group are not expected to vary appreciably over samples. Here it would be reasonable to specify the dummy codes as fixed exogenous variables. But a similar specification would be unrealistic for exogenous individual difference variables, such as age, that are expected to vary over samples. It is also true that exogenous variables measured with error (i.e., rXX < 1.0) are random, not fixed, variables (Bollen, 1989).

You should know that testing hypotheses about exogenous variables generally requires specifying their parameters as free, not fixed. Suppose that two exogenous variables are predicted to be unrelated (i.e., their population covariance is hypothesized to equal zero). Then it is generally necessary to (1) specify their covariance as a free parameter and (2) explicitly constrain it to equal zero in model syntax or in the model diagram for a graphical editor. These specifications permit the eventual direct comparison of the constrained model just described with the unconstrained model where the covariance is freely estimated. If the relative fit to the data of the constrained versus unconstrained models is similar, then the hypothesis that the corresponding exogenous variables are independent is supported. But if the two exogenous variables are specified as fixed, then the computer "accepts" their observed covariance—whatever the value, zero or otherwise—as the parameter value.
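
In lavaan, this choice is controlled by the fixed.x argument; a minimal sketch, assuming a model syntax object (model) and a raw data frame (dat):

    # exogenous variances/covariances/means fixed to their sample values
    fit_fixed <- sem(model, data = dat, fixed.x = TRUE)
    # exogenous parameters treated as free, so they can be tested or constrained
    fit_free  <- sem(model, data = dat, fixed.x = FALSE)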




Additional requirements of default ML (and other variations described next) include large samples, independent scores that are not standardized, normally distributed errors, and independence of exogenous variables and disturbances. When a summary matrix is analyzed instead of a raw data file, it is assumed that the original distributions for scores on continuous outcomes are multivariate normal. An extra assumption when a path model is analyzed is that the exogenous variables are measured without error (i.e., rXX = 1.0). This requirement can be relaxed if the researcher applies a single-indicator respecification that controls for measurement error. This method requires the analysis of a model with common factors (Chapter 15).

ANALYZING NONNORMAL DATA

The possible implications of analyzing continuous but severely nonnormal outcomes with default ML are summarized next (Finney & DiStefano, 2013; Savalei, 2014):

1. Parameter estimates are generally robust against nonnormality.

2. Values of standard errors for parameter estimates can be very distorted by severe nonnormality, perhaps by as much as 25–50%, given the model and data. Specifically, leptokurtic distributions with heavier tails and a higher peak relative to the normal curve (positive kurtosis) tend to attenuate standard errors, which increases Type I errors (H0 is rejected too often).

3. In contrast, platykurtic distributions with lighter tails and lower peaks relative to the normal curve (negative kurtosis) tend to inflate standard errors, which increases Type II errors (H0 is not rejected often enough). Of the two directions for distortion, it is probably attenuation (standard errors are too small) that occurs most of the time, given severe nonnormality.

4. Values of model test statistics under ML estimation that concern the discrepancy between sample and model-implied covariances tend to be biased under conditions of moderate nonnormality, and bias gets worse with increasing nonnormality. Specifically, leptokurtic distributions tend to inflate model test statistics, which increases Type I errors, so correctly specified models are rejected too often.

5. Values of model test statistics are attenuated by platykurtic distributions, which increases Type II errors such that incorrect models are retained too often. Inflation of model test statistics probably happens more often than attenuation. These distorting effects of nonnormality on model statistics may be even greater in smaller samples, such as N < 200.

Given severe nonnormality, one option is to normalize distributions for endogenous variables through nonlinear transformations and then analyze the transformed data with default ML (Chapter 4). A potential drawback is that variables for which normal distributions are not expected may be fundamentally altered after transformation; that is, the target outcome is not actually studied. Another drawback is that meaningful original raw score metrics, such as survival time in years or treatment cost in dollars, are lost after transformation. The loss of original meaningful metrics can complicate the interpretation of the results for transformed variables.

A second option is to use default ML to generate the parameter estimates, but then apply nonparametric bootstrapping to estimate their standard errors. Nonparametric bootstrapping assumes only that the shapes of population and sample distributions are the same, not that they are all normal in shape. In this approach, standard errors are estimated in empirical sampling distributions for parameter estimates over generated samples randomly selected with replacement from the original data file. Estimating standard errors is the only role for bootstrapping in this method; otherwise, the estimates are the same as those from default ML with no bootstrapping.

Nevitt and Hancock (2001) found in computer simulations that bootstrapped estimates were generally less biased than those from default ML estimation under conditions of nonnormality and large sample sizes. But in small samples, bias and variability in bootstrapped estimates became highly inflated, and many bootstrap-generated samples had nonpositive definite data matrices (i.e., the analysis failed). Nevitt and Hancock (2001) noted that "use of the bootstrap with samples of N = 100 is unwise" (p. 374). These problems are consistent with the caution that bootstrapped estimates in small samples can be very biased.




There is a special method for generating bootstrapped estimates for model test statistics and corresponding indexes of global model fit called the Bollen–Stine bootstrap (Bollen & Stine, 1993; see also Enders, 2010). In contrast to "naïve" nonparametric bootstrapping that randomly samples with replacement from a raw data file, a transformed sample is generated from the original data in the Bollen–Stine bootstrap, and then cases from the transformed sample are randomly sampled with replacement. This process corrects for departures in empirical distributions from those expected under the null hypothesis, which is the reference distribution for model test statistics (Hancock & Liu, 2012).

In the Bollen–Stine bootstrap, the original data are transformed such that the covariance matrix in the transformed sample exactly matches that predicted by the model. This means that the model would perfectly fit the data in the transformed sample. The bootstrapped p value is the proportion of the bootstrapped model test statistics that exceed the empirical value for the same statistic in the original data. Enders (2005) described a modified Bollen–Stine bootstrap for incomplete data sets, which is available in some commercial computer tools for SEM, such as Amos (Arbuckle, 2021), and is also an option in R packages for SEM, such as bmem, semTools, and lavaan, among others.
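
A sketch of requesting the Bollen–Stine bootstrap in lavaan, again assuming model and dat from context; the number of bootstrap samples is arbitrary:

    fit_bs <- sem(model, data = dat,
                  se = "bootstrap", test = "bollen.stine", bootstrap = 1000)
    summary(fit_bs)  # model test p value is the Bollen-Stine proportion
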
ROBUST ML

A third option for analyzing continuous outcomes with severely nonnormal distributions is robust ML. The basic parameter estimates are the same as those generated in default ML, but no transformations are involved. This means that variables are analyzed in their original, unstandardized metrics. If any of these metrics are meaningful, then they are preserved in the robust ML method. Standard errors and model test statistics are not bootstrapped in robust ML, which could make the method a better choice than default ML with bootstrapping in samples that are not large.

Robust ML is a corrected normal theory method that estimates the degree of multivariate kurtosis in the raw data and then calculates robust standard errors and scaled model test statistics. A robust standard error estimates what the standard error for the corresponding point estimate would be if the distributions were normal. In the statistical literature, such estimates may be referred to as sandwich standard errors or sandwich estimators, where the term "sandwich" refers to a triple matrix product in which the "meat" (inner), or the corrections for nonnormality, is pre- and post-multiplied by the "bread" (outer), or the naïve (i.e., default ML) asymptotic covariance matrix of the parameter estimates. There are various methods for generating sandwich standard errors and corrected model test statistics, not just for different estimators, but also for the same estimator, given different assumptions (Savalei, 2014). Robust standard errors are generally larger than their unadjusted counterparts for the same parameter estimate, but not in all cases, such as when distributions are platykurtic.

A scaled (corrected) model test statistic is a global significance test of the fit of the whole model to the data matrix that is adjusted for nonnormality. In robust ML, the uncorrected statistic is the model chi-square from default ML estimation, where the degrees of freedom equal those for the researcher's model (dfM). Sampling distributions for the default ML chi-square assume multivariate normality, but this assumption is implausible in the presence of severe nonnormality. In robust ML, the degree of multivariate kurtosis in the raw data is estimated. Next, the computer applies this estimate to adjust the default ML chi-square for nonnormality. Scaled model chi-squares are generally smaller than their uncorrected counterparts (leptokurtic distributions), but not always (platykurtic distributions). Model chi-square statistics in both default ML and robust ML are described in more detail in the next chapter.

Asparouhov and Muthén (2016) described skew-SEM, an ML approach to estimating nonnormal skewed distributions that takes even greater account of information in the data compared with robust ML. It fits entire distributions—means, covariances, skew, kurtosis, and higher-level moments—to a family of parametric distributions known as skew t distributions, which include the normal, skew normal, and t distributions as special cases. The skew normal distribution is a generalization of the normal curve with an error function that allows for nonzero skew. Models specified in this approach include skew parameters for observed or latent continuous variables in multivariate skew t distributions. Next, robust standard errors are estimated with sandwich estimators. Although based on more distributional information compared with standard robust ML, estimation in skew-SEM is more complex, requires large samples, and can fail under conditions where certain inequality constraints do not hold (Asparouhov & Muthén, 2016, pp. 7–8). The estimator is available in Mplus as an "experimental" method, but is not yet widely used in practice (Muthén & Muthén, 1998–2017).
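
Returning to standard robust ML, in lavaan the robust variants are selected through the estimator argument; a minimal sketch (model and dat assumed):

    # Satorra-Bentler scaled chi-square with robust standard errors
    fit_mlm <- sem(model, data = dat, estimator = "MLM")
    # robust (Huber-White) version that also accommodates incomplete data
    fit_mlr <- sem(model, data = dat, estimator = "MLR", missing = "fiml")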




FIML FOR INCOMPLETE DATA VERSUS MULTIPLE IMPUTATION

Bentler (2010) referred to variations of the FIML estimators for incomplete data as casewise ML. These estimators work by (1) partitioning the cases in a raw data file into subsets, each with the same pattern of missing data, including none (complete cases). Next, (2) relevant statistical information, such as means and covariances, is extracted from each subset, thus retaining all cases in the analysis, whether they are complete or incomplete. Finally, (3) parameters are estimated after combining all available information over the subsets of cases. Thus, model parameters are estimated directly from the available data: Incomplete cases are not deleted, missing observations are not imputed with estimated values, and the original raw data file is not altered.

Classical (i.e., obsolete) methods for dealing with incomplete data, such as case deletion (listwise, pairwise) or single imputation, assume that the data loss mechanism is missing completely at random (MCAR) (Chapter 4). But casewise ML methods in most SEM computer tools rely on the less restrictive assumption that data are missing at random (MAR). Evidence from computer simulation studies indicates that casewise ML results are usually more accurate compared with those in classical techniques when the missing data pattern is MAR instead of MCAR (Enders & Bandalos, 2001; Peters & Enders, 2002).

Early implementations of casewise ML for incomplete data in SEM computer tools, such as Amos 4 (Arbuckle & Wothke, 1999) and LISREL 8.5 (Jöreskog & Sörbom, 1996), did not support the automatic inclusion of auxiliary variables in the analysis. Recall that auxiliary variables are not part of the researcher's model, but they are believed to appreciably covary with incomplete variables or with causes of data loss. Inclusion of auxiliary variables supports, but does not confirm, the assumption that the data loss mechanism is MAR. Accuracy of estimates in multiple imputation can be improved through the inclusion of auxiliary variables, even if those variables have missing data, too (Chapter 4). The MAR assumption may be plausible only if auxiliary variables are included in the analysis; otherwise, casewise ML estimates can be quite biased (Collins et al., 2001).

Graham (2003) described two different techniques for including auxiliary variables in structural equation models through programming (i.e., syntax about the model and variables). In the extra dependent variable (DV) method, auxiliary variables are specified as additional outcomes such that (1) they are affected by all exogenous variables and mediators (i.e., all predictors) in the model; (2) the disturbance of any auxiliary variable is specified as correlated with the disturbance for any endogenous variable in the model; and (3) disturbances for multiple auxiliary variables are specified as correlated with one another (e.g., Graham, 2003, p. 83). The extra DV method is well suited for classical path models. The saturated correlates method, which may be preferred for latent variable models, features the specifications that each auxiliary variable covaries (1) with all other auxiliary variables, (2) with any exogenous observed variables, and (3) with the disturbances for all endogenous observed variables (e.g., Graham, 2003, p. 84)—see Graham and Coffman (2012, pp. 290–292) for examples.

Under conditions of an MAR data loss pattern and when auxiliary variables are included in the analysis, estimates from casewise ML are both generally optimal and asymptotically equivalent to estimates from multiple imputation when the number of imputations is large (Lang & Little, 2018). For example, Savalei and Rhemtulla (2012) reported that casewise ML estimates of the fraction of missing information, or FMI (Chapter 4), were superior to those in multiple imputation when the number of imputations was small (e.g., ≤ 10), but the two methods generated similar results with more imputations. Savalei and Rhemtulla (2012) described a method to calculate the FMI when missing data are analyzed in the casewise ML method. In lavaan, an option in the function "summary( )" prints the value of the FMI in program output.

Implementations of casewise ML in some contemporary SEM computer tools support the automatic inclusion of auxiliary variables. For example, the function "auxiliary( )" in semTools (Jorgensen et al., 2022) automatically includes auxiliary variables in the analysis using the saturated correlates method for models analyzed in lavaan. Its use requires the specification that the variances and covariances for all exogenous variables are free parameters. Both multiple imputation and casewise ML with the automatic inclusion of auxiliary variables are available in Mplus, too.
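
A sketch of both approaches in R; the auxiliary variable names (aux1, aux2) are assumptions, and sem.auxiliary( ) is the semTools shortcut that applies the saturated correlates method to a lavaan model:

    library(lavaan)
    library(semTools)
    # casewise (FIML) estimation without auxiliary variables; exogenous
    # variances and covariances must be free parameters (fixed.x = FALSE)
    fit_fiml <- sem(model, data = dat, missing = "fiml", fixed.x = FALSE)
    # the same analysis with auxiliary variables added as saturated correlates
    fit_aux <- sem.auxiliary(model, data = dat, aux = c("aux1", "aux2"))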




Researchers already very familiar with multiple imputation may prefer it over casewise ML. After all, the two methods tend to yield similar results in large samples for the same model and data when auxiliary variables are included. Otherwise, there are some relative advantages of casewise ML over multiple imputation (Allison, 2012):

1. Because multiple imputation is based on simulated random sampling, there is no single, definitive set of results. Outcomes in bootstrapped significance tests and bootstrapped estimates of standard errors have the same essential indeterminacy: The results can change each time the method is applied to the same model and data. In contrast, there is a single, unique set of results in casewise ML.

2. For small numbers of imputations (e.g., 3–5), casewise ML is actually more efficient than multiple imputation, but the difference may not be striking, and any efficiency advantage of casewise ML is reduced by specifying more imputations.

3. There are more potential decisions to make in multiple imputation compared with casewise ML. Examples include the particular algorithm for simulating draws from a population distribution, the particular form of that distribution, the number of imputed data sets, the number of iterations between each imputed data set, and the difference between the imputation model and the analysis model (Appendix 4.A).

In contrast to the MAR assumption that observed variables predict missingness for incomplete variables, it is latent variables that predict missingness in the missing not at random (MNAR) pattern (i.e., the data loss mechanism is nonignorable; Chapter 4). There are versions of casewise ML methods for MNAR data, but they are not yet widely available in SEM computer tools, except for Mplus, at the time of the writing of this chapter. These methods are also mathematically complex, and technical problems in the analysis can prevent their successful application in a particular sample—see Appendix 9.B for more information.

ALTERNATIVE ESTIMATORS FOR CONTINUOUS OUTCOMES

Variations of ML estimation work fine in many applications of SEM with continuous outcomes, but you should be aware of other methods. The alternative estimators described next are generally simultaneous, iterative, full information, and available in many SEM computer tools.

Other Normal Theory Methods

The unweighted least squares (ULS) method is actually a form of OLS estimation that minimizes the sum of the squared differences between sample and predicted covariances. It can generate unbiased estimates across random samples, but it is not as efficient as the ML method (Kaplan, 2009). A drawback of the ULS estimator is the requirement that variables have the same scale (metric). This is because the method is neither scale free nor scale invariant. A potential advantage is that, unlike ML, the ULS method does not require a positive definite covariance matrix. It is also quite robust concerning starting values. Thus, ULS estimation could be used to generate user-specified initial estimates for a second analysis of the same model and data but with the ML method.

Generalized least squares (GLS) is a member of a larger family of methods known as fully weighted least squares (WLS) estimation, and some members of this family can be used for continuous outcomes with severely nonnormal distributions or categorical outcomes. In contrast to ULS, the GLS estimator is both scale free and scale invariant, and under the assumption of multivariate normality, the ML and GLS methods yield equivalent results in very large samples. In smaller samples, the ML method is generally more efficient. The GLS method generally requires less computation time and computer memory compared with ML. For instance, unlike the ML fit function, the GLS fit function does not require estimation in a logarithmic scale (e.g., Mulaik, 2009b, pp. 157, 165). The availability of fast processors and abundant memory in relatively inexpensive personal computers today makes the advantage of the GLS method less meaningful. I have used the GLS method to replicate analyses published in earlier works in which this method was used instead of ML (Kline, 2016, pp. 440–442).

Arbitrary Distribution Estimator

Browne's (1984) arbitrary distribution function (ADF) method is another name for the full WLS estimator that makes no distributional assumptions for any outcome variable, so it does not require multivariate normality. This is because it estimates the degree of both kurtosis and skew in the raw data. Calculations using this method are complex in part because it derives a relatively large weight matrix as part of its fit function. The dimensions, or number of rows by number of columns, of this matrix are v(v + 1)/2 × v(v + 1)/2, where v is the number of observed variables when means are not analyzed. A problem with this method is that the size of this matrix can be so large that it is difficult for the computer to derive its inverse (Finney & DiStefano, 2013).




For example, if v = 15, then the dimensions of the WLS weight matrix are 120 × 120 (i.e., 15(16)/2 = 120), which makes for a total of 120², or 14,400, elements in the matrix. Impractically large samples with thousands of cases may be needed to attain converged and stable results. Even bare-bones (i.e., uninteresting) models may need 200–500 cases in the ADF method.

In computer simulations, Olsson et al. (2000) found that the WLS method generated parameter estimates and values of global fit statistics close to those obtained in the ML and GLS methods only for larger sample sizes (N = 1,000, 2,000) and slightly misspecified models. As the severity of misspecification increased, the WLS method generated imprecise estimates and overly optimistic values of fit statistics (i.e., the model–data fit was exaggerated). The ML method was generally less affected by variations in kurtosis, sample size, and model misspecification than both WLS and GLS. Results in GLS estimation were more stable than those in the WLS method, but GLS generally required models with relatively minor misspecifications compared with ML. Finney and DiStefano (2013) noted that the performance of WLS estimation is poor under common situations of relatively large measurement models (e.g., > 2 factors, > 8 indicators) or sample sizes < 500, so it is not a realistic general option for nonnormal data in many studies.

FITTING MODELS TO CORRELATION MATRICES

The simultaneous methods described to this point generally require the analysis of unstandardized variables. If the variables are standardized, then the results may be inaccurate, including estimates of standard errors and model test statistics. This can happen if the model is not scale invariant, or if its fit depends on whether the variables are standardized or unstandardized. Whether a model is scale invariant is determined by a complex pattern of features, including how common factors are scaled and whether certain parameter estimates are constrained to be equal (Cudeck, 1989). One symptom of the lack of scale invariance when a correlation matrix is analyzed is the observation that some of the diagonal elements in the predicted correlation matrix do not equal 1.0.

Models where all variables are standardized have a correlation structure, not a covariance structure, and such models require special considerations in their analyses. The constrained estimation or constrained optimization method can be used to correctly fit a model to a correlation matrix instead of a covariance matrix (Browne, 1982). This process involves the imposition of nonlinear constraints on certain parameter estimates to guarantee that the model is scale invariant. For example, it is necessary to constrain the disturbance variances to equal 1 – R2, or the proportion of variance in each endogenous variable not explained by its direct causes in the model. This constraint is nonlinear because it involves the path coefficients for each direct cause and also their standardized variances and correlations. It can be complicated to manually program these constraints—see Cudeck (1989) and Steiger (2002)—and not all SEM computer tools support nonlinear constraints. Kwan and Chan (2011) described an alternative two-stage method to compare standardized coefficients that reparameterizes the model so that simpler linear constraints can be imposed.

Some SEM computer tools, such as the SEPATH procedure in Statistica (TIBCO Statistica, 2022), allow constrained estimation to be performed automatically by selecting an option. In Mplus, the syntax "matrix = correlation" specifies the analysis of a correlation matrix, but only in single-sample analyses for models where (1) all outcomes are continuous for a version of the WLS estimator, or (2) all outcomes are categorical. A raw data file, not a summary matrix, must be analyzed. In lavaan, the "lavCor( )" function specifies that the model is to be fitted to a correlation matrix, but the constraints must be programmed manually. Check the documentation for your SEM computer tool about the analysis of covariance matrices versus correlation matrices.

The analysis of correlation matrices when the outcomes are continuous is justifiable on at least two occasions:

1. A researcher is conducting a secondary analysis based on a source wherein correlations are reported, but not standard deviations.

2. A theoretical reason exists to impose equality constraints on standardized estimates, such as when standardized direct effects of different causes of the same outcome are presumed to be equal. When a covariance matrix is analyzed, equality constraints are imposed in the unstandardized solution only.

But analyzing correlation matrices without standard deviations (or raw scores converted to normal deviates) is generally inappropriate whenever variance differences are essential to understanding a phenomenon (Bentler et al., 2001). Examples include (1) analyses of longitudinal data when it is expected that variances will change over time (e.g., individual differences increase with maturation), and (2) comparison of multiple groups expected to differ in both means and variances for the same outcome variables.
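To make the mechanics concrete, a correlation matrix can be supplied to lavaan in place of a covariance matrix, as in the hypothetical sketch below (variable names and values are made up), but without the constrained estimation described earlier the standard errors and test statistics may be inaccurate unless the model is scale invariant:

    # Hypothetical sketch only: a 2 x 2 correlation matrix passed as if
    # it were a covariance matrix; see the cautions in the text.
    library(lavaan)
    R <- matrix(c(1.0, 0.4,
                  0.4, 1.0), nrow = 2,
                dimnames = list(c("x", "y"), c("x", "y")))
    fit <- sem('y ~ x', sample.cov = R, sample.nobs = 200)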
HEALTHY PERSPECTIVE ON ESTIMATORS AND GLOBAL ESTIMATION

Segal's law states that a person with one clock always knows the time, but a person with two clocks never knows. This adage speaks to the challenge of dealing with too much information. It may also describe how newcomers to SEM may feel after learning about the availability of so many different estimators or methods. Here is some advice about how to cope:

1. Use the simplest method that you understand that also makes reasonable assumptions.

2. Think about sample size. Certain methods, such as WLS, need much bigger samples in order for the results to be precise. Consider alternatives, such as robust ML, if the sample size is not large and there is appreciable nonnormality.

3. Remember that results can be specific to a particular method; that is, it can happen that two different methods applied to the same model and data can generate different patterns of results. For instance, small differences in estimated standard errors can make big differences in p values for significance tests of individual parameter estimates. This is yet another reason not to make hair-splitting distinctions among p values in SEM.

If two alternative estimators that yield appreciably different solutions are both viable choices, then report both sets of results instead of selecting the solution that most favors your hypotheses (Chapter 3). To do otherwise is model hacking.

DETAILED EXAMPLE

Listed in Table 9.1 are the analyses and annotated script files to simultaneously estimate the parameters of the Roth et al. (1989) recursive model of illness in Figure 7.5 with default ML. All input and output files for this example generated in lavaan can be downloaded from this book's web site. In analysis 1 in the table, the model was analyzed without intercepts for the endogenous variables, and the input data were a summary covariance matrix. In analysis 2, intercepts were estimated, and the input data were the covariances and means (Table 4.3). In both analyses, the variances and covariance of the measured exogenous variables, exercise and hardy, were specified as free parameters, and the information matrix is the default expected matrix.

TABLE 9.1. Analyses, Script Files, and Packages in R for Maximum Likelihood Estimation of a Recursive Path Model of Illness

    Analysis                                                   Script files
    1. Estimates for the covariance structure only             roth-cov-ml.r
       (means not analyzed)
    2. Estimates for both the mean and covariance structures   roth-mean-ml.r
       (covariances and means analyzed)

Note. The lavaan package was used for all analyses. Output files have the same names except the extension is ".out."
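To give a sense of the syntax, a minimal sketch of analysis 1 follows; the annotated script roth-cov-ml.r on the book's web site is the authoritative version, and the object name roth.cov below is a hypothetical stand-in for the Table 4.3 covariance matrix (N = 373 for the Roth et al. data):

    # Minimal sketch of analysis 1 (covariance structure only):
    library(lavaan)
    roth.model <- '
      fitness ~ exercise          # exercise -> fitness
      stress  ~ hardy             # hardy -> stress
      illness ~ fitness + stress  # fitness, stress -> illness
    '
    fit1 <- sem(roth.model, sample.cov = roth.cov, sample.nobs = 373,
                fixed.x = FALSE)  # exogenous variances/covariance are free
    summary(fit1, standardized = TRUE, rsquare = TRUE)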


Analysis 1 in lavaan converged normally to an admissible solution. Reported in Table 9.2 are the ML parameter estimates for the Roth et al. (1989) path model except for the variances and covariance of the measured exogenous variables (exercise, hardy), which are just the sample values (Table 4.3). As expected for the same model and continuous data:

1. Values of unstandardized path coefficients from ML estimation are essentially identical to those from single-equation OLS estimation in Figure 8.1(a) when based on the same adjustment set.

2. There are slight differences in ML versus OLS estimates of unstandardized disturbance variances (Table 8.4).

3. Somewhat greater differences between the two methods are apparent in standardized estimates for illness, which is specified as affected by other endogenous variables, fitness and stress (Figure 8.1(b)).

For example, R² = .177 for illness in OLS estimation based on the observed variance for this endogenous variable (Table 8.4). The estimate in default ML for the same outcome is R² = .160 (i.e., 1 – .840; Table 9.2), but the result just mentioned is based on the variance for illness predicted by the model. Exercise 1 asks you to calculate the predicted variance for illness, given the results in Table 9.2. Hint: You can also find the predicted variances for all endogenous variables in the output file for analysis 1 (Table 9.1).

Up to three different standardized solutions can be printed in lavaan output:

1. Std.all: The variances of all variables, latent and observed, are set to unity (1.0). This is the standardized solution reported in Table 9.2.

2. Std.lv: The variances of just the latent variables (common factors) are set to 1.0. Because there are no latent variables in the Roth et al. (1989) path model, this solution is equivalent to the unstandardized solution in lavaan output and also reported in Table 9.2.

3. Std.nox (i.e., no X [exogenous] variables): All latent and observed variables are standardized except for measured exogenous variables.

TABLE 9.2. Maximum Likelihood Parameter Estimates for a Recursive Path Model of Illness

    Parameter                        Unstandardized (SE)    Standardized
    Direct effects
      Exercise → Fitness             .108 (.013)            .390
      Hardy → Stress                 –.203 (.044)           –.230
      Fitness → Illness              –.849 (.159)           –.253
      Stress → Illness               .574 (.088)            .311
    Indirect effects
      Exercise → Fitness → Illness   –.092 (.021)           –.099
      Hardy → Stress → Illness       –.116 (.031)           –.071
    Disturbance variances
      Fitness                        287.065 (21.020)       .848
      Stress                         1,062.883 (77.830)     .947
      Illness                        3,212.567 (235.241)    .840

Note. Estimates are reported as unstandardized (standard error) standardized. Standard errors for indirect effects are Sobel standard errors. Standardized estimates for disturbances are proportions of unexplained variance.

The third option just listed may be preferred for coding variables, such as dummy codes for exogenous categorical variables (Appendix 7.A), because change in a standard deviation metric is not very meaningful for such variables. It may also be favored when continuous exogenous variables have meaningful raw score metrics, such as age in years, because such metrics are retained in the standardized solution. Check the documentation of your SEM computer tool to see how it derives a standardized solution. If ≥ 2 options are available, tell your readers which estimates are reported.

Effect Decomposition

In an effect decomposition, direct, total indirect, and total effects are computed for all pairs of causal variables and their presumed outcomes. In models with multiple indirect causal pathways between a pair of variables, total indirect effects are estimated as sums of the coefficients for each individual indirect effect. An example follows.

Suppose that variable X is specified to affect Y through two different indirect pathways,

    X → W → Y   and   X → Z → Y

The product estimators for the two unstandardized indirect effects just listed are, respectively, 1.50 and 4.25. The coefficient for the unstandardized total indirect effect is 1.50 + 4.25, or 5.75, the sum of the coefficients over both indirect pathways. Thus, the expected increase in Y in its raw score metric is 5.75 while holding X constant and increasing W and Z to whatever values they would obtain under a 1-point increase in the raw score metric of X. For the same variables and data, if the coefficient for the unstandardized direct effect of X on Y is, say, 2.50, then the unstandardized total effect of X on Y is 2.50 + 5.75, or 8.25, the sum of the coefficients for the direct effect and total indirect effects. This result says that Y is expected to increase by 8.25 points, given an increase in X of 1 point through all direct and indirect causal pathways. Standardized total indirect effects and total effects have similar interpretations except in standard deviation units instead of raw score units.

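In lavaan, such effects can be defined directly in model syntax by labeling paths and using the ":=" operator. The sketch below is hypothetical and uses the variables X, W, Z, and Y from the example just described:

    # Labeled paths plus defined indirect and total effects:
    model <- '
      W ~ a1*X
      Z ~ a2*X
      Y ~ b1*W + b2*Z + c*X
      indWY := a1*b1              # X -> W -> Y
      indZY := a2*b2              # X -> Z -> Y
      totalind := a1*b1 + a2*b2   # total indirect effect of X on Y
      total := c + a1*b1 + a2*b2  # total = direct + total indirect
    '
    # fit <- sem(model, data = mydata)  # "mydata" is a hypothetical data frame
    # parameterEstimates(fit)           # lists the defined effects with SEs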

Listed in Table 9.3 is the effect decomposition for the Roth et al. (1989) path model of illness. For example, the exogenous variable hardy is specified to have a single indirect effect on illness through stress (Table 9.2). This sole indirect effect is also the total indirect effect because there are no other indirect causal pathways between hardy and illness. The same indirect effect is also the total effect because there is no direct effect between these variables. Both fitness and stress have direct effects on illness but no indirect effects through any other variables, so these direct effects are also total effects.

TABLE 9.3. Effect Decomposition for a Recursive Model of Illness

                                       Causes
                      Exogenous                       Endogenous
    Endogenous        Exercise        Hardy           Fitness         Stress
    Fitness
      Direct          .108 (.390)     0               0               0
      Total indirect  0               0               0               0
      Total           .108 (.390)     0               0               0
    Stress
      Direct          0               –.203 (–.230)   0               0
      Total indirect  0               0               0               0
      Total           0               –.203 (–.230)   0               0
    Illness
      Direct          0               0               –.849 (–.253)   .574 (.311)
      Total indirect  –.092 (–.099)   –.116 (–.071)   0               0
      Total           –.092 (–.099)   –.116 (–.071)   –.849 (–.253)   .574 (.311)

Note. Reported as unstandardized (standardized).

Predicted Covariances and Correlations

The standardized total effect of one variable on another estimates the part of their observed correlation due to presumed causal effects. The sum of the standardized total effects and all other noncausal associations transmitted through back-door (biasing) paths, such as common cause confounding, implied by the model equals the predicted correlation, also called the fitted correlation, which can be compared against the correlation observed in the sample. Predicted covariances, also called fitted covariances, have the same general meaning, but they involve the unstandardized solution.

Essentially all SEM computer tools that calculate predicted correlations or covariances use matrix algebra methods (e.g., Bollen, 1987). There is an older method for recursive path models with continuous outcomes that is amenable to hand calculation known as Wright's tracing rules (Wright, 1934). In these rules, a predicted correlation is the sum of all standardized causal (total) effects and noncausal associations from all valid tracings by which the two variables are connected in the model, such that the value from each valid tracing is the product of the coefficients from the constituent paths. A valid tracing is defined next (Kenny, 1979):

RULE 9.1 A valid tracing means that a variable is not entered

1. Through an arrowhead and exited by the same arrowhead

2. Twice in the same tracing

Thus, no valid tracing can include more than a single unanalyzed association (↔). An alternative definition comes from Chen and Pearl (2014): A valid tracing does not involve colliding arrowheads, such as

    → ←    → ↔    or    ↔ ↔

Recall that paths blocked by a collider do not convey statistical association between the variables at either end of the path, if the collider is not included among covariates (Chapter 6).


Two principles follow from the tracing rule: (1) The predicted correlation for two variables connected by all possible paths in a just-identified portion of the model will typically equal the observed (sample) value. If the whole model is just-identified (i.e., dfM = 0), then each and every predicted correlation will exactly equal its observed counterpart. (2) But if the variables are not connected by all possible paths in an overidentified part of the model, then predicted and observed correlations may differ.

As an example of the application of the tracing rule to calculate predicted correlations with the standardized solution, look back at Figure 8.1(b) and find the variables hardy and illness. There are two valid tracings between them. One of them is the indirect causal pathway

    Hardy → Stress → Illness

The product of the standardized coefficients from ML estimation in Table 9.2 for this pathway is

    –.230 (.311) = –.072

The other valid tracing is the noncausal path

    Hardy ↔ Exercise → Fitness → Illness

and the product of the standardized coefficients¹ for the backdoor path just listed is

    –.030 (.390) (–.253) = .003

The predicted correlation between hardy and illness is the sum of the two products just calculated, or

    –.072 + .003 = –.069

The sample correlation between these two variables is –.16 (see Table 4.2), so the correlation residual is

    –.16 – (–.069) = –.091

Thus, the model underpredicts the association between hardy and illness by this amount (–.091). That the model does not perfectly reproduce the observed correlation is not surprising because there is no direct effect between these variables (i.e., this part of the model is not saturated).

¹ Remember that the standardized estimate of the unanalyzed association between exercise and hardy (both exogenous) is their observed correlation, –.030 (Table 4.3).
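The arithmetic is easy to verify in R, using the standardized estimates from Table 9.2 and the sample correlation from Table 4.2:

    # Tracing-rule check for hardy and illness:
    causal    <- -.230 * .311          # Hardy -> Stress -> Illness
    noncausal <- -.030 * .390 * -.253  # Hardy <-> Exercise -> Fitness -> Illness
    predicted <- causal + noncausal    # predicted correlation, -.069
    residual  <- -.16 - predicted      # correlation residual, -.091
    round(c(predicted = predicted, residual = residual), 3)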
Use of the tracing rules is error-prone because it can be tough to spot all of the valid tracings in larger models, and these rules do not apply to nonrecursive models with causal loops. A more complicated version of the tracing rule that includes the variances of measured exogenous variables and also unstandardized disturbance variances is needed to generate predicted variances and covariances (Bollen, 1989, pp. 85–88; Mulaik, 2009b, pp. 127–134). These are reasons to appreciate the fact that many SEM computer tools automatically calculate predicted correlations or covariances for either recursive or nonrecursive path models (and other kinds of structural equation models, too). Output from lavaan for this example analysis includes the fitted covariance matrix, fitted correlation matrix, correlation residuals, and the three additional types of residuals defined next (see analysis 1, Table 9.1).

Covariance, Standardized, and Normalized Residuals

Correlation residuals are standardized versions of covariance residuals, also called fitted residuals or raw residuals, which are differences between observed and predicted covariances. It can be difficult to interpret covariance residuals because they are not standardized. This difficulty in interpretation arises because the metric of a covariance residual depends on the scales of both variables that contribute to it. That is, covariance residuals for different pairs of variables are not directly comparable unless the original metric of all variables is the same. For example, a covariance residual of, say, –17.50, for one pair of variables does not necessarily indicate greater model–data discrepancy than a covariance residual of, say, –5.25 for a different pair, if scores from those variables are not all based on the same metric. In contrast, correlation residuals are standardized and thus are directly comparable across different pairs of observed variables regardless of their original (raw score) metrics.


Many SEM computer tools print standardized residuals, or ratios of covariance residuals over the standard errors of those differences. In large samples, this ratio is interpreted as a z (normal deviate) test. If this test is statistically significant, then the null hypothesis that the population covariance residual equals zero is rejected. This test is sensitive to sample size for models with specification error, which means that covariance residuals close to zero could be significant in a large sample. Likewise, a relatively large covariance residual could fail to be significant in a small sample. The interpretation of correlation residuals is not as closely bound to sample size but, until recently, significance tests for correlation residuals were generally unavailable. Recent, more advanced methods to test correlation residuals for statistical significance by Maydeu-Olivares et al. (2018) and Shi et al. (2020) are described in Chapters 17–18.

Under the null hypothesis that the researcher's model perfectly fits the population covariance matrix, standardized residuals should be normally distributed, but not correlation residuals under the same hypothesis. Some SEM computer programs can, as an option, print histograms of standardized residuals. Ideally these histograms would be roughly symmetrical in shape, although some departure from symmetry is likely, especially for smaller models with relatively few pairs of observed variables. But extremely skewed distributions of standardized residuals could suggest specification error. The LISREL program (Jöreskog & Sörbom, 2018) can, as an option, print quantile–quantile (Q–Q) plots of standardized residuals, which display the percentiles of empirical (sample) standardized residuals against those expected in a normal distribution. In a correctly specified model, the points in a Q–Q plot of standardized residuals should all fall along a diagonal line. The distribution is otherwise nonnormal, but actually it can be hard to discern the degree of nonnormality due to skew or kurtosis in Q–Q plots. This is because visual interpretations of Q–Q plots can be rather subjective.

Some SEM computer tools can also print normalized residuals, or ratios of covariance residuals over the standard error of the sample covariance, not the standard error of the difference between sample and predicted values. (The latter is the denominator for standardized residuals.) For the same covariance residual, a normalized residual is usually less than the corresponding standardized residual in absolute value. Accordingly, normalized residuals are more conservative as significance tests than standardized residuals; that is, p values for normalized residuals are generally higher than p values for the corresponding standardized residuals. For complex models with many common factors, the computer may be unable to calculate the denominator of a particular standardized residual. In this case, the corresponding normalized residual provides an alternative, but more conservative, significance test, if a significance test is needed at all.

Correlation residuals for this example analysis are reported in the top part of Table 9.4. There are two options in lavaan for computing correlation residuals—see Topic Box 9.3 for descriptions. The absolute residual for fitness and stress, –.133 (shown in boldface), exceeds .10; thus, the model does not explain very well the observed association between these two variables. Exercise 2 asks you to reproduce this residual correlation. Two other absolute correlation residuals are close to .10, including .082 for hardy and fitness and –.091 for hardy and illness. (Earlier, we calculated the correlation residual of –.091 using the tracing rule.) These results are generally consistent with piecewise SEM results about sample partial correlations that correspond to predicted conditional independencies for the same model and data (see Table 7.2); that is, local model fit is problematic.

TABLE 9.4. Correlation, Standardized, and Normalized Residuals for a Recursive Path Model of Illness

    Variable       1        2        3        4       5
    Correlation
    1. Exercise    0
    2. Hardy       0        0
    3. Fitness     0        .082     0
    4. Stress      –.057    0        –.133    0
    5. Illness     .016     –.091    –.038    .030    0
    Standardized
    1. Exercise    0
    2. Hardy       0        0
    3. Fitness     0        1.710    0
    4. Stress      –1.128   0        –2.552   0
    5. Illness     .331     –1.905   –2.135   1.988   1.221
    Normalized
    1. Exercise    0
    2. Hardy       0        0
    3. Fitness     0        1.574    0
    4. Stress      –1.098   0        –2.541   0
    5. Illness     .296     –1.758   –.757    .607    .279

Note. Values in boldface are correlation residuals > .10 in absolute value and standardized or normalized residuals > 1.96 in absolute value.


TOPIC BOX 9.3

Options in lavaan for Correlation Residuals


There are two basic lavaan options for computing correlation residuals (Rosseel et al., 2023). The specification

    type = "cor.bollen"

instructs the computer to separately convert the sample covariance and model-implied covariance matrices to correlation matrices before residuals are calculated. This means that each matrix is separately standardized based on the variances (squared standard deviations) in its own main diagonal. All variances in the sample covariance matrix are observed, but variances for endogenous variables in the model-implied covariance matrix are predicted by the model, which may differ from the corresponding observed variances. A different lavaan option is

    type = "cor.bentler"

which means that both the sample and model-implied covariance matrices are standardized based on the variances in just the sample covariance matrix. Because not all elements of the main diagonal in the model-implied covariance matrix are observed variances, some values of Bentler-type correlation residuals may not equal zero, but values of off-diagonal residuals for the two methods are usually similar. Unless otherwise stated, correlation residuals in lavaan output for example analyses in this book are Bollen-type residuals.
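For reference, both options (and the other residual types described in this chapter) can be requested from a fitted lavaan model object, here given the hypothetical name fit:

    residuals(fit, type = "cor.bollen")    # Bollen-type correlation residuals
    residuals(fit, type = "cor.bentler")   # Bentler-type correlation residuals
    residuals(fit, type = "standardized")  # standardized residuals
    residuals(fit, type = "normalized")    # normalized residuals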

Standardized residuals for the example analysis are reported in the middle part of Table 9.4. The test for the fitness–stress covariance residual is significant at the .05 level, z = –2.552. Other significant z tests (also shown in boldface) indicate that the model does not adequately explain the covariance of fitness with illness or the covariance of stress with illness, but the corresponding correlation residuals (i.e., standardized effect sizes) are not glaringly large (respectively, –.038, .030). As expected, values of absolute normalized residuals, reported in the bottom part of Table 9.4, are smaller than the corresponding absolute standardized residuals, and just one normalized residual, for the stress and fitness variables, is significant, z = –2.541 (shown in boldface). Thus, the different types of residuals in Table 9.4 all indicate that the relation between fitness and stress in the data is poorly predicted by the model (among other problems already mentioned).

We will see in the next chapter that the values of some, but not all, global fit statistics indicate problems for the same model and data. But I would conclude now that the fit of the example model is unacceptable, given the results about local fit (i.e., the residuals) considered to this point.² Models can theoretically "pass" global fit testing but still "fail" at the level of local fit testing. They say the devil is in the details, and those details regarding model fit are directly examined in local fit testing. See Maydeu-Olivares and Shi (2017) for more information about roles for correlation residuals and standardized residuals in model evaluation.

² To be fair, the d-sep test in the piecewise SEM analysis of the same model and data also indicated a problem (C = 19.521, p = .034), and this test concerns global fit at the level of all model-implied conditional independencies for the union basis set in Table 8.2.

INTRODUCTION TO MEAN STRUCTURES
that the fit of the example model is unacceptable, given Table 8.2.


No intercepts for the outcome variables (fitness, stress, and illness) were generated in analysis 1 for the detailed example (Table 9.1). Estimation of intercepts using SEM computer tools generally requires (1) the input of the raw data or a summary covariance matrix plus the means of all variables (i.e., a mean vector), and (2) specification that the model has both a covariance structure and a mean structure. A mean structure for a path model includes intercepts of endogenous variables and means of measured exogenous variables. In path models, these intercepts and means are for observed variables, but mean structures for other types of structural equation models can represent the estimation of intercepts or means for latent variables, a topic covered in Chapters 21–22.

A mean structure for a path model is specified by including the constant "1" in the equation for all measured variables, exogenous and endogenous, that make up the model's covariance structure. (Disturbances are not included in mean structures because their means are assumed to equal zero.) These specifications tell the computer to analyze both covariances and means. If a raw data file is analyzed, the computer will append to it in memory a new column where the score is "1" for every case (i.e., the constant). When analyzing a covariance matrix and mean vector instead of raw data, some SEM computer tools, such as LISREL (Jöreskog & Sörbom, 2018), can also analyze an augmented moment matrix (AMM), which is an average sum of squares and cross-products (SSCP) matrix that includes a row and column for the constant. The average sum of squares for the constant is 1.0, and its average cross products are the sample means—see Ghisletta and McArdle (2012) for examples.

In lavaan, the specification "meanstructure = TRUE" automatically adds a mean structure as just described. This command instructs the computer to add the constant "1" to the equations that specify the model. A more laborious alternative is for the researcher to manually specify the mean structure in syntax. For example, the specification "illness ~ 1" in lavaan includes the constant in the equation for this endogenous variable and tells the computer to estimate its intercept. Likewise, the lavaan syntax "exercise ~ 1" includes the constant in the equation for this exogenous variable and tells the computer to estimate its mean.
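A sketch paralleling analysis 2 appears below; roth.means is a hypothetical name for the Table 4.3 mean vector, and roth.model and roth.cov are as defined in the earlier sketch:

    # Sketch of analysis 2: mean and covariance structures analyzed.
    fit2 <- sem(roth.model, sample.cov = roth.cov, sample.mean = roth.means,
                sample.nobs = 373, fixed.x = FALSE,
                meanstructure = TRUE)  # estimates intercepts and means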
The basic logic of including the constant "1" in equations for variables in a simple regression model as a way to tell the computer to estimate intercepts and means along with regression coefficients is demonstrated in Exercise 3. This exercise relies on a method called regression through the origin (RTO), where the computer is instructed to omit from the unstandardized regression equation the intercept term it would otherwise automatically calculate. In standard regression analysis (i.e., with an intercept), regressing a variable on the constant "1" would cause the analysis to fail. This is because the denominators of regression coefficients in a standard analysis would be zero for predictors with no variance (i.e., the constant), and the attempt to divide by zero generates an error condition in computer arithmetic. Eisenhauer (2003) described other uses for RTO regression, such as when applying methods that correct for heteroscedasticity or autocorrelation in time series data. Your results for Exercise 3 illustrate the two basic properties stated next (see also the R demonstration after Rule 9.2):

RULE 9.2 For a continuous criterion and predictor and assuming RTO regression,

1. If the criterion is regressed on both the predictor and constant "1," the unstandardized coefficient for the constant is the intercept

2. If the predictor is regressed on the constant, the unstandardized coefficient for the constant is the mean of the predictor
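A small demonstration of Rule 9.2 in base R with arbitrary made-up scores (different from those in Exercise 3, so the exercise is not given away):

    # RTO regression: "0" in the formula drops the automatic intercept.
    x <- c(2, 4, 6, 8, 10); y <- c(5, 9, 8, 13, 15)
    unit <- rep(1, length(x))
    coef(lm(y ~ x))             # standard analysis: intercept and slope
    coef(lm(y ~ 0 + x + unit))  # RTO: coefficient for unit = the intercept
    coef(lm(x ~ 0 + unit))      # RTO: coefficient for unit = mean(x)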


We need modified rules for counting the numbers of observations and free parameters for models with both covariance and mean structures:

RULE 9.3 If v is the number of observed variables, the number of observations when means and covariances are analyzed equals v(v + 3)/2

The value of the expression just stated gives the total number of variances, nonredundant covariances, and means of observed variables. Model parameters are defined next.

RULE 9.4 The free parameters of a model with both covariance and mean structures include the

1. Intercepts of the endogenous variables

2. Means of the exogenous variables (excluding error terms)

3. Number of parameters in the covariance structure counted in the usual way for that type of model

For the path model in the ongoing detailed example, v = 5, so the total number of observations is 5(8)/2, or 20 when means and covariances are analyzed (see Table 4.3). Free model parameters include a total of

1. 3 intercepts for fitness, stress, and illness, the endogenous variables

2. 2 means for exercise and hardy, the exogenous variables

3. 10 free parameters in the covariance structure only as previously counted for this example.

The grand total of free parameters is thus 3 + 2 + 10, or 15. Thus, the model degrees of freedom are dfM = 20 – 15 = 5, which is the same as for the example model with just a covariance structure (i.e., means are not analyzed). That is, df = 0 for the mean structure because there are as many observations (means of 5 variables) as parameters (2 means, 3 intercepts). In general, analyzing means for path models in a single sample neither changes the value of dfM nor the estimates for the model's covariance structure, if no constraints are imposed on estimates for the mean structure.

Listed in Table 9.1 are the input and output files for analysis 2 with means for the example path model in lavaan. Presented in Figure 9.1 are the unstandardized solutions for parameters of both the mean structure (shown in boldface) and covariance structure, all displayed in their proper places. Note in the figure that unstandardized estimates for the covariance structure are unchanged compared with the analysis without means (see Table 9.2). Figure 9.1 includes the special McArdle–McDonald RAM symbol for a mean structure, the triangle shape with "1" (the constant) shown in the center. The symbol just described is hereafter referred to in text as "delta-1."

Coefficients in Figure 9.1 that appear next to the lines with single arrowheads that point from delta-1 to the exogenous variables, exercise and hardy, are just their respective sample means (Table 4.3). Values of intercepts for the endogenous variables are displayed along the lines with single arrowheads in the figure that point from delta-1 to fitness, stress, and illness. For example, the intercept for the regression of illness on its parents, fitness and stress, is 114.874. Exercise 4 asks you to interpret the intercept for illness in Figure 9.1, and Exercise 5 concerns standardized estimates for the mean structure.

[Figure 9.1 appears here: a path diagram in RAM notation with the delta-1 constant. Among the unstandardized estimates shown are the exercise mean (40.900), the fitness intercept (62.686), the illness intercept (114.874), the path coefficients .108, –.203, –.849, and .574, and the disturbance variances 287.065 (DF), 1,062.883 (DS), and 3,212.567 (DI).]

FIGURE 9.1. Recursive path model of illness with unstandardized parameters for the covariance structure and the mean structure. Estimates for the mean structure are shown in boldface. Values for exercise and hardy (exogenous variables) are means, and values for fitness, stress, and illness (endogenous variables) are intercepts.


Three points warrant special mention here:

1. The RAM delta-1 icon is a pretty common graphical symbol for a mean structure, but not all authors use it. For example, sometimes just the covariance structure is shown in a model diagram and information about means or intercepts is reported in an accompanying table. So it is not absolutely necessary to represent a mean structure in model diagrams.

2. Estimation of intercepts makes it easier to calculate a predicted score for each case in the metric of the corresponding outcome variable. This can be done manually in a data editor where the weights match those of the path coefficients and intercept for a particular outcome. The sem and gsem (generalized SEM) commands in Stata (StataCorp, 1985–2021) can automatically save predicted scores to the raw data file for either observed or latent endogenous variables (i.e., factor scores).

3. Robust ML estimation generally adds a mean structure to the model even if the original model has just a covariance structure. In such cases, means and intercepts are printed in the output along with estimates for the covariance structure. The Amos program requires explicit specification of a mean structure when analyzing incomplete data with the FIML (i.e., casewise ML) method (Arbuckle, 2021).

Predicted Means

Just as a model's covariance structure generates predicted covariances that can be compared with sample covariances, a mean structure gives rise to predicted means, also called fitted means, that can be juxtaposed with the observed means. A mean residual is the difference between the observed and predicted means for the same variable. If df = 0 for the mean structure (as in the example analysis), all mean residuals are zero; otherwise, some predicted means may differ from their observed counterparts. Computers use matrix algebra to compute predicted means, but the method described next can be applied by hand without great effort to the unstandardized solution in smaller models with explicit mean structures (e.g., Figure 9.1):

RULE 9.5 The predicted mean for a target variable is the sum of the

1. coefficient for the direct pathway from the constant, and

2. for each parent, if any, the product of the predicted mean for that parent and the path coefficient from the parent to the target variable, summed over all parents

Let's try an example. For the fitness variable in Figure 9.1, the coefficient for the direct pathway from delta-1 is 62.686 (i.e., the intercept for regressing fitness on exercise). It has one parent, exercise. Because exercise is exogenous, its predicted mean equals 40.900, the coefficient for the direct pathway from delta-1 to this variable. The product of the predicted mean for exercise and the path coefficient for its direct effect on fitness is 40.900 (.108), or 4.417. The predicted mean for fitness is the sum of the two coefficients just described, 62.686 + 4.417, or 67.10 at 2-decimal accuracy. This predicted mean for fitness equals the sample mean (Table 4.3) because the mean structure in Figure 9.1 is just identified. Exercise 6 asks you to calculate the predicted mean for illness in the figure.
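The same arithmetic can be confirmed in R:

    # Rule 9.5 applied to fitness, using values from Figure 9.1:
    intercept <- 62.686  # coefficient for the pathway from delta-1 to fitness
    b         <- .108    # path coefficient, exercise -> fitness
    mean.ex   <- 40.900  # predicted (= sample) mean of exercise
    intercept + b * mean.ex  # 67.10, the predicted mean for fitness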
PRÉCIS OF GLOBAL ESTIMATION

Using an SEM computer tool to estimate a path model brings with it a few amenities compared with local estimation. One is that results such as effect decompositions, predicted covariances or correlations, and residuals (raw, correlation, standardized, or normalized) are automatically computed and printed in the output. Some of these results can be calculated by hand in local estimation, but doing so for larger models is tedious. Another advantage of using SEM computer tools is that values of global fit statistics, described in the next chapter, are automatically computed and printed in the output, but this "advantage" can be a double-edged sword: Focusing exclusively on global model fit is a common but poor practice in SEM.

A drawback is that SEM computer tools do not typically inform the researcher about multiple estimates of the same effect. For example, basically all SEM computer programs estimate direct effects controlling for all parents of each outcome variable. If the same direct effect is statistically identified by other sets of covariates or by instruments, the computer program will not tell you about those other estimates. The same is true for total effects: They are typically estimated by SEM computer tools as sums of direct effects as just described and total indirect effects estimated as sums of product estimators, not through covariate adjustment or instruments (e.g., Tables 7.3, 7.5). But knowing about graphical identification rules can help the researcher to avoid this limitation.


You may be thinking, is a researcher required to use an SEM computer tool to analyze a path model? The answer is no. The use of single-equation estimators, either with covariate adjustment or instruments, is a perfectly acceptable way to estimate the parameters of a path model (e.g., piecewise SEM). Doing so is probably more familiar in biology or epidemiology than in the social sciences such as psychology or education. Although SEM computer tools offer conveniences, there are also drawbacks, namely, the absence of multiple estimators of the same effect.

SUMMARY

The default method in most SEM computer tools is maximum likelihood estimation, which is a simultaneous, full-information, normal-theory, and iterative method for continuous outcomes. Major variations on the default method include (1) robust estimation, which estimates the degree of nonnormality in the data and accordingly corrects the values of parameter standard errors and model test statistics; and (2) full-information (casewise) estimation of incomplete data sets under the assumption that the data loss mechanism is missing at random. A recent development is that versions of full-information ML estimators for data that are missing not at random are starting to appear in SEM computer tools, but such methods are relatively complex and rely on distributional assumptions for latent variables that are not directly verifiable. Adding a mean structure to a covariance structure tells the computer to estimate intercepts for endogenous variables and means for exogenous variables. Optional output from SEM computer tools often includes correlation, standardized, normalized, or mean residuals, which can provide invaluable information about local fit of the model at the level of all pairs of observed variables. Evaluating global fit and model hypothesis testing are the focus of the next chapter.

LEARN MORE

Finney and DiStefano (2013) describe estimators for nonnormal or categorical data, Lang and Little (2018) outline modern options in SEM for incomplete data, and Maydeu-Olivares and Shi (2017) discuss correlation and standardized residuals as effect sizes of model misfit.

Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 439–492). IAP.

Lang, K. M., & Little, T. D. (2018). Principled missing data treatments. Prevention Science, 19(3), 284–294.

Maydeu-Olivares, A., & Shi, D. (2017). Effect sizes of model misfit in structural equation models. Methodology, 13(Suppl.), 23–30.

EXERCISES

1. Calculate the predicted variance for the illness variable, given the results in Table 9.2.

2. Reproduce the correlation residual of –.133 in Table 9.4 for the variables fitness and stress.

3. Listed in parentheses below for N = 5 are raw scores on variables X, Y, and UNIT, a constant. Enter these scores in the data editor of a computer program for general statistical analyses:

   (3, 24, 1), (8, 20, 1), (10, 22, 1), (15, 32, 1), (19, 27, 1)

   a. Regress Y on X in a standard (i.e., with intercept) linear regression analysis and record the unstandardized regression coefficient, intercept, and R² (i.e., the squared bivariate correlation).
   b. Next, regress Y on X and UNIT, but specify that the intercept (constant) should not be included in the equation (i.e., RTO regression). Hints: In the options menu for SPSS Linear Regression, uncheck the box titled "Include constant in equation." Specify "0" in the equation of the "lm( )" function in R to exclude the intercept (e.g., "Y ~ 0 + X"). What is the unstandardized coefficient for UNIT?


   c. Now regress X on UNIT in a second RTO regression. What is the coefficient for UNIT?
   d. Assume that X is a cause of Y. Draw the path diagram with a mean structure using full RAM graphical notation. Include values of all unstandardized parameter estimates in your diagram. Generate the predicted mean for Y.

4. Interpret the intercept for illness in Figure 9.1.

5. What are the standardized estimates for the mean structure in Figure 9.1? Assume the Std.all solution in lavaan.

6. Calculate the predicted mean for illness in Figure 9.1.


Appendix 9.A

Types of Information Matrices and Computer Options

There are two basic kinds of information matrices: The expected information matrix is computed at the estimated parameter values for the analyzed model in a particular distribution, such as the normal curve for data that are both complete (no missing values) and normally distributed. The expected information estimates do not depend on the raw data except through the parameter estimates. But the observed information matrix is generated from the likelihood functions for the data, and thus observed information estimates depend on both the parameter estimates and the raw data (Savalei, 2010). Estimated standard errors in ML estimation may be labeled in computer program documentation or output as "expected" or "observed," depending on the information matrix that was their source.

In large samples with complete data, standard errors based on observed or expected information are asymptotically equivalent whether the data are normal or nonnormal. In small samples, though, standard errors based on observed information may be more accurate than estimates based on expected information (Dolan & Molenaar, 1991). In computer simulation studies of measurement models with either 2 or 4 common factors where each factor had, respectively, 8 or 4 indicators, Savalei (2010) found a slight advantage for standard errors based on observed information at the smallest sample sizes (N = 100, 200) for complete normal data, but that advantage was even more pronounced for complete nonnormal data.

Savalei (2010) noted that things are more complicated for incomplete data because observed versus expected information is no longer asymptotically equivalent, so the choice between the two can make a bigger difference. The pattern of missing data is also an issue—specifically, expected information can be readily computed only if data are missing completely at random (MCAR), which means that the data loss mechanism is independent of both observed variables (i.e., the data) and unobserved (latent) variables. But if the pattern is missing at random (MAR), which means that data loss depends on the observed variables but not also on latent variables—including the theoretically complete versions of incomplete variables in the data set—the expected information matrix may be incorrect unless the precise nature of the MAR pattern is known (Kenward & Molenberghs, 1998). In Savalei's (2010) simulation study, standard errors based on observed information were generally more accurate for MCAR data, and the advantage for observed over expected information was even greater for MAR data, under conditions of both normal and nonnormal distributions. Savalei (2010) noted that (1) whether standard errors are based on observed or expected information is not reported in most SEM studies, and (2) most researchers probably rely on computer program defaults. For true models analyzed in large samples with complete and normally distributed data, there is little expected difference between observed and expected standard errors and none in model test statistics. But in smaller samples or when the data are incomplete or nonnormal, the distinction between observed and expected information can have greater impact. We will revisit this issue in the next chapter when different kinds of model test statistics are introduced.

Analysis options are described next for lavaan (Rosseel et al., 2023), but there are similar choices in Mplus (Muthén & Muthén, 1998–2017) and some other SEM computer tools. The keyword "ML" in lavaan specifies default ML as the method for continuous outcomes with normal distributions. If there are no missing data, standard errors are based on the expected information matrix, but this default can be changed to the observed information matrix through the command

    information = "observed"

The default treatment of missing data in lavaan is listwise deletion, but the specification

    missing = "ML"

instructs the computer to (1) base standard errors on the observed information matrix, and (2) use casewise ML (i.e., FIML estimation) that assumes the data loss mechanism is MCAR or MAR.

A second option in lavaan for normally distributed, continuous outcomes is method "MLF." Parameter estimates using this method are the same as those in default ML, but the "MLF" method approximates the information matrix using computationally simpler algorithms. Specifically, it relies on the covariances of the first-order derivatives of the ML log-likelihood function to estimate covariances based on the second-order derivatives (i.e., the information matrix). Maydeu-Olivares (2017b) referred to the approximation just described as cross-products information. In lavaan, the specification

    information = "first.order"

corresponds to standard errors based on cross-products information for the "MLF" method. In very large samples with no missing data, expected, observed, and cross-products information are asymptotically equivalent (Greene, 2012). But in smaller samples or for larger models, "MLF" standard errors are probably not as accurate as those based on information matrices (i.e., estimator "ML")—see Maydeu-Olivares (2017b) for more information.


Appendix 9.B

Casewise ML Methods for Data Missing Not at Random

Special estimation methods for MNAR data generally require the analysis of two models, one for the data loss mechanism with auxiliary variables and the other for the outcomes. Such missing data models also rely on generally untestable assumptions about the distributions for observed variables with missing data versus latent variable representations of those outcomes with no missing data (Enders, 2010, chap. 10; Tang & Ju, 2018). Some of these methods are based on FIML estimators (Morikawa et al., 2017; Riddles et al., 2016), and they are starting to appear in SEM computer tools. An example follows.

Muthén et al. (2011) described the analysis of latent growth models for a longitudinal clinical trial where the data were MNAR with two different methods implemented in Mplus (Muthén & Muthén, 1998–2017). These methods represent data not only as outcomes but also as binary missing response indicators (0, 1) at each measurement occasion. With the pattern-mixture modeling method ("type = dropout" in Mplus), a series of dummy variables (codes) represents the measurement occasion(s) when dropout occurred for each case. Because dropout could occur at different points in time (including none), the dummy variables partition the sample into subgroups based on patterns of complete or incomplete data. Latent variables that represent initial status and subsequent changes are regressed on the dropout dummy variables (e.g., Muthén et al., 2011, p. 20). Thus, random means for these latent variables are estimated as mixtures over the various patterns of missingness.

In Diggle–Kenward selection modeling (Diggle & Kenward, 1994) ("type = sdropout" in Mplus), latent growth variables are regressed on discrete-time survival indicators, which equal 0 before dropout, 1 when dropout occurs, and missing for subsequent occasions. They are part of a logistic regression model for dropout. Outcomes are represented both as observed for some cases before dropout and as latent for other cases after dropout (e.g., Muthén et al., 2011, p. 21). Random means are estimated given the logistic model for dropout. Both pattern-mixture modeling and the selection model assume normality for outcomes, but part of these assumptions concern latent variables or particular dropout models, so they are not directly testable. Thus, Muthén et al. (2011) recommended sensitivity analysis of different models that vary somewhat in their assumptions. If the results vary appreciably over models, then the findings are not robust. Estimators for MNAR data are thus complex and require diligence in their application (i.e., they are not panaceas)—see Gottfredson et al. (2014) for more information, cautions, and examples.



10

Model Testing and Indexing

Introduced in this chapter are the two main categories of global fit statistics: model test statistics and approxi-
mate fit indexes. They correspond to, respectively, model testing and model fit indexing (Hayduk, 2014). The
outcome of model testing is the binary decision about whether to reject or retain null hypotheses about the
model based on p values in significance testing. In contrast, model fit indexing is based on continuous
measures of model–data correspondence; thus, it is more analogous to quantitative effect size estimation as
opposed to dichotomous significance testing. Because global fit statistics in both categories measure only
average or overall model–data correspondence, their values may fail to indicate poor local fit, or pairs of
observed variables for which the model inadequately explains their observed (sample) associations. Models
with poor local fit should not be retained regardless of their global fit. How to adjudge global fit without
neglecting local fit is the main goal of this chapter. A second aim is to describe two methods for planning
sample size in SEM: power analysis and accuracy in parameter estimation, also called precision in planning.

MODEL TESTING

Model testing corresponds to the classical school in statistics, which deals mainly with inferential tests of single hypotheses and emphasizes decision rules that should be followed by all (Little, 2013). Most researchers in psychology and related disciplines are trained in the classical school, which also makes them generally familiar with the basic logic and rules of significance testing, if not always correct in their understanding of its results (i.e., p values; Kline, 2013a). Model testing is based on the chi-square test described by Jöreskog (1969) for evaluating CFA measurement models in large samples. Shortly afterward, LISREL III—the first commercially available version for use on mainframe computers—was published (Jöreskog & Sörbom, 1976), and the only global fit statistic it printed was the model chi-square, its degrees of freedom (dfM), and p value. Because different global estimation methods do not all rely on the same distributional assumptions or share the same weight matrix (among other differences), values of the model chi-square and p—and sometimes dfM, too—can vary over methods for the same model and data. This is why the subscript in the symbol for the model chi-square used from this point on usually indicates the estimation method. For instance, chiML refers to the model chi-square in the default ML method. Chi-squares for the robust ML method are considered later.

MODEL CHI-SQUARE

Depending on the SEM computer tool, chiML is calculated in one of the two ways listed next:

    (N – 1) FML   or   N (FML)    (10.1)

where FML is the value of the fit function minimized in default ML estimation.
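For a fitted lavaan model, Equation 10.1 can be checked by hand. One caution for this sketch: lavaan reports the quantity fmin, which equals FML/2, so the product must be doubled (fit is a hypothetical fitted model object):

    N    <- lavInspect(fit, "nobs")
    fmin <- fitMeasures(fit, "fmin")  # equals FML / 2 in lavaan
    c(by.hand  = 2 * N * fmin,        # N (FML), the lavaan default
      reported = fitMeasures(fit, "chisq"))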




In very large samples, the two products in Equation 10.1 are asymptotic and, assuming multivariate normality, both follow central chi-square distributions with degrees of freedom equal to dfM. Under the assumptions just stated, both expressions in the equation are called the minimum fit chi-square. In smaller samples, the factor by which FML is multiplied (i.e., N – 1 vs. N) can explain variations in chiML over different SEM computer tools for the same model, data, and estimation method. Some computer tools, such as lavaan, allow the user to choose either expression in Equation 10.1 for computation of chiML (the default is N (FML); Rosseel et al., 2023).

The value of chiML for a just-identified model equals zero, but technically it is not defined for models with no degrees of freedom. If chiML = 0, the model perfectly fits the data: Each observed variance or covariance exactly equals its predicted counterpart. If the fit of an overidentified model that is incorrectly specified becomes increasingly worse, chiML increases in value; thus, chiML is a badness-of-fit statistic (i.e., the higher the value, the worse the fit).

Another way to view chiML is as a likelihood ratio test that compares the difference in fit between the researcher's overidentified model, represented by the null hypothesis H0, and whatever unspecified model, represented by the alternative hypothesis H1, that would generate a predicted covariance matrix that exactly matches the sample covariance matrix. Suppose that chiML > 0 and dfM = 5 for the researcher's model (H0). Adding five more free parameters to this model would make it just-identified—thereby making its covariance implications perfectly match the data covariance matrix, even if that model were incorrectly specified—and reduce both chiML and dfM to zero. An unrestricted model with perfect fit corresponds to H1, as do all equivalent versions of that model. Thus, chiML can also be defined as

    –2 ln (L0 / L1) = –2 ln L0 + 2 ln L1    (10.2)

where L0 is the likelihood function maximized in ML estimation for the researcher's model under H0, and L1 represents the same quantity for an unrestricted model with perfect fit under H1.

Exact-Fit Hypothesis

In large samples and assuming multivariate normality, the null hypothesis tested by chiML for overidentified models (dfM ≥ 1) is the exact-fit hypothesis. For models with just a covariance structure (means are not analyzed), the exact-fit hypothesis can be expressed as

    H0: Σ = Σ(θ)    (10.3)

which predicts no difference between the population covariance matrix, Σ, and the model-implied covariance matrix, Σ(θ), where θ represents the model parameters. Another expression is

    H0: Σ – Σ(θ) = 0C    (10.4)

where 0C is the zero matrix of population covariance residuals in which every element is zero.

For models with both covariance and mean structures (means are analyzed along with covariances), the exact-fit hypothesis is

    H0: Σ = Σ(θ), μ = μ(θ)    (10.5)

where μ(θ) represents the vector (1-dimensional array) of model-implied means and μ is the mean vector in the population. An equivalent definition is

    H0: Σ – Σ(θ) = 0C, μ – μ(θ) = 0M    (10.6)

where 0M is the zero vector of population mean residuals in which every element is zero.

The exact-fit hypothesis is an example of a nil hypothesis that predicts no difference between population and model-implied covariances or means. Hayduk (2014) noted that "nil" in this case does not mean "unimportant" because the model and parameter estimates represent the researcher's hypotheses, and he suggested notable-null hypothesis as an alternative description. Semantics aside, I agree that the exact-fit hypothesis is not a trivial statement.

You should know that the chi-square test concerns not just the researcher's model, but also all equivalent versions that explain the data just as well despite making contradictory causal claims. There may also be near-equivalent models with similar fit to the data, but not identical. Unfortunately, the existence of equivalent or near-equivalent models is rarely acknowledged in published SEM studies. How to generate equivalent versions of path models with manifest variables is explained in Chapter 11, and these same principles can be applied to structural models with observed variables or common factors, too.


Effect of Sample Size

Unless the sample size is small, the value of chiML for true models is not affected by sample size. This is because the quantity estimated by FML for true models is approximately dfM/N (Hayduk, 2014), which basically removes the effect of sample size in the formula for chiML (see Equation 10.1). Thus, the expected value of chiML over random samples is dfM. This means that

1. chiML ≤ dfM is expected in roughly half of random samples drawn from a population where the model has perfect fit.
2. p values for chiML are ≥ .05 in about 19 out of 20 samples.
3. The exact-fit hypothesis will be rejected for correct models in less than 1 out of 20 samples when testing at the .05 level (i.e., a Type I error).

If N < 200 or so, then the expected values of chiML tend to exceed those of dfM even when the data are multivariate normal (Curran et al., 1996). Thus, values of chiML are generally too high and their p values are too low for true models analyzed in small samples.

For misspecified models—those that do not perfectly fit the population covariance matrix—the value of chiML is affected by sample size. This is because FML > dfM/N for such models, so sample size is not cancelled out in the derivation of chiML (Equation 10.1). Instead, given constant values of FML and dfM, the value of chiML increases and its p value decreases as N gets larger for incorrect models. This characteristic of chiML is not an aberration. This is basically how all significance tests work when the null hypothesis is false: Values of test statistics are determined by products of sample size and effect size, and effect size in this case means the amount of departure from perfect fit (i.e., FML).

Logic of the Chi-Square Test

In practice, it is rarely known whether the researcher's model is true. With this reality in mind, the logic of the model chi-square test is summarized next: Assuming the model has perfect fit in the population, if the p value for chiML (dfM) is less than the criterion level of statistical significance, such as α = .05, then the exact-fit hypothesis is rejected and the model fails the chi-square test; otherwise, the model passes the chi-square test (i.e., p ≥ .05). For example, the critical value of chiML (5) at the .05 level is 11.071.¹ Thus, the exact-fit hypothesis would be rejected at the .05 level if

chiML (5) ≥ 11.071

(the test is failed), but the exact-fit hypothesis is retained (the test is passed) if

chiML (5) < 11.071
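The same critical value and the p value for an observed chi-square come straight from base R, with no add-on packages (a small sketch):

qchisq(.95, df = 5)                          # critical value: 11.0705
pchisq(11.071, df = 5, lower.tail = FALSE)   # p value at the critical value: about .050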
Issues in applying the model chi-square test are listed next and discussed afterward:

1. Passing the chi-square test does not mean that the model also has satisfactory local fit.
2. It is less impressive when models with lower dfM pass the chi-square test compared with models with greater dfM.
3. Passing the chi-square test says basically nothing about predictive accuracy for individual outcomes (e.g., R2) or about accuracy of predictions for individual cases.
4. It can happen in very large samples that a model fails the chi-square test, but the residuals indicate trivial discrepancies between observed and expected values in local fit.
5. It is illogical to ignore a failed model chi-square test while touting statistically significant parameter estimates.
6. Unfortunately, there is little or no correspondence between chi-square test outcomes and the types or severities of specification error.
7. The conventional standard of α = .05 may not be optimal for the chi-square test in SEM.

It can, and does, happen—especially in smaller samples where the power of the chi-square test to detect appreciable model–data discrepancies is low—that passing models have poor local fit (i.e., the residuals are problematic). This is because chiML—and basically all other global fit statistics—collapse many discrepancies into a single measure (Steiger, 2007). Thus, a model could adequately explain associations for some, but not all, pairs of observed variables. Such models should not be retained even though they passed the chi-square test.

¹ You can use the chi-square calculator at https://statpages.info/pdfs.html


Models with very few degrees of freedom can have near-perfect fit, such as

chiML (1) = .250, p = .617

Such models may have so many free parameters relative to the number of observations that they can hardly fail to explain the data to a close degree; that is, they are overparameterized. This explains the general preference for models with greater degrees of freedom. Specifically, obtaining relatively high p values in exact-fit testing is more impressive for models with greater rather than fewer degrees of freedom. In this instance, models with a greater number of degrees of freedom have withstood a more stringent test of model–data disagreement over more dimensions (i.e., dfM) than models with fewer degrees of freedom (Raykov & Marcoulides, 2006).

Knowing that model and data are consistent within the limits of sampling error (e.g., p = .75) does not indicate that the equations for endogenous variables have relatively high R2 values. In fact, global model fit and R2 for individual outcomes are essentially independent. For example, disturbances in structural models with perfect global fit can still be large (i.e., R2s are low), which means that the model accurately captures the relative lack of predictive validity in the data. Model-predicted variances for endogenous variables include both explained and error variance or, respectively, R2 and 1 − R2 in standardized terms. Thus, a model can closely predict the variance of an endogenous variable even though R2 for that outcome is close to zero. Global model fit also says little about person-level fit, or whether the model generates accurate predictions for cases. Rensvold and Cheung (1999) described methods to assess the impact of individual records on global fit, and Coffman and Millsap (2006) demonstrated how latent growth models with relatively poor global fit can nevertheless closely match records at the case level.

For models that fail the chi-square test in very large samples, it can happen that the residuals indicate relatively small or even trivial local fit problems. If so, the researcher might reasonably argue to retain the model. McIntosh (2012) cautioned against a fallacy that could motivate authors of SEM studies to report, but then almost immediately discount, a failed model chi-square test: The myth is that a significant model chi-square in a large sample means that deviations from exact fit must be trivial in magnitude. Perhaps that is true, but without examining the residuals, the inference that "big N, small p" implies trivial model–data discrepancies has no basis. Unfortunately, many—and perhaps most—authors who dismiss failed model chi-square tests report nothing about local fit (e.g., Seixas et al., 2018).

It is illogical to ignore covariance evidence against the model—a failed chi-square test—but then tout the statistical significance of individual parameter estimates as supporting the researcher's hypotheses (Markland, 2007). To do so is another form of confirmation bias, in this case disregarding evidence against the whole model that could foster the publication of false claims (Hayduk, 2014). The solution is to take seriously the results of the model chi-square test while also transparently reporting all evidence about model fit, including the residuals.

Although p values are continuous measures of departure from exact fit, they reflect neither the severity nor the nature of the specification error. One reason is equivalent models, which all have identical chi-square p values even though those equivalent models represent contradictory causal hypotheses. Just about any just-identified model (dfM = 0) will perfectly fit the data, although that model—and all equivalent versions, too—could be very wrong. So even perfect fit does not rule out severe misspecification (Hayduk, 2014). By the same logic, slight discrepancies between model and data in local fit testing do not rule out possible serious specification error, such as reversal of true causal effects.

Although p < .05 is a common standard in significance testing, it is not a golden rule (Wasserstein et al., 2019), and in SEM the convention p < .05 for the chi-square test may be too lax. True models are just as likely to have a p value in the region of .05 as in the .95 range, and the same is true for the .25 and .75 regions. These statements reflect the fact that p values are uniformly distributed under the null hypothesis; that is, all values are equally likely over the interval [0, 1.0] (Hung et al., 1997). So, striving for correctly specified models in SEM is striving for models with p values that should ideally be considerably > .05 (Hayduk, 1996).
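The uniformity of p values under a true model is easy to check by simulation. A sketch with a hypothetical two-factor population model (assumes lavaan; 500 replications for speed):

pop <- "F1 =~ .7*x1 + .7*x2 + .7*x3
        F2 =~ .7*x4 + .7*x5 + .7*x6
        F1 ~~ .4*F2"
mod <- "F1 =~ x1 + x2 + x3
        F2 =~ x4 + x5 + x6"
pvals <- replicate(500, {
  d <- simulateData(pop, sample.nobs = 400)
  fitMeasures(cfa(mod, data = d), "pvalue")
})
hist(pvals)  # roughly flat over [0, 1] because the fitted model is true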
Accept–Support Testing and Power

The model chi-square test is an accept–support test, where it is the failure to reject the null hypothesis that supports the researcher's model. This is the opposite of a more conventional reject–support test, where rejecting


the null hypothesis supports the researcher's predictions. Of the two, accept–support testing is logically weaker because the failure to disprove an assertion (the exact-fit hypothesis) does not substantiate the truth of that assertion (Steiger, 2007). Specifying lower criterion values for statistical significance in reject–support testing, such as p < .001, guards against false claims. This is because a Type I error means in this context that the researcher's theory is wrong. But in accept–support testing, we should worry more about Type II error because false claims in this circumstance arise from not rejecting the null hypothesis.

Potential consequences of low power are also different: In accept–support testing with chiML, low power means that there is little chance of detecting a false model. This fact implies that analyzing a model in a sample that is too small (i.e., low power) makes it more likely that the model will be retained, which favors the researcher's hypotheses. In reject–support testing, though, the penalty for low power due to an insufficient sample size is that the researcher's hypotheses are less likely to be supported. The implications of low power are another reason why insisting on p < .05 as a gold standard in accept–support testing for SEM is generally a mistake.

Other Factors That Affect the Model Chi-Square

Listed next and considered afterward are factors other than sample size and specification error that can affect values of chiML:

1. Violation of multivariate normality.
2. Model size.
3. Correlation size.
4. Measurement error (unique variance) and high multicollinearity.
5. Interactive effects.

Depending on the pattern and severity of nonnormality, the value of chiML can be either increased so that model fit appears worse than it really is or decreased so that model fit looks better than it really is (Finney & DiStefano, 2013). This is the reason why it is so important to screen your data for severe nonnormality when using default ML. Corrected model chi-squares associated with methods that do not assume normality are described later in this chapter.

The model size effect concerns the potential impact of fitting models with relatively large numbers of variables, such as ≥ 30, in smaller samples, such as N < 200, on values of global fit statistics (Shi et al., 2019). Values of chiML tend to be inflated under the conditions just stated for normally distributed data (Herzog et al., 2007; Kenny & McCoach, 2003). This means that (1) the actual rate of Type I error is higher than the nominal level set by the researcher, such as .05; and (2) the chi-square test will reject too many true models. In Monte Carlo simulations, Moshagen (2012) reported evidence that the model size effect is more a function of the size of the covariance matrix, or the number of observed variables, than of the total number of free parameters, dfM, or type of model (e.g., path vs. measurement models) in sample sizes of N = 200. Empirical rates for the rejection of true factor analysis models were generally > .90 for models with 60 or more observed variables.

Bigger correlations among observed variables generally lead to higher values of chiML for incorrect models. This happens because larger correlations allow a greater potential for discrepancies between observed and predicted correlations (and covariances, too). Analyzing variables with relatively high proportions of unique variance—which could be due to score unreliability—generally results in loss of statistical power. Thus, the chi-square test could potentially "reward" the selection of measures with poor psychometrics because low power in this case favors the model. This is the reliability paradox: Imprecise factor measurement supports the retention of misspecified models, but use of more precise indicators is potentially punished because the model is easier to reject (Hancock & Mueller, 2011; Heene et al., 2011). Grewal et al. (2004) noted that inflation of Type II error (the exact-fit hypothesis is retained for false models) may be especially severe when

1. Absolute correlations between exogenous variables exceed .90.
2. Reliability coefficients are < .70.
3. Proportions of explained variance for individual endogenous variables are relatively low, such as R2 < .25.
4. The ratio of sample size to the number of free parameters is < 3:1.

Mooijaart and Satorra (2009) reported that the model chi-square test is relatively insensitive to


interaction misspecification; that is, the test is generally unable to detect the presence of true interaction effects between pairs of exogenous variables. This happens because central chi-square sampling distributions for chiML are not necessarily distorted when severe interaction misspecification occurs. Thus, models with no interactive effects could appear to fit the data based on the model chi-square test even when there are strong interactive effects in the population. The insensitivity to interactive effects also holds for corrected chi-squares in the robust ML method. Presented in Topic Box 10.1 is a resumé of strengths and limitations of the model chi-square test.

Normed Chi-Square

A brief mention of a statistic known as the normed chi-square (NC) is needed mainly to discourage you from ever using it. In an attempt to reduce the sensitivity of the model chi-square to sample size, some researchers in the past divided this statistic by its expected value, such as chiML/dfM, which reduced the value of this ratio for dfM > 1 compared with chiML. There are three problems with the NC:

1. chiML is affected by N only for false models.
2. dfM has nothing to do with N.
3. There were never clear-cut guidelines about maximum "acceptable" values of the NC (e.g., < 2.0? < 3.0?).

Because there is little statistical or logical foundation for the NC, it should have no role in model evaluation.

SCALED CHI-SQUARES AND ROBUST STANDARD ERRORS FOR NONNORMAL DISTRIBUTIONS

Recall that (1) parameter estimates are the same in default ML and robust ML, but (2) robust ML methods generate scaled model chi-squares and robust standard errors that correct for nonnormality (Chapter 9). An early method by Satorra and Bentler (1988, 1994) that requires complete data generates the Satorra–Bentler scaled chi-square by applying a scaling correction factor, c, to the unscaled model chi-square. The value of c reflects average multivariate kurtosis in the raw data. The specific relation is

chiSB = chiML / c    (10.7)

Distributions of chiSB over random samples only approximate central chi-square distributions but have asymptotically correct means. A different mean-adjusted chi-square by Asparouhov and Muthén (2005) is not based on chiML. Instead, in very large samples their scaled chi-square equals Yuan and Bentler's (2000) T2*, which accommodates nonnormal or missing data. The degrees of freedom for both chiSB and T2* are dfM.

Mean- and variance-adjusted chi-squares have different scaling factors, and over random samples they generally follow central chi-square distributions with asymptotically correct means and variances. They require more computing resources than methods for generating mean-adjusted chi-squares, but over large samples mean- and variance-adjusted chi-squares should be more accurate (Asparouhov & Muthén, 2013; Maydeu-Olivares, 2017b). The degrees of freedom for the mean- and variance-adjusted chi-square generated in the Asparouhov and Muthén (2010) method are dfM. Satorra and Bentler (1994) described an alternative method where the rescaled chi-square has an estimated degrees of freedom value that may not be an integer (e.g., 15.752). In simulation studies where population continuous variables were categorized into 2–7 categories, Savalei and Rhemtulla (2013) reported relatively small differences (< 1%) in performance between the method that does not require an estimated df (Asparouhov & Muthén, 2010) and the method that estimates df (Satorra & Bentler, 1994).

Described next are options for the robust ML method in lavaan. Most of these methods are also available in Mplus—see Maydeu-Olivares (2017b) and Muthén and Muthén (1998–2017). There is also the option in lavaan

mimic = "Mplus"

that reproduces results from Mplus. Syntax keywords for other methods in lavaan are described next:

1. Option "MLM" is for complete data sets only and generates the mean-adjusted Satorra–Bentler scaled chi-square.
2. Option "MLR" can be applied in complete or incomplete data sets, and it generates a mean-adjusted chi-square based on the Yuan–Bentler T2*


TOPIC BOX 10.1

Resumé of the Chi-Square Test

Strengths of the model chi-square test are summarized next:

1. It is a statistically principled argument, one based on sampling theory, the population inference model, and a frequentist view of probability, all of which are well established and generally familiar to many researchers.
2. Although rarely seen in SEM studies, it is possible to specify criterion levels of statistical significance using methods that balance the relative costs of Type I versus Type II error. For example, if testing at p < .55 gives the desired balance of error for the exact-fit test, then it should replace the "default" of p < .05—see Mudge et al. (2012) and Aguinis et al. (2010) for more information.
3. The chi-square test may be especially useful in "typical" sample sizes in SEM studies, say, N = 200–300. Such sample sizes are not at all "large," especially for complex models with many variables. But a failed exact-fit test, especially if the power of that test is relatively low, could signal the presence of serious model–data discrepancy at the level of local fit.

Listed next are limitations of the chi-square test:

1. While introducing the chi-square test, Jöreskog (1969) himself mentioned a drawback: If the interest in factor analysis is mainly the estimation of major factors that explain most of the observed variation, then the presence of minor factors, over which the researcher may have little or no control, could cause the model to fail the significance test. Such models could still be of practical value, and the criteria for evaluating the utility of imperfect models depend at least as much on rational considerations as on statistical results.
2. The basic assumptions of the chi-square test are implausible in many, and probably most, studies: Few researchers analyze correct models over truly random samples. If the basic assumptions of a significance test are untenable, then p values from that test may be untrustworthy (Cumming & Calin-Jageman, 2017; Kline, 2013a). Thus, making hairsplitting distinctions among continuous p values subject to error is hard to defend. For example, observing p = .04 versus p = .06 should not encourage researchers to make very different claims about the model even when testing at the .05 level (Hayduk, 2014).
3. It is possible—but not guaranteed—in very large samples that failing the chi-square test signals departures from perfect fit that would be considered inconsequential or trivial, but only careful inspection of the residuals will address this question.
4. The outcome of the chi-square test is the binary decision to retain or reject the exact-fit hypothesis, nothing more. For reasons already explained, these two outcomes do not directly translate to the decision to, respectively, retain or reject the model. Indeed, there is no global fit statistic of any kind in SEM that can be reliably interpreted or applied in this way.


statistic. Because this method accommodates missing data, it is the most flexible option listed here.

3. Option "MLMV" is for complete data sets only and computes a mean- and variance-adjusted scaled chi-square.
4. Option "MLMVS" generates a mean- and variance-adjusted chi-square with a correction for heteroscedasticity by Satterthwaite (1941). Model degrees of freedom are estimated, and the method is for complete data sets.

The default information matrix in lavaan for calculating standard errors in robust ML methods is expected, but the observed information matrix is used when there are missing data (i.e., option "MLR"). The user can override the default for incomplete data and specify the use of the expected matrix.² There are also separate options in lavaan to specify methods for calculating robust standard errors. For example, the command "robust.huber.white" specifies robust standard errors corrected for heterogeneity of the type generated in the "MLR" estimator, while "robust.sem" specifies "classical" robust standard errors without such corrections (options "MLM," "MLMV," and "MLMVS"). Researchers should avoid "shopping" for the combination of scaled chi-squares and robust standard errors that best supports their hypotheses. If the results vary appreciably over different options, then that fact should be disclosed.

MODEL FIT INDEXING

Model fit indexing involves the application of approximate fit indexes, which are not significance tests, so there is no inherent binary decision about whether to retain or reject a null hypothesis, just as there is generally no demarcation of the limits of sampling error (Hayduk, 2014). Instead, these indexes are intended as continuous measures of model–data correspondence. Some are scaled as badness-of-fit statistics where, just like the model chi-square, higher values indicate worse fit, but others are goodness-of-fit statistics where higher values indicate better fit. Values of some goodness-of-fit indexes are more or less standardized so that their range is 0–1.0, where a value of 1.0 indicates the best result.

Unlike the model chi-square test, there is no single, unifying, or coherent theoretical framework for applying or interpreting values from approximate fit indexes. Thus, model fit indexing is more like what Little (2013) called the modeling school, which deals with the evaluation of entire statistical models in a context where the decision rules are fuzzier and less clear-cut. This approach can provide flexibility because not all statistical models are alike and models must be adapted to different kinds of research questions. Consequently, there is greater ambiguity about rules for evaluating statistical models. As we will see, though, flexibility in model fit indexing has not always translated to good practice.

Another matter is the philosophical question of whether perfection (i.e., the exact-fit hypothesis) is a reasonable standard for statistical models, which are probably all wrong at least to some degree; that is, models are generally imperfect approximations that help researchers to structure their thinking about the target phenomenon. If that approximation is too coarse, the model will be rejected, but an overly complex model that closely mirrors the phenomenon of interest is also of little scientific value. Box (1976) elegantly said

    Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary . . . [the researcher] should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models [emphasis added] is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity. (p. 792)

Box (1976) went on to say

    Since all models are wrong the scientist must be alert to what is importantly wrong [emphasis added]. It is inappropriate to be concerned about mice when there are tigers abroad. (p. 792)

In his comments on Box (1976), Hayduk (2014) objected to the blanket assertion that "all models are wrong" because no one will ever see all possible models in all conceivable research situations. Hayduk (2014) also noted that "tigers" in SEM correspond to model misspecification while emphasizing the need to detect what is importantly wrong with our models.

² Maydeu-Olivares (2017b) reported that standard errors in the "MLR" method of Mplus were generally more accurate when based on observed information except when sample average absolute correlation residuals exceeded .08 or so.

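Before moving on, a consolidated sketch of the robust ML options described above. The model syntax and data frame dat are the hypothetical objects from the earlier sketches; estimator, missing, and se are the relevant lavaan arguments:

fit_mlm <- cfa(model, data = dat, estimator = "MLM")   # Satorra-Bentler scaled chi-square
fit_mlr <- cfa(model, data = dat, estimator = "MLR",
               missing = "fiml")                       # Yuan-Bentler T2*; allows missing data

# Scaled chi-square and the scaling correction factor c from Equation 10.7:
fitMeasures(fit_mlm, c("chisq", "chisq.scaled", "chisq.scaling.factor"))

# Robust standard errors can also be requested directly:
fit_hw <- cfa(model, data = dat, se = "robust.huber.white")  # sandwich-type SEs, as in "MLR"
parameterEstimates(fit_hw)[, c("lhs", "op", "rhs", "est", "se")]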

Brief History of Approximate Fit Indexes

The first approximate fit indexes were introduced in LISREL V by Jöreskog and Sörbom (1981). Among them are the goodness-of-fit index (GFI) and the adjusted goodness-of-fit index (AGFI). As their names suggest, both indexes are goodness-of-fit statistics where higher values indicate better fit, and they estimate the proportion of variance in the sample covariance matrix explained by the model-implied (predicted) covariance matrix. The GFI is analogous to R2 in multiple regression analysis for a single criterion. The AGFI is a correction to the GFI based on dfM and the number of observed variables. For models where dfM ≥ 1, AGFI < GFI, and the difference between their values becomes larger as dfM decreases (i.e., the "penalty" is greater for models with fewer degrees of freedom). Thus, the AGFI is analogous to an adjusted R2 that controls for N and the number of predictor variables in regression. The theoretical range for both the GFI and AGFI is 0–1.0, where the result 1.0 indicates perfect model fit, but their values can be < 0 for models with extremely poor fit (West et al., 2012).

An anecdote from Sörbom (2001) about a 1985 workshop with Karl Jöreskog provides some context for the approximate fit indexes just described:

    We had just added GFI and AGFI to the program. In his lecture Karl would say that the chi-square is really all you need. One participant then asked, "Why have you then added GFI?" Whereupon Karl answered, "Well, users threaten us saying they would stop using LISREL if it always produces such large chi-squares. So we had to invent something to make people happy. GFI serves that purpose." (p. 10)

A less generous interpretation of Sörbom's (2001) story is that the GFI and AGFI were added to mollify researchers (i.e., LISREL's customers) whose factor models routinely failed the chi-square test and who were reluctant to report those failures (Hayduk, 2014). But some of those "failures" could have happened in very large samples, so the same issues raised earlier about the meaning of the chi-square test in this context are relevant. I think there was also legitimate academic interest in developing continuous measures of model–data discrepancy as adjuncts to the chi-square test in SEM. There are similar justifications in more traditional statistical analyses for estimating effect size as quantitative magnitudes of target phenomena that could supplement or even replace binary significance testing (Kelley & Preacher, 2012).

The addition of the GFI and AGFI to LISREL coincided with over three decades of extensive research and publications about approximate fit indexes in SEM. Literally dozens, if not hundreds, of approximate fit indexes have been developed—see Bentler and Bonett (1980), Hu and Bentler (1995), Sun (2005), West et al. (2012), and West et al. (2023) for summaries over different points in time. There are also contemporary reports of new fit indexes. An example is Falke et al. (2020), who developed an estimation accuracy fit index (EAFI) that takes into account both global model fit and the precision of factor measurement in latent variable models. Another is Gomer et al. (2019), who described effect size measures for SEM that are generalizations of standardized mean difference (d) statistics—that is, ratios of the expected differences in the ML fit function under null versus alternative hypotheses over the standard deviations of those functions—that are relatively insensitive to sample size.

Types of Approximate Fit Indexes

Although neither exhaustive nor mutually exclusive, perhaps most approximate fit statistics can be classified into the types listed next—see Tanaka (1993) for descriptions of other dimensions along which fit statistics can vary:

1. Absolute fit indexes, such as the GFI, measure how well the model explains the data with no other point of reference.
2. Parsimony-adjusted indexes compare dfM to the maximum possible number of degrees of freedom available in the data (Mulaik, 2009b). As the ratio of the two quantities just stated approaches 1.0, there is greater relative parsimony in using the data to estimate model parameters. Although the AGFI features a "penalty" for model complexity, it is not a parsimony-adjusted fit statistic as just defined.
3. Incremental (relative, comparative) fit indexes directly compare the fit of the researcher's model to that of a baseline model, which is typically an independence (null) model that assumes covariances of zero among observed exogenous or endogenous variables. The researcher can choose a


different baseline model, although hand calculation of the index may be required if the desired baseline model differs from the computer default baseline model.

4. Noncentrality fit indexes estimate the degree to which the exact-fit hypothesis is false, given the model and data, and these quantities approximate parameters in noncentral chi-square distributions that also describe sampling distributions for approximate fit indexes of this type.
5. Predictive fit indexes, also called information-theoretic fit indexes, are derived from information theory, and they estimate model fit in hypothetical replication samples of the same size that are randomly drawn from the same population as the original sample. They are used in SEM mainly to compare alternative models based on the same variables and fitted to the same data, but where those models are not hierarchically related; that is, one model is not simply a subset of the other. Predictive fit indexes are covered in the next chapter.

Not all approximate fit indexes have withstood the test of time. For example, parsimony-adjusted indexes never really "caught on" among applied researchers, so they remain relatively obscure. Other fit indexes were found wanting in computer simulation studies of their statistical properties. This includes both the GFI and AGFI, which are affected by sample size and the number of indicators in factor analysis models, among other problems (Sharma et al., 2005).

Modern SEM computer tools vary widely in the number of approximate fit indexes printed in the program output. Some programs, such as Amos and LISREL, print relatively long lists of approximate fit indexes (> 12). But other computer tools, such as lavaan and Mplus, print values of just a few approximate fit indexes, about 4–5 each. The availability of so many approximate fit indexes in computer output risks cherry-picking, or the selective reporting of values for those indexes that favor the researcher's model. A way to avoid cherry-picking is to report just a core, minimalist set of indexes while also thoroughly examining the residuals.

Recommended Core Set

Among various approximate fit indexes, I think that a core set consists of only three that are printed by most SEM computer tools. They were selected for review next because

1. They are reported in many published SEM studies.
2. Reviewers of submissions to journals may inquire about them, if they are not reported in manuscripts.
3. They are all standardized in that their scales are not based on those of the observed or latent variables.
4. At least one of them (RMSEA) has a broader statistical rationale and interpretive framework for interval estimation, hypothesis testing, and planning for sample size.

But there is an important caveat: Researchers should not blindly apply thresholds or cutting points, either static or dynamic, to values of approximate fit indexes that supposedly differentiate between models with "good" fit to the data versus models with poor fit. Such thresholds for approximate fit indexes, although seen often in the literature, are generally invalid; that is, they do not work for all models and data sets, and their misuse can lead to bad decisions, especially if the residuals are ignored. These ideas are elaborated later in this chapter.

The core set of three approximate fit indexes is listed next:

1. Steiger–Lind Root Mean Square Error of Approximation (RMSEA) (Steiger, 1990) and its 90% confidence interval. The RMSEA is an absolute fit index and a noncentrality fit index. It features a penalty for model complexity, but it is not a parsimony-adjusted index. The RMSEA is scaled as a badness-of-fit index where zero is the best result; there is no theoretical maximum value. Under certain conditions, distributions of the RMSEA follow noncentral chi-square distributions.
2. Bentler Comparative Fit Index (CFI) (Bentler, 1990). The CFI is an incremental fit index and a noncentrality fit index. It is also a goodness-of-fit statistic scaled over the interval 0–1.0, where 1.0 is the best result. There is no penalty for model complexity. The computer default baseline model against which the researcher's model is compared can be respecified, if a different standard is desired.
3. Standardized Root Mean Square Residual (SRMR) (Jöreskog & Sörbom, 1981). The SRMR is an absolute fit index and a badness-of-fit statistic


that is scaled in the metric of the correlation residuals. A result of zero indicates no difference between observed and predicted correlations (i.e., perfect fit).
observed and predicted correlations (i.e., perfect Assuming multivariate normality, a true model, and
fit), so it is a badness-of-fit statistic. large random samples, the sampling distribution for
chiML is central c2 (df M). If df M = 5, for example, the
The RMSEA and CFI include the model chi-square sampling distribution is c2 (5), which is depicted in Fig-
and its degrees of freedom in their formulas. This ure 10.1 with a solid line. If the model is false, however,
means that they share the distributional assumptions of then chiML follows the noncentral distribution
their corresponding test statistic, and if those assump-
tions are untenable, then values of both the approximate c2 (df M, d) (10.8)
fit index and the corresponding test statistic (and its p
value) may be inaccurate. Both indexes were originally where (1) the expected value is df M + d, and (2) d (low-
defined for continuous outcomes with normal distribu- ercase Greek letter delta) is the unnormalized (raw)
tions analyzed with default ML; that is, the test statistic noncentrality parameter that indicates the degree to
is chiML. If the data are appreciably nonnormal, though, which the exact-fit hypothesis is false. If d = 0, then the
then values of chiML, RMSEA, CFI can all be severely exact-fit hypothesis is true and the distribution is cen-
distorted. tral c2 (df M), but d > 0 means that the exact-fit hypoth-
Some SEM computer tools perform ad hoc cor- esis is false.
rections for nonnormality that replace chiML with a An estimator of d that avoids negative values is
rescaled chi-square, such as chiSB, when the RMSEA
and CFI are computed. Although intuitive, such ad d̂ = max (chiML – df M, 0) (10.9)
hoc corrections may estimate different parameters
compared with the original versions based on chiML. In other words, if chiML ≤ df M, its expected value under
Nonnormality-corrected versions of the RMSEA and the exact-fit hypothesis, then d̂ = 0; otherwise, d̂ equals
CFI that estimate the same parameters as their coun- the amount by which chiML exceeds df M. For example,
terparts based on chiML are described by, respectively, if
Brosseau-Liard et al. (2012) and Brosseau-Liard and
Savalei (2014). chiML (5) = 11.107

[Figure 10.1 appears here: two probability density curves plotted over chi-square values of 0–30.]

FIGURE 10.1. Distributions of central and noncentral χ² for 5 degrees of freedom, where the noncentrality parameter (δ) equals 0 for central χ² and δ = 6.107 for noncentral χ².


then δ̂ = 11.107 − 5, or 6.107. If we treat this value as the population raw noncentrality parameter (δ), then chiML is distributed over random samples as χ²(5, 6.107), which is depicted in Figure 10.1 with the dashed line.
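Both curves in Figure 10.1 come directly from base R's noncentral chi-square density—a sketch:

x <- seq(0, 30, by = .1)
plot(x, dchisq(x, df = 5), type = "l",
     xlab = "Chi-square", ylab = "Probability density")  # central, delta = 0
lines(x, dchisq(x, df = 5, ncp = 6.107), lty = 2)        # noncentral, delta = 6.107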
The normalized noncentrality parameter is less affected by sample size:

δ̂nor = δ̂ / (N − 1)    (10.10)

Another interpretation is that δ̂nor is part of a function called the population discrepancy that estimates the error of approximation, or the difference between the sample predicted covariance matrix and the population covariance matrix when the model is fitted to the population covariance matrix. Any discrepancy reflects deficiencies in the model that have nothing to do with sampling error (Mulaik, 2009b). In contrast, the error of estimation involves the differences between sample and population estimates of the model parameters. It reflects sampling error because data matrices vary over samples. The sum of errors of approximation and errors of estimation contributes to the overall discrepancy between the population covariance matrix and the sample model-implied covariance matrix (see Equations 10.3–10.4 for the chi-square test).

Dividing the normalized noncentrality parameter in Equation 10.10 by dfM as a correction for model complexity yields an estimate of the normalized noncentrality parameter for each model degree of freedom. Note that the effect of correcting for dfM diminishes as N becomes increasingly larger (Mulaik, 2009b). Next, taking the square root of the ratio just described scales the result in the metric of the correlation residuals and gives us the final composition for the RMSEA, designated next as ε̂ (lowercase Greek letter epsilon):

ε̂ = √(δ̂nor / dfM) = √[δ̂ / (dfM (N − 1))] = √[max (chiML − dfM, 0) / (dfM (N − 1))]    (10.11)

Equation 10.11 is not an unbiased estimator of ε, the population parameter, due to the restriction that ε̂ ≥ 0, but it is an optimal estimator (Steiger, 2000). Also, bias in estimating ε from ε̂ decreases with sample size. Exercise 1 asks you to compare the meaning of the best result for the RMSEA (ε̂ = 0) with that of the corresponding outcome chiML = 0 for the same model and data.

The statistic ε̂ as a point estimate of ε is subject to sampling error. The interval estimate for the parameter ε is the 90% confidence interval computed in noncentral chi-square distributions that takes the form

[ε̂L, ε̂U]    (10.12)

where ε̂L and ε̂U are, respectively, the lower and upper bounds. If ε̂ = 0, then ε̂L = 0 and the whole interval is a one-sided confidence interval where ε̂U > ε̂. This explains why the confidence level is 90% instead of the more typical 95%, the conventional level for two-sided confidence intervals. If ε̂ > 0, then ε̂L ≥ 0 and the value of ε̂ does not typically fall at the exact center of the confidence interval. Over random samples where N > 200 and for models with continuous outcomes that are not severely misspecified, 90% confidence intervals for ε as just described are reasonably accurate (Curran et al., 2003).

Browne and Cudeck (1993) suggested that ε̂ ≤ .05 is a favorable result, but results of later computer simulation studies by Chen et al. (2008) indicated little support for a universal threshold of .05—or any other value—regardless of whether ε̂ is used alone or jointly with its 90% confidence interval. Browne and Cudeck (1993) also suggested that ε̂ ≥ .10 may indicate poor model fit, but there is no guarantee. From the perspective of interval estimation, it would make more sense to apply this particular heuristic to ε̂U, the upper bound of the confidence interval, than to ε̂, the point estimate. This is because the whole confidence interval establishes limits on the amount of sampling error associated with ε̂, and ε̂U provides a more conservative result (it less favors the model) because ε̂U > ε̂. Options for significance testing based on ε̂ and its 90% confidence interval are described in Appendix 10.A. Also described there is a method for equivalence testing of structural equation models based on the RMSEA and CFI as an alternative to the standard chi-square (exact-fit) test.

Other characteristics of the RMSEA are summarized next:

1. Interpretation of ε̂, ε̂L, and ε̂U relative to specific fixed thresholds requires that the RMSEA fit statistic follows noncentral chi-square distributions over large random samples. Olsson et al. (2004) found in computer simulation studies that empirical distributions of ε̂ for smaller models with relatively less specification error generally followed noncentral chi-square distributions; otherwise, the empirical distributions did not generally
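Equation 10.11 is simple to compute directly. A base R sketch with the values from the running example (N = 200 is assumed for illustration):

rmsea_hat <- function(chisq, df, n) {
  sqrt(max(chisq - df, 0) / (df * (n - 1)))
}
rmsea_hat(chisq = 11.107, df = 5, n = 200)  # about .078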


match up with noncentral distributions, especially for models with greater specification error. These results and others (Chen et al., 2008; Curran et al., 2003; Yuan, 2005; Yuan et al., 2007) question the generality of universal thresholds for the RMSEA.

2. Breivik and Olsson (2001) found in Monte Carlo studies that the RMSEA tends to impose a harsher penalty for complexity on smaller models with relatively few variables. This is because smaller models tend to have fewer degrees of freedom, but larger models have more "room" for higher dfM values. Results of both theoretical analysis and Monte Carlo simulations by Kenny et al. (2015) indicated that the RMSEA is not very accurate for models with only a few degrees of freedom, such as dfM = 1, especially when the sample size is not large. Specifically, RMSEA values were generally inflated, which could result in the overrejection of correct models. They recommended not calculating the RMSEA for models with very low df values in small samples. Instead, researchers should attempt, if possible, to identify missing model parameters (i.e., fixed to zero in the original models) and directly estimate them—see Kenny et al. (2015) for examples.

3. Nevitt and Hancock (2000) evaluated in computer simulation studies the performance of robust forms of the RMSEA corrected for nonnormality, one of which is based on the Satorra–Bentler scaled chi-square. This population-corrected robust RMSEA generally outperformed the uncorrected version (Equation 10.11), which tends to be inflated under conditions of nonnormality. Brosseau-Liard et al. (2012) found that a different robust index, one by Li and Bentler (2006) called the sample-corrected robust RMSEA that also generates adjusted confidence intervals, better reduced bias due to nonnormality in smaller samples than the population-corrected version. Robust versions of the RMSEA, including the Li–Bentler statistic just described, are all printed by lavaan for robust ML methods.

4. Methods to estimate the statistical power of the various null hypotheses described in Appendix 10.A based on the RMSEA are elaborated on later in this chapter. These methods can also generate the minimum sample sizes required to obtain target levels of statistical power (e.g., what N is needed for power ≥ .90?). An alternative to power analysis in planning for sample size is accuracy in parameter estimation (AIPE)—also called precision for planning—which involves obtaining narrow (precise) confidence intervals (Rothman & Greenland, 2018). In the AIPE method by Kelley and Lai (2011), the researcher specifies ε, the population RMSEA, and a sufficiently narrow confidence interval for sample estimates of this parameter. Next, the computer generates the minimum sample size required for the specified degree of precision. Their AIPE method for the RMSEA is also described later in this chapter.

CFI

The CFI measures the proportionate reduction in the raw noncentrality parameter for the researcher's model, or

δ̂M = max (chiML − dfM, 0)    (10.13)

compared with the baseline model, or

δ̂B = max (chiB − dfB, 0)    (10.14)

where chiB is the default ML chi-square for the baseline model with degrees of freedom equal to dfB. The equation is

CFI = (δ̂B − δ̂M) / δ̂B = 1 − δ̂M / δ̂B    (10.15)

So defined, the range of the CFI is 0–1.0. The best result is CFI = 1.0, which happens whenever the chi-square for the researcher's model does not exceed its expected value, or chiML ≤ dfM. The result CFI = .90 says that the researcher's model reduces the raw noncentrality parameter by 90% compared with the baseline model when both are fitted to the same data.
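A sketch of Equation 10.15 in base R; the inputs could come from, say, fitMeasures(fit, c("chisq", "df", "baseline.chisq", "baseline.df")), and the numeric values here are hypothetical:

cfi_hat <- function(chisq_m, df_m, chisq_b, df_b) {
  d_m <- max(chisq_m - df_m, 0)  # noncentrality, researcher's model
  d_b <- max(chisq_b - df_b, 0)  # noncentrality, baseline model
  1 - d_m / d_b
}
cfi_hat(chisq_m = 11.107, df_m = 5, chisq_b = 120, df_b = 15)  # about .94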
The baseline model is usually an independence (null) model that assumes covariances of zero among all measured endogenous variables (and also with the exogenous variables). The null model in Mplus uses the sample covariances among the exogenous variables, but other computer programs, such as LISREL, fix the covariances between pairs of measured exogenous variables to zero, too (Jöreskog & Sörbom, 2018; Muthén & Muthén, 1998–2017). Check the definition of the baseline model in the documentation for your SEM computer tool. The value of the CFI and other incremental fit indexes depends in part on the independence model analyzed along with the researcher's model.

Independence models as just described are unrealistic in probably many studies, where the assumption of


zero covariances is implausible. As a result, Miles and Shevlin (2007) described the CFI and related indexes as answering the question, "How well is my model doing, compared with the worst model [i.e., the null model; emphasis added] that there is?" (p. 870). Some computer tools, such as lavaan, permit the user to define a baseline model that replaces the default baseline model in automatic calculations of chiB, dfB, and values of incremental fit indexes such as the CFI. Otherwise, calculation of the CFI for a user-defined baseline model is accomplished by (1) specifying that baseline model in program syntax, (2) fitting it to the data, and (3) computing the value of the CFI by hand using Equation 10.15.

The independence model is more difficult to define for models with mean structures. For example, an independence model where both covariances and means for measured variables are all fixed to zero may be extremely unrealistic. An alternative for outcomes that are not repeated measures variables allows the means to equal their observed (sample) values. For repeated measures, though, the null model should set means as equal (i.e., no change in level over time), such as when means from the same variable are fixed to equal the value at the first measurement. The same is true for the variances, which should be set as equal in a comparable way (i.e., no change in variability over time). The baseline model just described is the longitudinal independence model (Little et al., 2007). If the observed means or variances change appreciably over time, there is more potential information to be extracted by a researcher's model over the independence model as just defined; otherwise, there is less information to be recovered by the researcher's model (Little, 2013).

Hu and Bentler (1995) suggested that CFI ≥ .95 is a favorable result, a benchmark generally consistent with results from some later Monte Carlo studies, such as Hu and Bentler (1999). But results from even more recent simulation studies failed to support the generality of any specific cutoff for the CFI over variations in models and degrees of nonnormality in the data (Fan & Sivo, 2005; Yuan, 2005). Robust forms of the CFI for nonnormal data described by Brosseau-Liard and Savalei (2014) are printed by lavaan for robust ML methods.

Some researchers prefer the Tucker–Lewis Index (TLI)—an incremental fit index originally described by Tucker and Lewis (1973) for factor analysis models and then later generalized to covariance-based SEM by Bentler and Bonett (1980) and renamed the non-normed fit index (NNFI)—over the CFI. The TLI is also a parsimony-adjusted index that compares dfM to dfB, the maximum possible number of degrees of freedom available in the data as represented by the baseline model. An equation is

TLI = [(chiB / dfB) − (chiML / dfM)] / [(chiB / dfB) − 1]    (10.16)

where the quantity "1" is the expected value of χ²/df under the exact-fit hypothesis. Values of the TLI can potentially fall below zero if the baseline model (not the researcher's model) fits very well—which is not likely to happen in practice—or TLI values can exceed 1.0 if the researcher's model closely fits the data (West et al., 2012). Values of the TLI are relatively unaffected by sample size (Marsh & Balla, 1994). Kenny (2020) noted that (1) both the CFI and TLI are affected by correlation size; that is, higher average correlations among measured variables are associated with higher CFI and TLI values, and vice versa; and (2) values of the CFI and TLI are highly correlated, so only one of these two fit indexes should be reported.
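Equation 10.16 uses the same inputs as the CFI sketch above; again the numeric values are hypothetical:

tli_hat <- function(chisq_m, df_m, chisq_b, df_b) {
  ((chisq_b / df_b) - (chisq_m / df_m)) / ((chisq_b / df_b) - 1)
}
tli_hat(chisq_m = 11.107, df_m = 5, chisq_b = 120, df_b = 15)  # about .83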
SRMR

Neither the model chi-square nor its degrees of freedom contribute to the SRMR; instead, it is based solely on the residuals. It is a standardized version of the root mean square residual (RMR), which is the square root of the unweighted average squared covariance residual. Perfect model fit is indicated by RMR = 0, and increasingly higher values mean worse fit. But interpreting RMR > 0 can be tough unless all the observed variables have the same raw score metrics. The SRMR is computed as the square root of the average squared covariance residual standardized by the product of the standard deviations for the two corresponding measured variables. It approximates the square root of the average squared correlation residual, but the two statistics may not be exactly equal in the same analysis. If SRMR = 0, then the model has perfect fit. It is easier to interpret SRMR > 0 because correlation residuals are already standardized, so the SRMR has a common meaning regardless of the original (unstandardized) metrics. If SRMR = 1.0, then the correlation residuals are as large as the elements in the sample correlation matrix, which indicates extremely poor model fit (West et al., 2012). Pavlov et al. (2021, pp. 114, 127) described slight differences in the computation of the SRMR by Mplus compared with some other SEM computer programs, such as LISREL, and they described how to generate SRMR


values in Mplus that are directly comparable with those computed in LISREL.

A value of the SRMR < .08 is generally considered favorable (Hu & Bentler, 1999), but averages like the SRMR can hide appreciable variation among individual correlation residuals. Suppose that the correlation residuals range from .02 to .04 and that SRMR = .03, or less than .08. Here, the residuals are uniformly low and relatively close to zero in value. These results seem to favor the model. Now imagine that SRMR = .03, but the residuals range from −.12 to .18. Here it is obvious that some absolute correlation residuals are > .10, which could indicate more serious local fit problems. Thus, it is better in written reports to describe the correlation residuals—or, even better, present the whole matrix—rather than to rely solely on an average value such as the SRMR.
to be most sensitive to misspecified factor covariances.
Their combination rule or two-index strategy for
THRESHOLDS FOR APPROXIMATE concluding that there is a “relatively good fit” between
FIT INDEXES model and observed data based on these indexes was
CFI ≥ .95 and SRMR ≤ .08. This combination rule was
A natural question about continuous approximate fit not supported in later Monte Carlo studies by Fan and
indexes involves the range of values indicating “accept- Sivo (2005), who suggested that the original Hu and
able” or even “good” model fit. Well, unfortunately, there Bentler (1999) findings about the CFI and SRMR for
is no simple answer to this question because there is no factor analysis models were artifacts due to confound-
direct correspondence between values of approximate ing between types of misspecification and severity of
fit indexes and the seriousness or types of specification misspecification for simpler versus more complex mod-
error, just as for the chi-square test. Also, many interpre- els. Results of other simulation studies also do not sup-
tive guidelines or rules of thumb for the RMSEA, CFI, port the respective thresholds just listed (Yuan, 2005).
and SRMR, among other indexes, seen in the literature To be clear, Hu and Bentler (1999) never intended
or web pages about SEM are untrustworthy; that is, there their fixed thresholds for the RMSEA, CFI, SRMR,
is little evidence that they actually differentiate between and other approximate fit indexes to be treated as any-
model “good” versus “poor” fit to the data across the thing other than rules of thumb. One reason is that it
wide range of models and data analyzed in different dis- is impossible in Monte Carlo studies to evaluate the
ciplines. whole constellation of model and data analyzed in real
studies. It is also true that values of basically all global
fit statistics (including the model chi-square) can be
Fixed Thresholds
affected by a host of nuisance factors, such as sample
Most interpretive guidelines about fixed thresholds size, the number of indicators per factor models, and
stem from a series of computer simulation studies con- magnitudes of indicator-factor associations, that are
ducted in the 1980s and 1990s on the distributional unrelated to the actual extent of model misspecification
characteristics of approximate fit indexes under vary- (Beauducel & Wittman, 2005; Greiff & Heene, 2017;
ing data and model conditions. Gerbing and Ander- Yuan, 2005). The commonly used cutoff values are
son (1993) reviewed many of these early studies, and also generally insensitive to violations of assumptions
more recent examples include Hu and Bentler (1998) about uncorrelated errors in CFA models (Heene et al.,
and Marsh et al. (1996). The most influential early 2012). Researchers rarely analyze models known to be
simulation study is undoubtedly Hu and Bentler (1999), correct in the population, which is the starting point
who studied 3-factor, 15-indicator CFA models under in simulation studies. Fixed thresholds are associated
varying conditions of (1) sample size; (2) distributional with ML estimation for continuous outcomes, and there
characteristics (normal vs. nonnormal) for indicators, is evidence that they do not generalize to other methods

Pt2Kline5E.indd 170 3/22/2023 3:44:21 PM


Model Testing and Indexing 171
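Both pieces of information are easy to obtain in lavaan, so there is no reason to let the average stand in for the matrix. The sketch below assumes an already-fitted lavaan model object named fit (a placeholder name for this illustration, not one from the book's script files):

library(lavaan)

fitMeasures(fit, "srmr")            # the average that the report would cite
rc <- resid(fit, type = "cor")$cov  # the matrix of correlation residuals
rc                                  # report this matrix, or describe its pattern
range(rc)                           # two models with the same SRMR can differ here

A single call to lavResiduals(fit) returns the correlation residuals together with standardized versions that can flag individual local fit problems.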

THRESHOLDS FOR APPROXIMATE
FIT INDEXES

A natural question about continuous approximate fit indexes involves the range of values indicating "acceptable" or even "good" model fit. Well, unfortunately, there is no simple answer to this question because there is no direct correspondence between values of approximate fit indexes and the seriousness or types of specification error, just as for the chi-square test. Also, many interpretive guidelines or rules of thumb for the RMSEA, CFI, and SRMR, among other indexes, seen in the literature or web pages about SEM are untrustworthy; that is, there is little evidence that they actually differentiate between model "good" versus "poor" fit to the data across the wide range of models and data analyzed in different disciplines.

Fixed Thresholds

Most interpretive guidelines about fixed thresholds stem from a series of computer simulation studies conducted in the 1980s and 1990s on the distributional characteristics of approximate fit indexes under varying data and model conditions. Gerbing and Anderson (1993) reviewed many of these early studies, and more recent examples include Hu and Bentler (1998) and Marsh et al. (1996). The most influential early simulation study is undoubtedly Hu and Bentler (1999), who studied 3-factor, 15-indicator CFA models under varying conditions of (1) sample size; (2) distributional characteristics (normal vs. nonnormal) for indicators, factors, or errors; (3) whether some indicators measured one or two factors; and (4) types of specification errors in sample models. They determined cutoff points for approximate fit indexes that detected most incorrect models while "accepting" most correct models for individual approximate fit indexes and combinations of two such statistics.

Based on their simulation results, Hu and Bentler (1999) suggested RMSEA < .06 as a heuristic based on this single index, which is similar to the threshold described by Browne and Cudeck (1993) for the same fit statistic (≤ .05). Evidence that this threshold is not universal over variations in models and data was cited earlier. Hu and Bentler (1999) also suggested a heuristic for the CFI and SRMR as a set. Their rationale was that the CFI seemed to be most sensitive to misspecified factor loadings, whereas the SRMR seemed to be most sensitive to misspecified factor covariances. Their combination rule or two-index strategy for concluding that there is a "relatively good fit" between model and observed data based on these indexes was CFI ≥ .95 and SRMR ≤ .08. This combination rule was not supported in later Monte Carlo studies by Fan and Sivo (2005), who suggested that the original Hu and Bentler (1999) findings about the CFI and SRMR for factor analysis models were artifacts due to confounding between types of misspecification and severity of misspecification for simpler versus more complex models. Results of other simulation studies also do not support the respective thresholds just listed (Yuan, 2005).

To be clear, Hu and Bentler (1999) never intended their fixed thresholds for the RMSEA, CFI, SRMR, and other approximate fit indexes to be treated as anything other than rules of thumb. One reason is that it is impossible in Monte Carlo studies to evaluate the whole constellation of model and data analyzed in real studies. It is also true that values of basically all global fit statistics (including the model chi-square) can be affected by a host of nuisance factors, such as sample size, the number of indicators per factor, and magnitudes of indicator-factor associations, that are unrelated to the actual extent of model misspecification (Beauducel & Wittman, 2005; Greiff & Heene, 2017; Yuan, 2005). The commonly used cutoff values are also generally insensitive to violations of assumptions about uncorrelated errors in CFA models (Heene et al., 2012). Researchers rarely analyze models known to be correct in the population, which is the starting point in simulation studies. Fixed thresholds are associated with ML estimation for continuous outcomes, and there is evidence that they do not generalize to other methods or when analyzing ordinal outcomes (Nye & Drasgow, 2011; Xia & Yang, 2019).

Blind reliance on fixed thresholds for approximate fit indexes that supposedly indicate "good" model fit is regrettably widespread. Presented next is a hypothetical example of the kind of reporting that is both overly telegraphic and deficient; yet it is seen in far too many SEM studies:

    The final model had good fit to the data, χ²(20) = 45.605, p = .001, CFI = .964, SRMR = .075, and RMSEA = .057 (Hu & Bentler, 1999).

The parenthetical citation in this example highlights the fact that values of the RMSEA, CFI, and SRMR all fall on the "good" side of the Hu and Bentler (1999) rules. But there are serious problems here: Not only is the failed chi-square test ignored, but also the 90% confidence interval for the RMSEA is not reported and there is no mention of the residuals. There is also no explanation of how the decision to retain the model is related to the string of numerical values stated in the text for this example. That is, the reader is given no explanation about why model fit is deemed to be "good."

The hypothetical example just considered matches reality, too: Among 75 published SEM studies reviewed by Ropovik (2015), a model was retained in 97% of the studies, but out of these models, 80% failed the chi-square test. Most authors of these studies ignored the failed chi-square test or neglected to mention that fact at all, and about half of these authors relied on approximate fit indexes to justify retaining the model but with no explicit justification. Although universal agreement among SEM specialists is rare, I believe the consensus is that uncritical reliance on fixed thresholds is not up to standard.
Dynamic Thresholds

Dynamic (flexible) thresholds for CFA measurement models are intended to avoid the limitations of fixed thresholds for approximate fit indexes by tailoring cutoff points to more specific combinations of models and data. Such adjustments take into account the sample size, numbers of factors or indicators, magnitudes of factor loadings and error variances, and data nonnormality, all of which can distort values of approximate fit indexes (West et al., 2023). Dynamic thresholds are not intended to serve as the sole basis for retaining or rejecting models; instead, such thresholds estimate what the value of a fit index should be for a correctly specified model, given model and data characteristics in a particular study (Niemand & Mai, 2018).

The method by Niemand and Mai (2018) generates one-sided confidence intervals—lower-bound intervals for goodness-of-fit statistics, upper-bound intervals for badness-of-fit statistics—for approximate fit indexes over a range of correctly specified CFA models and sample sizes for normal or nonnormal data. The widths of these intervals can be varied over different levels of acceptable Type I error, and the lower or upper margins of the intervals define the flexible thresholds for specific fit statistics. For instance, in relatively small samples (N < 300) or for very large models, such as 100 indicators for 20 factors, flexible thresholds for the CFI are generally < .95, the Hu and Bentler (1999) fixed cutoff for the same index. Over the variations in models and data, the hit rates for flexible cutoff values were generally higher than for fixed cutoff values. There is a website where the researcher describes the CFA model, the degree of nonnormality, and the amount of accepted uncertainty.3 Tailored cutoff values are generated for the RMSEA, CFI, TLI, and SRMR.

A different method, also for CFA and other kinds of structural equation models, by McNeish and Wolf (2020) conducts Monte Carlo simulations for both the researcher's model with standardized parameter estimates and a misspecified version of that model with an additional parameter. Precision of factor measurement is estimated from the standardized factor loadings. Distributions of the RMSEA, CFI, and SRMR are generated for both models, and the algorithm identifies values of each index that (1) reject the researcher's model in at most 10% of generated samples, but (2) reject the misspecified model in at least 90% of them, if any such values exist. Otherwise, the model and data are insufficient for unambiguous discrimination between true and misspecified versions. There is a web-based Shiny application (interactive web site), Dynamic Model Fit, where the researcher submits a text file that specifies the sample size, CFA model, and the standardized estimates (Wolf & McNeish, 2020).4

It is too early to tell whether flexible thresholds for approximate fit indexes will have a lasting impact in SEM research. As of this writing, they were available for some, but not all, kinds of structural equation models. Although dynamic thresholds may have advantages over fixed thresholds, dynamic thresholds are not substitutes for carefully examining all information about model fit, including the model chi-square test and the residuals. That is, the same mistakes that have been made with fixed thresholds for approximate fit indexes should not be repeated with dynamic thresholds—see West et al. (2023) for additional discussion of these issues. Depaoli et al. (2023) described model fit evaluation from a Bayesian perspective in which theoretical posterior distributions for various fit statistics can be analyzed.

3 https://flexiblecutoffs.org/
4 https://www.dynamicfit.app/

RECOMMENDED APPROACH
TO FIT EVALUATION

The method outlined next is consistent with reporting standards for SEM (Appelbaum et al., 2018) and also calls on researchers to report more information about model fit than has been true in past experience:

1. If you use a simultaneous estimation method, report the model chi-square with its degrees of freedom and p value. There are some analyses, such as when outcome variables are dichotomous and path coefficients are estimated using methods for binary logistic or probit regression, where a model chi-square may be unavailable—see Muthén and Muthén (1998–2017, chap. 3) for examples—but these are exceptional cases.

2. If the model fails the exact-fit test, then (a) directly say so and, regardless of sample size, (b) tentatively reject the model. Next, (c) diagnose both the magnitude and possible sources of misfit (inspect local fit). The rationale is to detect statistically significant but slight model–data discrepancies that explain the failure. This is most likely to happen in a large sample. The initial decision to reject the model could be rescinded, but only based on local fit evidence along with explanation about why observed model–data discrepancies are actually inconsequential.

3. If the model passes the exact-fit test, you still have to inspect local fit. The rationale is to detect model–data discrepancies that are not statistically significant but still great enough to cast doubt on the model. This most likely occurs in a small sample. If evidence about local fit indicates appreciable discrepancies, then the model should be rejected even though it passed the chi-square test.

4. Report a matrix of residuals, such as correlation, standardized, or normalized residuals, in the body of the manuscript. If the model is so large that doing so would be unwieldy, then (a) provide such tables in the supplemental materials and (b) describe in the manuscript the pattern of residuals, such as the locations of larger residuals and their signs. Look for patterns that may be of diagnostic value in understanding how the model may be misspecified. Any report of the results without information about the residuals is deficient. Unfortunately, incomplete reporting in this area is the norm rather than the exception. For example, in our review of 144 published SEM studies in the area of organizational management, residuals were mentioned in only about 17% of these works (Zhang et al., 2021).

5. If you report values of approximate fit indexes, then include those for the minimal set described earlier in this chapter. But do not try to justify retaining the model by depending solely on thresholds, either fixed or dynamic, for such global fit statistics. This is especially true if the model failed the exact-fit test and the pattern of residuals suggests a specification error that is not trivial. (A brief lavaan sketch at the end of this section shows one way to assemble this information.)

6. If you respecify the initial model, explain the rationale for doing so. You should also explain the role that diagnostic statistics, such as residuals, played in the respecification. That is, point out the connections between the numerical results for the model, relevant theory, and modifications to the original model (Chapter 3). If you retain a respecified model that still fails the exact-fit test, then demonstrate that model–data discrepancies are truly slight; otherwise, you have neglected to show that there is no appreciable covariance evidence against the model.

7. Statistical evidence about model fit is important, but it is not the sole factor in deciding whether to retain a model. For example, the parameter estimates should make sense, given the research problem. Models with reasonable prospects for fitting data sets in future replications generated by the same causal processes should be preferred over models that fit a particular data set well. This is especially true for complex, overparameterized models that could potentially fit just about any arbitrary data. Such models are (a) less falsifiable than more parsimonious models and (b) less likely to generalize over variations in samples and settings (Preacher et al., 2013).

8. If a model is retained, then the researcher should explain why that model should be preferred over equivalent or near-equivalent versions that, respectively, explain the same data exactly as well or nearly as well. This step is much more logical than statistical, and it also involves describing what might be done in future research to differentiate between any serious competing models. Complete reporting about equivalent or near-equivalent models is rare; thus, conscientious readers can really distinguish their own SEM analyses by addressing this issue. The generation and assessment of equivalent versions of structural models are covered in the next chapter.

9. If no model is retained, then your skills as a scholar are needed to explain the implications for the theory tested in your analysis.

At the end of the day, regardless of whether or not you have retained a model, the real honor comes from following, to the best of your ability, a thorough evaluation process to its logical end. The poet Ralph Waldo Emerson put it this way: The reward of a thing well done is to have done it (Mikis, 2012, p. 294).
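One way to pull together the quantities that points 1, 4, and 5 call for is sketched below; it assumes a fitted lavaan object named fit (a placeholder, not an object from the book's script files):

library(lavaan)

# Point 1: exact-fit (chi-square) test with df and p value;
# Point 5: a minimal set of approximate fit indexes, including
# the RMSEA with the bounds of its 90% confidence interval
fitMeasures(fit, c("chisq", "df", "pvalue",
                   "rmsea", "rmsea.ci.lower", "rmsea.ci.upper",
                   "cfi", "srmr"))

# Point 4: residuals for the manuscript or its supplemental materials
lavResiduals(fit)    # correlation and standardized residuals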

Tips for Inspecting the Residuals

Although it is critical to report on the residuals, you should know that just as with outcomes from the chi-square test and values of approximate fit indexes, there is no dependable or trustworthy connection between the size of the residuals and the type or amount of model misspecification. For example, the degree of specification error indicated by a relatively small correlation residual may be slight, or it may be severe. One reason is that values of residuals and other diagnostic statistics, including modification indexes—defined in the next chapter—are themselves affected by misspecification. An analogy in medicine is a diagnostic test for a disease that is less accurate in patients who have that illness. A second reason is error propagation in global estimation, where misspecification in one part of the model distorts estimates in other parts of the model. A third is equivalent models, which have identical residuals but also contradictory patterns of causal effects. But we do not generally know in advance which parts of the model are incorrect, so it can be difficult to understand exactly what the residuals are telling us.

Inspecting the pattern of residuals can sometimes be helpful. Suppose that a pair of variables X and Y where rXY > 0 are connected by indirect causal pathways only. The residual for that pair is positive, which says that the model underpredicts their association. In this case, the hypothesis of no direct causal effect between X and Y may be cast in doubt, and a possible respecification is to add a direct causal effect between them. Another possibility consistent with the same positive residual is to specify a disturbance correlation, if both variables are endogenous. But just which type of effect to add to the model (causal vs. noncausal) or their directionalities (e.g., X causes Y versus Y causes X) are not things that residuals can tell us. Just as there is no magic in global fit statistics, there is also none in diagnostic statistics, at least none that would relieve researchers from the burden of having to think very carefully about respecification.

GLOBAL FIT STATISTICS
FOR THE DETAILED EXAMPLE

The previous chapters dealt with the ongoing analysis of the Roth et al. (1989) recursive path model of illness (Figure 8.1), and we reviewed default ML parameter estimates (Table 9.2) and residuals, including correlation, standardized, and normalized residuals (Table 9.4). The residuals indicated local fit problems, especially for the variables fitness and stress. Next, we consider evidence about global fit. One of the lessons of this example is that some, but not all, global fit statistics signal problems about model fit. Another is that through selective reporting of these results, a researcher could potentially justify retention of the model if the residuals are ignored, which I believe would be a poor decision in this case.

Listed for analysis 1 in Table 10.1 is the lavaan syntax file for fitting the Roth et al. (1989) path model in Figure 8.1 to the data in Table 4.3 for estimation in default ML. The output file includes parameter estimates, global fit statistics, residuals, and modification indexes.5 Reported in Table 10.2 are values of the global fit statistics described earlier. The model just fails the chi-square test at the .05 level, χ²ML(5) = 11.107, p = .049. This outcome for a global significance test of the whole model is consistent with problems apparent in the residuals. Exercise 2 asks you to reproduce the model chi-square from the log-likelihood values for H0 (researcher model) and H1 (unrestricted model) reported in the table.

5 Analysis 1 in Table 10.1 is the same as analysis 1 in Table 9.1.
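The essential structure of that syntax file is easy to convey. Here is a sketch of the kind of lavaan specification involved. The variable names follow Figure 8.1, but the object names (e.g., roth.cov for the summary data in Table 4.3) are placeholders, not necessarily the names used in the actual script roth-cov-ml.r:

library(lavaan)

roth.model <- '
  fitness ~ exercise           # Exercise -> Fitness
  stress  ~ hardy              # Hardy -> Stress
  illness ~ fitness + stress   # Fitness, Stress -> Illness
'
# roth.cov is assumed to hold the covariance matrix from Table 4.3
fit <- sem(roth.model, sample.cov = roth.cov, sample.nobs = 373)
summary(fit, fit.measures = TRUE)   # parameter estimates, global fit statistics
lavResiduals(fit)                   # residuals
modindices(fit)                     # modification indexes

By default, lavaan lets the measured exogenous variables (exercise and hardy) covary, consistent with Figure 10.2, where those two variables are connected by a covariance.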

TABLE 10.1. Analyses, Script Files, and Packages in R for Computing Global Fit Statistics, Power, and Precision for a Recursive Path Model of Illness

Analysis                                      Script files           R packages
1. Global fit statistics, residuals, and      roth-cov-ml.r          lavaan
   modification indexes
2. Power and precision estimates for          roth-power-precise.r   semTools
   whole model and individual parameters                             lavaan
                                                                     WebPower
                                                                     MBESS

Note. Output files have the same names except the extension is ".out."

TABLE 10.2. Values of Global Fit Statistics for a Recursive Path Model of Illness

N                        373
ln L0                    –9,429.689
ln L1                    –9,424.135

Model chi-square
χ²ML                     11.107, p = .049
dfM                      5

Approximate fit indexes
RMSEA (ε̂) [90% CI]       .057 [.003, .103]
CFI                      .961
SRMR                     .051

Baseline (independence) model
χ²B                      165.608, p < .001
dfB                      9

Note. The measured exogenous variables, exercise and hardy, are allowed to covary for the independence model.

Values of approximate fit indexes in Table 10.2 suggest a mixed picture. The value of the RMSEA is .057, which does not seem terrible, but the upper bound of its 90% confidence interval, .103, is high enough to warrant concern. Outcomes of significance tests based on the RMSEA are also mixed. For example, the model passes the close-fit test but fails both the not-close-fit test and also the poor-fit test—see Appendix 10.A. The tested path model reduces the raw noncentrality parameter by 96.1% compared with the baseline model (CFI = .961), which in lavaan allows the two exogenous variables, exercise and hardy, to covary, but all other variables are assumed to be independent. The value of the CFI in this analysis does not suggest a glaring problem. For Exercise 3, you are to reproduce the RMSEA and CFI values from other information in Table 10.2. Exercise 4 involves fitting the baseline model for this example to the data and reproducing the result χ²B(9) = 165.608 in the table.

Through the use of selective reporting and relying on fixed thresholds for approximate fit indexes, it could be possible to justify retaining the model in this example by (1) ignoring the failed chi-square test; (2) neglecting to inspect the residuals; and (3) touting RMSEA < .06, CFI > .95, and SRMR < .08 (i.e., thresholds from Hu & Bentler, 1999) while not reporting the unfavorable upper bound of the confidence interval based on the RMSEA, or .103. For reasons explained (i.e., the residuals in Table 9.4), though, I believe that a decision to retain the model in this example would be a mistake.
POWER AND PRECISION

This presentation emphasizes a priori (prospective) power, which is estimated before the data are collected. Some granting agencies require estimates of a priori power as part of the research plan, which makes sense: Why fund the proposal if there is little chance in significance testing of detecting the effect of interest (i.e., low power)? Results from power analysis can also provide a rationale for the proposed sample size in a grant application (i.e., power is high, given the planned N)—see National Institute of Health (2018) for examples.

In contrast, a retrospective (post hoc, observed) power analysis is conducted after the data are collected. This means that sample statistics are treated as though they were population parameters; that is, observed effect sizes are assumed to be population values. There are three problems: (1) Statistics merely estimate parameters, and those estimates are subject to many potential kinds of distortion (Anderson et al., 2017). (2) It is a common but false belief that higher observed power implies stronger evidence for null hypotheses that are not rejected (Hoenig & Heisey, 2001). (3) The method is more like an autopsy than a diagnostic procedure because if observed power is low, it is too late because the data are already collected (Kline, 2013a). For all these reasons, observed power is not considered further.

In SEM, a priori power is estimated by specifying in a computer tool (1) characteristics of the population model based on theory or results from prior empirical studies; (2) specific null and alternative hypotheses about individual parameters or the whole model; (3) a criterion level of statistical significance, α; and (4) a planned sample size. One variation specifies a target minimum level of power, such as ≥ .90, and the computer estimates the minimum sample size required to obtain target power. A total of three different methods are described next. The first two methods (Satorra–Saris, MacCallum–RMSEA) were originally described for models with continuous outcomes estimated with default ML, but the third method (Monte Carlo simulation) can be applied to models with categorical or continuous outcomes or estimated with default ML or other methods:

1. The Satorra–Saris method (Satorra & Saris, 1985) estimates the power of the likelihood ratio test for a single parameter. Suppose that a researcher believes that the population unstandardized direct effect of X on Y is 5.0. Using this and other a priori values of parameters in the researcher's model, the method then generates the model-implied population covariance matrix under the alternative hypothesis that corresponds to the full model with the parameter X → Y. Next, the reduced model without X → Y (it is fixed to zero) that corresponds to the null hypothesis is fitted to the predicted matrix. The statistical discrepancy between the null model without the target parameter and the data matrix predicted by the alternative model with that parameter estimates a noncentrality parameter for the model chi-square, which in turn is converted to the estimated power of the likelihood ratio test for the model parameter of interest. The method must be repeated for every individual model parameter for which an estimate of power is desired. A variation allows for restrictions on multiple model parameters in the null model (Saris & Satorra, 1993).

2. The MacCallum–RMSEA method, described by MacCallum et al. (1996) and later extended by Hancock and Freeman (2001), among others, is based on the population RMSEA and noncentral chi-square distributions for tests of the exact-fit hypothesis (ε0 = 0), the close-fit hypothesis (ε0 ≤ .05), the not-close-fit hypothesis (ε0 ≥ .05), or the poor-fit hypothesis (ε0 ≥ .10) for the whole model (Appendix 10.A). The analysis for any of the null hypotheses just listed is conducted by specifying N, α, dfM, and a suitable value for ε1 under the alternative hypothesis. For example, ε1 could be specified for the close-fit hypothesis as .08, which exceeds the fixed threshold value of .05 for "close fit" but is lower than the threshold for "poor fit." For the not-close-fit hypothesis, ε1 could be specified as .01, a value that represents even better approximate fit than ε1 = .05. Another option is to determine the minimum sample size needed to reach a target level of power, given α, dfM, ε0, and ε1. In contrast to the Satorra–Saris method, the researcher is not required to specify numerical values for model parameters in the MacCallum–RMSEA method. (A short semTools sketch of this method appears after this list.)

A limitation of the MacCallum–RMSEA method is the reliance on fixed threshold values for ε0 that may not be very meaningful. (Recall the earlier discussion about limitations of fixed thresholds for sample RMSEA statistics.) For example, Kim (2005) studied a total of four approximate fit indexes, including the RMSEA and CFI, in relation to power estimation and determination of sample size requirements for target levels of power. Kim (2005) found that estimates of power and minimum sample sizes varied as a function of choice of the index, number of observed variables and dfM, and magnitude of covariation among the variables. This result is not surprising given that (1) different approximate fit indexes reflect different aspects of model–data correspondence, and (2) little direct correspondence exists between values of various fit statistics and dfM or types of model misspecification. As Kim (2005) notes, a value of .95 for the CFI does not necessarily indicate the same misspecification as a value of .05 for the RMSEA.

3. The Monte Carlo simulation method is a modern and flexible alternative that assumes neither continuous outcomes nor default ML estimation (i.e., normality is not required except if the method is default ML). It estimates the proportion of generated samples where the null hypothesis that some parameter of interest is zero is correctly rejected. The method can also be applied to estimate the power of significance tests for the whole model based on various global fit statistics; that is, different criteria can be simultaneously analyzed.

As in the Satorra–Saris method, the researcher using the Monte Carlo method specifies the model with values of its parameters based on theory or prior research. Next, the computer generates simulated random samples, given the model and its parameters as specified. Finally, the model is estimated in each of the generated samples, and results for global fit statistics and parameter estimates are aggregated and saved to a file. The method can be repeated under different scenarios concerning varying sample sizes, parameters, or α, among other possibilities (Bandalos & Leite, 2013; Hancock & French, 2013; Leite et al., 2023). There are also special Monte Carlo methods for estimating power in models with multiple indirect effects (Thoemmes et al., 2010).
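As promised, here is a minimal sketch of the MacCallum–RMSEA method using two semTools functions. The inputs mirror the first row of Table 10.3 (exact-fit hypothesis for the detailed example), but treat the expected outputs as approximate and verify the argument names against your installed semTools version:

library(semTools)

# Power to reject exact fit (e0 = 0) when e1 = .05, given dfM = 5, N = 373
findRMSEApower(rmsea0 = 0, rmseaA = .05, df = 5, n = 373, alpha = .05)

# Minimum N for power of at least .90 for the same pair of hypotheses
findRMSEAsamplesize(rmsea0 = 0, rmseaA = .05, df = 5, power = .90, alpha = .05)

If these calls are set up the way analysis 2 was, the first should return a value near .338 and the second a value near 1,319 (see Table 10.3).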

There are increasingly more computer tools for power analysis in SEM. For example, some SEM software, such as Mplus and LISREL, have built-in capabilities for Monte Carlo simulation (Myers et al., 2011). Wang and Rhemtulla (2021) described a Shiny app, pwrSEM, for estimating power for tests of single parameters through simulations of models where common factors have multiple indicators.6 The MacCallum–RMSEA method is implemented in the semTools package for R (Jorgensen et al., 2022), and there are freely accessible calculating web pages for the method.7 The R package WebPower (Zhang et al., 2023) has capabilities for all three SEM power analysis methods just described. There are also versions of WebPower for Android devices and for online use within a Web browser (Zhang & Yuan, 2018).8 The R package powerMediation can estimate power for tests of mediation effects in logistic, Poisson, and Cox (proportional hazards) regression and also for the Sobel test with continuous variables (Qiu, 2021). See Feng and Hancock (2023) for descriptions of additional procedures and computer tools for a priori power analysis in SEM.

6 https://yilinandrewang.shinyapps.io/pwrSEM/
7 http://quantpsy.org/rmsea/rmsea.htm
8 https://webpower.psychstat.org/
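The Monte Carlo method itself needs nothing more than a simulation loop. The sketch below uses lavaan's simulateData(); the population values loosely echo the standardized parameters in Figure 10.2(a), but the disturbance variances and the loop details are this sketch's assumptions, not the book's analysis 2. The target is the power of the test of the fitness → illness path:

library(lavaan)

pop.model <- '                      # population model with assumed values
  fitness ~ 0.20*exercise
  stress  ~ -0.20*hardy
  illness ~ -0.10*fitness + 0.20*stress
  exercise ~~ 0.10*hardy
  exercise ~~ 1*exercise
  hardy    ~~ 1*hardy
  fitness  ~~ 0.96*fitness
  stress   ~~ 0.96*stress
  illness  ~~ 0.95*illness
'
fit.model <- '
  fitness ~ exercise
  stress  ~ hardy
  illness ~ fitness + stress
'
set.seed(90210)
reps <- 1000
sig <- logical(reps)
for (i in seq_len(reps)) {
  d <- simulateData(pop.model, sample.nobs = 373)
  f <- sem(fit.model, data = d)
  pe <- parameterEstimates(f)
  p <- pe$pvalue[pe$lhs == "illness" & pe$op == "~" & pe$rhs == "fitness"]
  sig[i] <- isTRUE(p < .05)
}
mean(sig)   # proportion of samples detecting the path = estimated power

The same loop can collect global fit statistics per replication (e.g., via fitMeasures()) to study power at the level of the whole model.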

The Kelley–Lai precision method by Kelley and Lai (2011) generates the minimum sample sizes needed to estimate ε, the parameter for the RMSEA statistic, within a target margin of error, ω, specified by the researcher. The input for the method includes dfM, the confidence level (i.e., 1 – α), ω, and a presumed population RMSEA value, ε*. The value of ε* could be specified based on the findings of prior empirical studies conducted with a similar population, meta-analytic results, or fixed thresholds of the kind described earlier for the RMSEA, such as ε* = .05 for "close fit" (Appendix 10.A). Kelley and Lai (2011) suggested values for ω over the range .02–.05, where .02, .035, and .05, respectively, correspond to the qualitative descriptors "minimum" (unnecessarily narrow), "ideal," and "maximum" (unnecessarily imprecise). If uncertainty remains about values for ε* or ω, the method can be repeated over a range of plausible alternative values.

Results of Monte Carlo studies by Kelley and Lai (2011) over varying types of models, degrees of misspecification, and sample sizes indicate that their RMSEA-based precision method was most accurate when the specified value ε* is closer to the real value ε and when sample RMSEA statistics follow noncentral chi-square distributions. (Recall that this is probably unlikely in small samples; Curran et al., 1996, 2003.) There are tables in Kelley and Lai (2011, p. 24) for looking up minimum sample sizes over a range of ε* and ω values for models where dfM = 30 or 60. Their precision method is also implemented in the MBESS package for R (Kelley, 2022). A limitation is that the method assumes normally distributed, continuous outcomes (i.e., it is based on default ML estimation).
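In MBESS, the Kelley–Lai method amounts to a single function call. The sketch below uses the scenario reported later for the detailed example (presumed ε* = .04, ω = .035, dfM = 5); the function and argument names reflect my reading of the MBESS documentation, so verify them against your installed version:

library(MBESS)

# Minimum N so that the expected width of the 95% CI for the RMSEA is no
# greater than .035 when the population RMSEA is .04 and df = 5
ss.aipe.rmsea(RMSEA = .04, df = 5, width = .035, conf.level = .95)

If this matches how analysis 2 was set up, the result should be near the N = 2,760 reported below for the detailed example.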

Power and Precision Estimates
for the Detailed Example

The annotated script file and output file for estimating power and precision in the continuing example listed in Table 10.1 for analysis 2 can be downloaded from the book's website. All analyses assumed continuous outcomes and default ML, but the WebPower package can also simulate nonnormal distributions analyzed by robust ML.

Reported in Table 10.3 are power estimates for the exact-fit, close-fit, not-close-fit, and poor-fit hypotheses in the MacCallum–RMSEA method at α = .05 for the recursive path model of illness.

TABLE 10.3. Power Analysis Results Based on the MacCallum–RMSEA Method for a Recursive Path Model of Illness

Fit hypothesis   Type   ε0    ε1    Power at N = 373   Minimum N for power ≥ .90
Exact            AS     0     .05   .338               1,319
Close            AS     .05   .08   .317               1,997
Not close        RS     .05   .01   .229               1,527
Poor             RS     .10   .05   .613                 758

Note. AS, accept–support; RS, reject–support. The criterion level of statistical significance is α = .05.

Results for the original sample size, N = 373, are presented in the fifth column of the table, and low power is apparent for all tests. For example, if the model does not have exact fit in the population, there is only a .338 chance over random samples of detecting this outcome with the chi-square test. Only the test of the poor-fit hypothesis has somewhat higher power at the original sample size, or .613; that is, if the model does not have poor fit in the population, the significance test based on the RMSEA would be correct 61.3% of the time. The minimum sample sizes needed in order for power to be at least .90 for tests of the exact-fit and poor-fit hypotheses are, respectively, 1,319 and 758 (last column in the table). Exercise 5 asks you to interpret results for the test of the close-fit hypothesis in Table 10.3.

Displayed in Figure 10.2(a) are standardized values for all parameters of the Roth et al. (1989) recursive model of illness. These values are not empirical estimates; instead, they are based on plausible values, given the variances of the observed variables in the model (see Table 4.3). Also, the standardized path coefficients in Figure 10.2(a) are generally more conservative than their empirical counterparts (see Figure 8.1). They are treated here in this pedagogical example as a priori standardized parameters in the Monte Carlo simulation method. Following guidelines for starting values outlined in Topic Box 9.1, values for the unstandardized parameters in Figure 10.2(b) are based on the corresponding standardized parameters in Figure 10.2(a). Computer simulations were conducted with the unstandardized values over 1,000 generated samples at α = .05, N = 373, and default ML.

[Figure 10.2 appears here: path diagrams of the recursive path model of illness with (a) standardized parameters and (b) unstandardized parameters.]

FIGURE 10.2. Standardized and unstandardized parameters for Monte Carlo simulations of a recursive path model of illness.

Estimates of power for significance tests of individual variances or covariances, such as the three disturbance variances in Figure 10.2(b), were all basically 1.0 (i.e., significant test outcomes are practically guaranteed, assuming the model as specified is true). Power estimates for tests of individual coefficients for direct or indirect effects are reported in Table 10.4 along with values of coverage, or the proportion of 95% confidence intervals that contain the parameter over generated samples—all are < 1.0, but are still generally high. For example, estimated power exceeds .90 for 3 out of 4 significance tests of direct effects and for 1 out of 2 tests of indirect effects in the model—see the table. Other results are less satisfactory. For instance, the power for the Sobel test of the indirect effect of exercise on illness through fitness is only .349.

TABLE 10.4. Power Analysis Results Based on the Monte Carlo Method for Individual Direct or Indirect Effects in a Recursive Path Model of Illness

Effect                              Power   Coverage
Direct effects
Exercise → Fitness                  .962    .938
Hardy → Stress                      .988    .949
Fitness → Illness                   .520    .957
Stress → Illness                    .975    .938

Indirect effects
Exercise → Fitness → Illness        .349    .931
Hardy → Stress → Illness            .906    .931

Note. The criterion level of statistical significance is α = .05.

Results in Tables 10.3 and 10.4 reflect an overall trend in SEM that power for significance tests of the whole model may be low when there are relatively few degrees of freedom (here, dfM = 5), even for a sample size that is large enough (N = 373) for reasonable power in tests for individual model parameters (Wang & Rhemtulla, 2021). In general, for models with only one or two degrees of freedom, sample sizes in the thousands may be required in order for power at the model level to be > .80 or so (MacCallum et al., 1996, p. 144). If an analysis in global fit testing has a low probability of rejecting a false model or detecting a true model, this fact should temper the researcher's enthusiasm for their preferred model.

Although the Kelley–Lai precision method is not based on significance testing, it does concern the whole model instead of individual parameters in the model. The result in this method is that a sample size of N = 2,760 is needed so that the width of 95% confidence intervals in samples is no greater than ω = .035, assuming the population RMSEA is ε* = .04 (see the output file for analysis 2 in Table 10.1). This target sample size is just over 7 times larger than that in the original study, or N = 373 (Roth et al., 1989).

SUMMARY

Although optimal strategies for global fit testing are still debated in the SEM literature, there is reasonable agreement that some widespread practices are inadequate. One such practice involves ignoring a failed exact-fit (chi-square) test in samples that are not very large, and another is the claim for "good" fit based on values of approximate fit indexes that exceed—or, in some cases, fall below—fixed thresholds based on prior computer simulation studies while ignoring the residuals. A problem with fixed thresholds is that values of approximate fit indexes depend in part on factors that have little to do with the approximate truth of the model. Perhaps flexible thresholds for approximate fit indexes will turn out to be more accurate, but they would still be no substitutes for direct and thorough inspection of the residuals for possible sources of misspecification. If the statistical power to reject a false model (e.g., exact-fit test) is low, then little support for the researcher's model is indicated even if power for tests of individual parameters is adequate. The next chapter deals with comparisons among multiple structural equation models all fitted to the same data.
LEARN MORE

The classical work by Box (1976) is a must-read for researchers who test statistical models, Greiff and Heene (2017) warn against overreliance on fixed thresholds for approximate fit indexes, and Hayduk (2014) cautions against ignoring the model chi-square test in SEM.

Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.

Greiff, S., & Heene, M. (2017). Why psychological assessment needs to start worrying about model fit. European Journal of Psychological Assessment, 33(5), 313–317.

Hayduk, L. A. (2014). Shame for disrespecting evidence: The personal consequences of insufficient respect for structural equation model testing. BMC Medical Research Methodology, 14(1), Article 124.

EXERCISES

1. Interpret χ²ML = 0 versus RMSEA = 0 for the same model and data.

2. Calculate χ²ML from the log-likelihood values in Table 10.2.

3. Calculate the RMSEA and CFI from information reported in Table 10.2.

4. In lavaan, specify the baseline model, fit it to the data, and generate χ²B(9) = 165.608 in Table 10.2.

5. Interpret results for the test of the close-fit hypothesis in Table 10.3.
Appendix 10.A

Significance Testing Based on the RMSEA

If the lower bound of the 90% confidence interval equals zero (i.e., ε̂L = 0), the model chi-square test will not reject at the .05 level the exact-fit hypothesis

    H0: ε0 = 0    (10.17)

The p value for an accept–support test of the exact-fit hypothesis equals that of χ²ML(dfM) for the same model and data.

Some SEM computer tools also print p values for the test of the close-fit hypothesis, or the one-tailed null hypothesis

    H0: ε0 ≤ .05    (10.18)

Failure to reject the close-fit hypothesis supports the researcher's model; otherwise, a model could fail the more stringent exact-fit test but pass the less demanding close-fit test. Hayduk et al. (2005) described such models as close-yet-failing models. Passing the close-fit test does not justify ignoring a failed exact-fit test. As noted by Hayduk (2014, p. 3), "Even extremely close covariance fit—whether . . . assessed by an index value or a confidence interval around an index [i.e., RMSEA]—may have resulted from a model containing multiple serious misspecifications," and this is true for reasons explained earlier in the chapter.

The not-close-fit hypothesis is an inversion of the close-fit hypothesis. It is expressed as

    H0: ε0 ≥ .05    (10.19)

If the upper bound of the 90% confidence interval is less than .05 (i.e., ε̂U < .05), then the hypothesis of not-close fit is rejected, which supports the researcher's model. This means that (1) the test of the not-close-fit hypothesis is a reject–support test, and (2) low power works against the researcher's model. Greater power here implies a higher probability of detecting a reasonably correct model, or at least one that predicts a covariance matrix that approximates the sample data matrix within the limits of close fit.

If the upper bound of the 90% confidence interval equals or exceeds a value that might indicate poor fit, such as ε̂U ≥ .10, then the model may warrant less confidence. For example, the poor-fit hypothesis

    H0: ε0 ≥ .10    (10.20)

is a reject–support test of whether the researcher's model is just as bad or even worse than a poor-fitting population model. The test of the poor-fit hypothesis can serve as a kind of reality check against the test of the close-fit hypothesis. The tougher exact-fit test can serve this purpose, too.

Do not expect results for these various significance tests to be consistent for the same model and data. Based on the results from lavaan and listed next for the detailed example,

    χ²ML(5) = 11.107, p = .049
    ε̂ = .057, 90% CI [.003, .103], pε0≤.05 = .336

we can say for this example that the recursive path model of illness

1. fails the exact-fit test because p < .05 and ε̂L > 0;
2. passes the close-fit test because pε0≤.05 > .05;
3. fails the not-close-fit test because ε̂U > .05; and
4. fails the poor-fit test because ε̂U > .10.

The only way to resolve these apparent contradictions in significance testing is to consider the entire 90% confidence interval, which says for this example that the point estimate of ε̂ = .057 is so imprecise that it is just as consistent with the close-fit hypothesis as it is with the poor-fit hypothesis. A larger sample may be needed to obtain more precise results.
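All of the quantities used in these four decisions can be read directly from lavaan output. A brief sketch, again assuming a fitted object named fit (a placeholder name):

library(lavaan)

fitMeasures(fit, c("chisq", "df", "pvalue",   # exact-fit test (10.17)
                   "rmsea",                    # point estimate
                   "rmsea.ci.lower",           # compare with 0
                   "rmsea.ci.upper",           # compare with .05 and .10
                   "rmsea.pvalue"))            # close-fit test (10.18)

# The not-close-fit (10.19) and poor-fit (10.20) decisions follow from the
# upper bound of the 90% confidence interval, as described above.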

Yuan et al. (2016) noted that the failure to reject the null hypothesis in the standard chi-square test (Equations 10.3, 10.5) does not imply that the model is correctly specified or that the degree of misspecification is properly controlled. Instead, the interpretation is simply that there is insufficient covariance evidence to reject the exact-fit hypothesis at a predefined level of sampling error. As an alternative, Yuan et al. (2016) described equivalence testing, which is based on T-size indexes that define a minimum tolerable degree of misspecification based on the observed value of χ²ML for the researcher's model. Because χ²ML is part of the formulas for both the RMSEA and CFI (Equations 10.11, 10.15), the corresponding indexes in equivalence testing are T-size RMSEA and T-size CFI, which are designated next as, respectively, RMSEAt and CFIt.

For normally distributed data, the value of RMSEAt is just ε̂U, the upper bound of the 90% confidence interval based on the RMSEA for the target model. If

    RMSEA < RMSEAt

then we can say with 95% confidence that the size of the misspecification is no more than the amount indicated by RMSEAt. That is, if the researcher can tolerate a misspecification of RMSEAt, then they can proceed with the target model with 95% confidence (Yuan et al., 2016). There is a similar rationale for CFIt, except that the observation

    CFI > CFIt

indicates that the population value for the CFI exceeds that of CFIt, with 95% confidence (Marcoulides & Yuan, 2017).

The Yuan et al. (2016) method also converts the conventional fixed thresholds for the RMSEA and CFI used to describe varying degrees of model–data correspondence for normally distributed data into dynamic thresholds. For example, the fixed thresholds (.01, .05, .08, and .10) for the RMSEA supposedly distinguish between, respectively, excellent, close, fair, mediocre, and poor model fit. Each of the four static cutoff values just listed is converted to dynamic values, designated as RMSEAe, based on the sample size and the model degrees of freedom. Corresponding dynamic thresholds for the CFI, or CFIe, are also derived. Equivalence testing involves the specification of pairs of one-sided null hypotheses, one for the RMSEA and the other for the CFI, each of which contrasts the sample values with their respective dynamic thresholds. If the observed values of the RMSEA and CFI both fall within their respective "fair" rescaled benchmarks or worse, then model fit is deemed to be unacceptable. I urge great caution in applying this part of the method because the unscaled (fixed) thresholds for the RMSEA and CFI are basically qualitative "T-shirt" descriptors for model–data correspondence that probably do not generalize to all models or data.
11

Comparing Models

Researchers often compare alternative structural equation models comprised of the same variables and fitted
to the same data. The most frequent context occurs when a single initial model is tested over a series of steps.
At each step, the initial model is respecified either by adding ≥ 1 free parameters, which generally improves
fit, or by dropping (fixing to zero) ≥ 1 free parameters, which generally worsens fit. A pair of alternative
models so specified are called nested models, because the simpler of the two models, or the constrained
model, is a proper subset of the more complex model, or the unconstrained model. A different context occurs
when there are ≥ 2 initial models such that (1) each model is based on a different theory, and (2) the alterna-
tive models are not nested in their relation to each other. In both contexts, the choice between competing
models should be guided at least as much by conceptual bases as statistical considerations. A related topic
is the evaluation of equivalent models that fit the same data just as well as the researcher’s model but with dif-
ferent configurations of paths between the same variables. Although identical in both global fit and local fit,
there may be rational grounds to prefer one equivalent model over another. Given the widespread neglect of
equivalent models in the literature, readers can greatly distinguish their own SEM studies by directly address-
ing this issue in written reports.

NESTED MODELS

Two nested structural equation models are hierarchically related such that all conditions listed next are true (Lai et al., 2016; Levy & Hancock, 2007):

1. The constrained model of the pair is generated by placing ≥ 1 restrictions on the unconstrained model. Thus, the degrees of freedom for the constrained model (C) exceed those for the unconstrained model (U), or dfC > dfU.

2. The quantity dfC – dfU equals the number of restrictions imposed on the unconstrained model to generate the constrained model. It also equals the difference in the number of free parameters over the two models.

3. The free parameters of the constrained model are a subset of those in the unconstrained model. It is also true that the fixed parameters of the unconstrained model are a subset of those in the constrained model.

4. Over the two models, χ²U ≤ χ²C; that is, restricting the unconstrained model either increases the chi-square for the resulting constrained model or there is no change. This means that probability distributions that could be implied by the constrained model are also implied by the unconstrained model, which could also imply additional distributions inconsistent with the constrained model.

A hierarchical relation as just defined corresponds to parameter nesting (Bentler & Satorra, 2010), where a free parameter in the unconstrained model is (1) fixed to zero, which eliminates the corresponding effect in the constrained model, or is (2) added to a constraint specified by the researcher that reduces the effective number of free parameters, but the corresponding effect remains in the constrained model. Consider the unconstrained path model U with the direct effects listed next:

    X → Y1 → Y2 and X → Y2

The respecification X → Y2 = 0 drops this path from model U and generates the constrained model C1, which is nested under model U. A different respecification of model U imposes the equality constraint

    X → Y2 = Y1 → Y2

which tells the computer that the unstandardized direct effects of X and Y1 on Y2 are identical, so only one path coefficient is needed, not two. The constrained model just defined, C2, has one less free parameter than model U and is also nested under U even though C2 has all paths included in U. Next, we consider how to test hypotheses about nested models as just defined.
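In lavaan syntax, the three models just described might look like this (the data frame name d is a placeholder; user-defined labels such as b implement the equality constraint):

library(lavaan)

m.u  <- ' Y1 ~ X
          Y2 ~ X + Y1 '        # unconstrained model U
m.c1 <- ' Y1 ~ X
          Y2 ~ Y1 '            # C1: X -> Y2 fixed to zero (path trimmed)
m.c2 <- ' Y1 ~ X
          Y2 ~ b*X + b*Y1 '    # C2: equality constraint, one less free parameter

fit.u  <- sem(m.u,  data = d)
fit.c1 <- sem(m.c1, data = d)
fit.c2 <- sem(m.c2, data = d)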

BUILDING AND TRIMMING

Nested models fitted to the same data are compared in one of two ways. Both represent specification searches for correct models. In model building or a forward search, the initial model is a relatively bare-bones, constrained model that represents only the most essential hypotheses, given the substantive theory. Less central hypotheses are intentionally left out of the initial model (i.e., the researcher prioritizes the hypotheses). If the fit of the initial model is poor—which might be of little surprise—then free parameters that represent excluded hypotheses are added to the initial model over a series of steps in order of importance (i.e., initially excluded hypotheses should be prioritized, too).

As models are built, fit generally improves (χ²ML decreases) as free parameters are added, even if those respecifications are wrong. That is, observing successively lower χ²ML values as more and more parameters are specified as free in model building is not evidence by itself that the model is becoming increasingly more correct. This means that closer fit does not automatically mean closer to truth in model building. This reality highlights the need for solid ideas when building (or trimming) models. Building could hypothetically continue until dfM = 0, at which point a just-identified model would perfectly fit the data.

In model trimming or a backward search, the researcher begins with a more complicated, unconstrained model and then, over a series of steps, restricts it by eliminating free parameters (a previously free parameter is fixed to equal zero) or by imposing constraints in estimation (e.g., two previously free parameters are fixed to equality). Hypotheses should be prioritized but now in reverse order of importance; that is, free parameters are fixed or equality-constrained for effects of lower priority before doing so for effects of higher theoretical importance.1 The initial model in trimming should be consistent with the data; otherwise, there is little point in constraining it further. As a model is trimmed, its overall fit to the data usually becomes worse (χ²ML gets larger). The logical stopping point for trimming occurs in the model just before the point where the next, more constrained model does not fit the data, but model trimming requires good ideas for guidance just as much as model building.

1 The alternative of dropping from the model effects with higher priority first leaves the less important effects in the model without the possibility of dropping them in a linear sequence of trimming, if removing those high-priority effects appreciably degrades fit.

The goal of both model building and model trimming is to find the model that has a properly specified covariance structure (and mean structure, too, if present) and is theoretically justifiable. Ideally, the outcome of building versus trimming for the same variables and data would converge to the same model, but this is not guaranteed. A risk in both approaches is hypothesizing after the results are known, or HARKing (Kerr, 1998), where a model retained in a purely exploratory specification search is falsely presented as though it was hypothesized from the beginning. An antidote for HARKing is to preregister the analysis plan (i.e., the planned steps for model building or trimming and rationale; Chapter 3).

Relative strengths of model building versus trimming are summarized next:

1. The starting point in model building is a simpler model, which may make it easier for SEM novices to determine whether it is statistically identified, compared with a more complex initial model
(trimming). This could prevent a frustrating round of effort to diagnose a failed analysis, if the initial model is not identified. It is also easier to make a mistake in syntax or in a drawing editor when specifying a more complicated model.

2. Model trimming may be a better option for measurement models where observed variables are specified as indicators of a smaller number of common factors. This point is elaborated in Chapters 15 and 22, but trimming may be more successful than building if the measurement model is correctly specified first (Chou & Bentler, 2002; Chou & Huh, 2012).

EMPIRICAL VERSUS THEORETICAL
RESPECIFICATION

Models can be built or trimmed according to one of two different standards, theoretical (rational) or empirical. The first provides tests of specific, a priori hypotheses. Let's revisit the unconstrained path model U with the direct effects listed next:

    X → Y1 → Y2 and X → Y2

If the researcher believes that the effect of X on Y2 is purely indirect through Y1, then it is possible to test this hypothesis by constraining the coefficient for the path X → Y2 to zero (the path is trimmed). If the fit of the model so constrained is not appreciably worse than the unconstrained one with the path X → Y2 as a free parameter, the hypothesis about a purely indirect effect is supported, assuming the corresponding directionality specifications are correct. The main point here, however, is that respecification of a model to test nested versions of it is guided by specific hypotheses. Jöreskog (1969) made the same point about model building instead of trimming:

    When to stop fitting additional parameters cannot be decided on a purely statistical basis. This is largely a matter of the experimenter's interpretations of the data based on substantive theoretical and conceptual considerations. Ultimately the criteria for goodness of the model depends on the usefulness of it and the results it produces. (p. 201)

Model building guided by substantive hypotheses is analogous to hierarchical multiple regression, where the researcher specifies the order of predictor entry into the equation. Model trimming corresponds to entering all predictors into the regression equation at the first step but then manually removing predictors over subsequent steps, again based on the researcher's hypotheses (Chou & Huh, 2012).

This is not the case for empirically based respecification, in which free parameters are deleted or added based on statistical criteria. For example, if the sole basis for trimming paths is that their coefficients are not statistically significant, then respecification is guided by completely empirical considerations. The comparable strategy in multiple regression is backward elimination, where the computer—not the researcher—selects the predictor with the smallest partial correlation with the outcome variable for removal from the equation, if a criterion for statistical significance is satisfied (e.g., p ≥ .05). Empirical model building corresponds to forward selection in regression analysis, where the computer enters the predictor with the largest partial correlation, again based on a statistical criterion (e.g., p < .05). The distinction between theoretically or empirically based respecification in SEM has implications for interpreting the results of model building or trimming, which are considered after a model comparison test is introduced.

CHI-SQUARE DIFFERENCE TEST

The chi-square difference statistic, χ²D, can be used to test the statistical significance of the decrement in global fit as free parameters are constrained in model trimming or the improvement in fit as free parameters are added in model building. As its name suggests, χ²D is simply the difference between the χ²ML values of two nested models fitted to the same data. Its degrees of freedom, dfD, equal the difference between the respective values of dfM. The χ²D statistic tests the equal-fit hypothesis for two hierarchically related models; specifically, smaller values of χ²D lead to the failure to reject the equal-fit hypothesis, but sufficiently large values result in its rejection. In model trimming, rejection of the equal-fit hypothesis suggests that the model has been restricted too much. But the same outcome in model building supports retention of the free parameter that was just added at that step. Ideally, the least constrained of the two models compared with χ²D should fit the data reasonably well; if not, then it makes little sense to compare the relative fit of two nested models, neither of which adequately explains the data.
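In lavaan, the difference test is automated. Continuing the earlier nested-model sketch (fit.u and fit.c1 are the hypothetical fitted objects defined there), the numeric example that follows can be reproduced in the same way:

library(lavaan)

lavTestLRT(fit.c1, fit.u)   # chi-square difference, dfD, and its p value
# anova(fit.c1, fit.u) gives the same result; for robust (scaled) chi-squares,
# lavTestLRT() applies a scaled difference method rather than subtracting the
# scaled statistics directly (see Topic Box 11.1).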

Suppose for an overidentified model 1 that

   chiML1 (5) = 18.300, p = .003

A direct effect is added to the model as a free parameter (dfM is reduced by 1), and the result for model 2 is

   chiML2 (4) = 9.100, p = .059

Given both results,

   dfD = 5 – 4 = 1
   chiD (1) = 18.300 – 9.100 = 9.200, p = .002

which says that the overall fit of model 2 with an additional path is statistically better than that of model 1 at the .05 level. Model 2 also passes the chi-square test at the .05 level (p = .059), but the residuals must be inspected before deciding whether to retain this least-constrained model. That is, a model that "passes" the chi-square difference test is not automatically retained without considering all information about fit, both global and local.
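In R, this hand calculation is brief; a minimal sketch is listed next (in lavaan, the anova() function applied to two nested fitted models reports the same test automatically):

# Chi-square difference test for the example just described
chiD <- 18.300 - 9.100                      # chiML1 - chiML2
dfD  <- 5 - 4                               # dfM1 - dfM2
pchisq(chiD, df = dfD, lower.tail = FALSE)  # p of about .002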
In the example just considered, the chi-square difference test is univariate because it concerned a single free parameter (dfD = 1). When two nested models that differ by two or more free parameters are compared (dfD ≥ 2), the chi-square difference test is a multivariate test of all added (or deleted) paths together. If p < .05 for chiD in this case, at least one of the corresponding parameters may be statistically significant at the .05 level if tested individually, but this outcome is not guaranteed.

In robust ML estimation, the difference between scaled chi-squares adjusted for nonnormality (e.g., chiSB) for two nested models cannot generally be interpreted as a statistic that tests the equal-fit hypothesis. This is because such differences do not approximate central chi-square distributions under conditions of nonnormality. Described in Topic Box 11.1 are methods to calculate a scaled chi-square difference statistic, chiSD, which follows approximate chi-square distributions. Exercise 1 asks you to conduct the scaled chi-square difference test for a pair of nested models. The researcher in robust estimation can always compare the relative fit of two nested models, each fitted to the same data, based on each model's set of approximate fit indexes (RMSEA, CFI, SRMR, etc.), but these comparisons are not significance tests, and there are few clear-cut guidelines, so interpreting any differences may be quite subjective in this context. If the simpler model has obviously worse correspondence with the data based on values of approximate fit indexes, the more complex model would be preferred. This assumes that the local fit of the more complex model is acceptable.

The interpretation of chiD (unscaled difference) or chiSD (scaled difference) as a test statistic depends in part on whether the new model is derived empirically or theoretically. For example, if individual paths that are not statistically significant are dropped from the model, it is most likely that the chi-square difference statistic will also not be significant. But if the deleted path is also predicted in advance to be zero, then the difference test is of utmost interest. If respecification is driven entirely by empirical criteria such as statistical significance, the researcher should worry—a lot, actually—about capitalization on chance for two reasons. First, it can and does happen in significance testing that a result in a particular sample is significant due only to sample-specific variation; that is, the significant outcome is a Type I error unlikely to be found in another sample. Similarly, the coefficient for a path that corresponds to a true nonzero causal effect may not be significant, and its exclusion would be a Type II error.

Second, overreliance on significance testing in a single sample can lead to the garden of forking paths, which refers to the existence of multiple ways to approach hypothesis testing combined with analysis plans that are contingent on the data in a particular sample (Gelman & Loken, 2014). Researchers often make several informal decisions about things such as how data are processed, which covariates are analyzed, or how cases are included versus excluded that altogether generate many possible choices. These "forking paths" can proliferate even more if results of earlier analyses influence what is analyzed in later analyses. For example, observing a significant result at a particular step, the researcher may decide to conduct additional tests aimed at "explaining" that result. In SEM, larger models may present the possibility for a relatively large number of respecifications, each of which leads to even more alternatives that could further change the model in building or trimming. At the end of the analysis, the findings can be so specific to a particular sample and the series of decisions behind those findings that replication is unlikely. A way to protect against this problem is to base respecification more on theoretical guidance than on outcomes in significance testing (Chou & Huh, 2012).



TOPIC BOX 11.1

Scaled Chi‑Square Difference Tests


The Satorra and Bentler (2001) method can be used to calculate by hand a scaled chi-square difference statistic when comparing two hierarchical models in robust ML estimation. It is assumed next that model 1 is constrained relative to model 2 (i.e., dfM1 > dfM2), the unscaled chi-squares are chiML, and the scaled chi-squares are chiSB, the Satorra-Bentler test statistic:

1. Calculate the unscaled chi-square difference statistic and its degrees of freedom in the usual way, that is:

   chiD = chiML1 – chiML2  and  dfD = dfM1 – dfM2 (11.1)

2. Recover the scaling correction factor, c, for each model, as follows:

   c1 = chiML1 / chiSB1  and  c2 = chiML2 / chiSB2 (11.2)

3. Calculate the scaled chi-square difference statistic chiSD, as follows:

   chiSD = chiD / [(c1 dfM1 – c2 dfM2) / dfD] (11.3)

where the probability for chiSD (dfD) in a central chi-square distribution is the p value for the scaled chi-square difference test.
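A minimal R sketch of the three steps just listed is shown next; the function name is hypothetical, and the input values would come from fitting both nested models with a robust ML estimator:

# Scaled chi-square difference test (Equations 11.1-11.3)
scaled_chisq_diff <- function(chiML1, chiML2, chiSB1, chiSB2, dfM1, dfM2) {
  chiD  <- chiML1 - chiML2                         # Equation 11.1
  dfD   <- dfM1 - dfM2
  c1    <- chiML1 / chiSB1                         # Equation 11.2
  c2    <- chiML2 / chiSB2
  chiSD <- chiD / ((c1 * dfM1 - c2 * dfM2) / dfD)  # Equation 11.3
  c(chiSD = chiSD, dfD = dfD,
    p = pchisq(chiSD, df = dfD, lower.tail = FALSE))
}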

It can happen in small samples, or when the more constrained model is very wrong, that the denominator of chiSD in the method just described is < 0, which invalidates the test. Satorra and Bentler (2010) described a revised scaled chi-square difference test that avoids negative values, but it requires numerical information that may not be available in the output of all SEM computer tools. Bryant and Satorra (2012) provided syntax for EQS, Mplus, and LISREL for implementing the new difference test. Asparouhov and Muthén (2013) described the implementation of the scaled chi-square difference test in Mplus, which is based on the Satorra and Bentler (2001) statistic when it is greater than zero and otherwise equals the strictly positive Satorra and Bentler (2010) statistic. Other options include the freely available program SBDIFF.EXE for Windows platform computers (Crawford, 2007), and a web page for calculating chiSD is available (https://www.thestatisticalmind.com/). The "difftest" option in Mplus automatically computes values of scaled chi-square difference statistics for robust ML estimators and special WLS methods for ordinal outcomes (Muthén & Muthén, 1998–2017) described in a later chapter, and the "lavTestLRT( )" method in lavaan performs a similar function (Rosseel et al., 2023).
In a recent computer simulation study of nested 2-factor CFA models with continuous indicators (5 or 10 per factor) and correlated errors over generated samples that varied in size (N = 100, 200, 500, 1,000) and degrees of nonnormality, Pavlov et al. (2020) compared the accuracies of three scaled chi-square difference tests, two based on mean-adjusted test statistics (Satorra & Bentler, 2001, 2010) and the third based on mean-and-variance adjusted statistics (Asparouhov & Muthén, 2013), all as implemented in Mplus. All three scaled difference tests just mentioned outperformed the test based on chiD, the unscaled chi-square difference statistic, which increasingly favored the more complex of two nested CFA models as nonnormality increased, model size increased, or sample size decreased, when in fact the two models had comparable fit. Scaled difference tests based on mean-and-variance adjusted chi-squares were generally more accurate than the methods based on mean-adjusted chi-squares, especially for larger measurement models.
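As a sketch of the lavaan route mentioned earlier, assuming fit1 and fit2 are two nested models fitted to the same data with a robust estimator (e.g., sem(..., estimator = "MLM")):

# lavTestLRT() applies a scaled chi-square difference test by default
# when the models were fitted with a robust ML estimator
lavTestLRT(fit1, fit2)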





MODIFICATION INDEXES AND RELATED STATISTICS

The issue of capitalization on chance in SEM is especially relevant when the researcher uses an automatic modification option available in some SEM computer tools such as LISREL (Jöreskog & Sörbom, 2021). These completely exploratory procedures drop or add parameters according to empirical criteria such as statistical significance at the .05 level of a modification index (MI) or score test, which is calculated for constrained parameters (including those fixed to zero) in the researcher's model. A modification index is actually a univariate Lagrange Multiplier (LM) (named after the Italian mathematician and astronomer J.-L. Lagrange), which is expressed as a chi-square statistic with a single degree of freedom. It approximates the amount by which chiML would decrease if a particular fixed-to-zero or equality-constrained parameter were freely estimated; that is, an MI estimates chiD (1) for freeing a single parameter. Thus, the greater the value of an MI, the greater the predicted improvement in global fit if that effect were added to the model. Likewise, a multivariate LM estimates the effect of allowing ≥ 2 constrained parameters to be freely estimated.

The automatic modification procedure in LISREL (option "AM") will respecify the model by freeing at each step the single constrained parameter with the largest MI value. It will continue to add parameters if any MI is statistically significant at the level specified by the researcher, such as .05; otherwise, the procedure stops, and the model at the last step is the final model. Except for an option that prevents specific parameters from being changed, automatic modification in LISREL is purely empirical and directly analogous to forward selection in multiple regression, which selects predictors entirely based on significance test outcomes.

I cannot recommend automatic modification, especially in small samples where the method may so heavily capitalize on chance that the final model has little, if any, prospect for replication. Two early computer simulation studies by MacCallum (1986) and Silvia and MacCallum (1988) offer pertinent cautionary tales about automatic specification searches. They took known structural equation models, imposed specification errors on them, and evaluated the erroneous models using data generated from populations in which the known models were true. In MacCallum (1986), models were respecified using empirically based methods, such as MI values. Most of the time the changes suggested by purely statistical methods were wrong, which means that they did not typically recover the true model. This pattern was even more apparent in small samples (e.g., N = 100). Silvia and MacCallum (1988) followed a similar procedure except that the use of automatic modification was guided by theory, which improved the chances of recovering the true model. See Chou and Huh (2012, pp. 239–243) for additional examples of specification searches guided by theory.

There are three additional cautions about MI statistics:

1. The computer may print the value of an MI for an "illegal" parameter, such as a covariance between a measured exogenous variable and a disturbance. If you actually tried to add that parameter, the analysis would fail.

2. Values of MI statistics may be printed for parameters that, if actually freed, would make the respecified model not identified.

3. Each individual MI assumes that the model as analyzed is correctly specified except for the constrained parameter associated with that index. These assumptions are contradictory over a whole set of MI values for the same model.

The first and second cautions just listed are explained by the fact that MI values estimate chiD (1) for freeing the corresponding parameter. These estimates are not derived by the computer actually freeing the corresponding model parameter and rerunning the analysis. Instead, the computer uses a shortcut method based on linear algebra that "guesses" at the value of chiD (1), given the data matrix and estimates for the more restricted (original) model.

The Wald test (after mathematician Abraham Wald), based on the statistic W, is used in model trimming. A univariate W estimates the amount by which the overall chiML would increase if a particular freely estimated parameter were fixed to zero (trimmed); that is, a univariate W estimates chiD (1) for dropping (not adding) the corresponding effect from the model. A value of W that is not significant at, say, the .05 level predicts a decrement in global model fit that is not significant at the same level. Model trimming that is empirically based would thus delete paths with W statistics that are not significant, but, as mentioned, actually doing so may just capitalize on chance. A multivariate W approximates chiD for constraining ≥ 2 parameters in the model.
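In lavaan, a Wald test for a fitted model can be requested directly; a brief sketch, assuming the path of interest was given the label b1 in the model syntax (e.g., Illness ~ b1*Stress + Fitness):

# Wald test: estimated increase in chiML if the freely estimated
# parameter b1 were fixed to zero (model trimming)
lavTestWald(fit, constraints = "b1 == 0")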




All the diagnostic test statistics just described are affected by sample size. Accordingly, even a trivial change in global model fit due to adding or dropping a free parameter could be statistically significant in a large sample. In addition to noting the level of significance of an MI, the researcher should also consider the absolute magnitude of the change in the coefficient for the parameter if it is allowed to be freely estimated, or the expected parameter change, which depending on the SEM computer tool used may be available in both unstandardized and standardized form. If the expected change (i.e., from zero) is considered trivial, the statistical significance of the MI may reflect the sample size more than it does the magnitude of the corresponding effect (Kaplan, 2009).

INTELLIGENT AUTOMATED SEARCH STRATEGIES

Automatic modification in LISREL is a "dumb" empirical search strategy: It frees model parameters one at a time based solely on p values from MI statistics. There are "smart" empirical search strategies in SEM based on heuristics in artificial intelligence that employ automated rule-based methods to store and manipulate extant information in order to deduce new information, or model respecifications. Examples of a few intelligent causal discovery methods are described next—see Marcoulides and Ing (2012) for more information.

The genetic algorithm is an adaptive heuristic search method that respecifies models through successive "generations" from parent model to child models. Generation and evaluation of child models follows genetic principles where "bad" solutions are allowed to die (be eliminated) and where the whole population undergoes evolutionary changes over iterations of the procedure (Chou & Yang, 2013). Ant colony optimization methods aim to maximize fit by converging on the correct model in ways that mimic how ants forage for food by accumulating pheromones on the shortest route, stimulating other ants to take the same path. In test construction, for example, virtual "ants" select items instead of routes, where each item has a corresponding "pheromone level" that determines the probability of being selected. Potential items are evaluated through several cycles of selection (Olaru et al., 2019). Tabu search can be seen as a variation on the method of best-subsets regression, where the computer in regression analysis attempts to find the optimal subset of predictors for a single outcome based on statistical criteria set by the researcher (Sheather, 2009). To avoid repetition, the method employs a Tabu list in which recent selections are recorded as a simulated short-term memory, which may prevent the method from terminating with a solution that is a local but suboptimal maximum (i.e., there are better solutions). Marcoulides and Falk (2018) described a package for R, ShortForm (Raborn & Leite, 2020), that performs Tabu search and ant colony optimization for CFA models. A different R package, autoSEM (Jacobucci, 2016), performs genetic algorithm, ant colony, and Tabu specification searches for CFA models.

These intelligent specification searches tend to work best in very large data sets that allow for the possibility of replication. Applied in small samples, related methods such as machine learning can yield rates of correct classifications that depart appreciably from expected rates in very large samples (Combrisson & Jerbi, 2015). Such methods also work better when guided by good hypotheses about presumed causes, so in this way smart specification searches are not magic. Brandmaier and Jacobucci (2023) described two machine learning approaches as adjuncts to theory-based model building: regularized SEM, which is described in Chapter 17 in the context of applying SEM in small samples, and SEM trees. Briefly, the latter is a multivariate search technique that combines features of exploratory and more confirmatory analyses. It recursively partitions a dataset into subsets that explain the greatest differences in relation to target parameters in the researcher's model. Doing so supports detection of heterogeneity among the cases with respect to potential covariates (Brandmaier et al., 2013). If appreciable heterogeneity is detected, the sample is split into subgroups and a multiple-group analysis is performed, the logic of which is explained in the next chapter. Bollen (2022) described machine learning and related computational methods as potentially setting new directions for SEM, but time will tell whether this possibility is fulfilled.

MODEL BUILDING FOR THE DETAILED EXAMPLE




Recall that the Roth et al. (1989) recursive path model of illness (Figure 8.1) does not have acceptable fit (e.g., Tables 8.2, 9.4, and 10.2). Reported in Table 11.1 are values of MI statistics for all parameters that could be added to the original path model, yet remain recursive. These results are from analysis 1 in Table 10.1 using lavaan. Values of the modification indexes in Table 11.1 are listed in descending order by magnitude. This means that freeing parameters at the top of the list would lead to greater estimated reductions in chiML than freeing parameters lower in the list. Also reported in Table 11.1 for each effect are (1) estimated parameter changes, both unstandardized and standardized; and (2) chiD (1), or the actual reduction in the model chi-square after the corresponding parameter is freely estimated. Although MI statistics only estimate chiD (1), that approximation for this example is generally close.

The respecification in Table 11.1 that would reduce chiML by the largest estimated amount that is also statistically significant (MI = 5.372, p = .021) is to add the direct effect

   Stress → Fitness

In the original model, the direct effect just listed is constrained to zero (Figure 8.1). The estimated change after freeing this parameter in the unstandardized solution is –.061; that is, an increase in stress of 1 point in its raw score metric predicts a decrease in fitness of .061 points in its original units, while controlling for the other direct cause of fitness, exercise. In the standardized solution, the estimated parameter change is –.111, which says that for every increase in stress of 1 full standard deviation, the level of fitness is expected to decrease by .111 standard deviations, again while controlling for exercise.

But a different respecification, adding to the original model a direct effect between stress and fitness but in the opposite direction, or

   Fitness → Stress

is predicted to reduce the model chi-square by nearly the same amount, which is also significant (MI = 5.110, p = .024)—see Table 11.1. A third respecification that would reduce the model chi-square by a smaller expected but still significant amount is to allow the disturbances of fitness and stress to covary (MI = 3.907, p = .048). Exercise 2 asks you to interpret the results in Table 11.1 for the parameter just mentioned. All remaining respecifications in the table are predicted to reduce the model chi-square by smaller amounts, none of which is statistically significant.

Now, which respecified model just described is correct—if any? It does make sense that fitness could affect the experience of stress: People who are in better physical shape may better withstand stress (Fitness → Stress). But is it not also plausible that stress could affect fitness (Stress → Fitness), or that fitness and stress may have common unmeasured causes (correlated disturbances)? Without theory as a guide, there is no way to meaningfully select among the three alternative respecifications just mentioned or among even others. Also, the respecifications in Table 11.1 with MI values that are not significant at the .05 level were all calculated with no additional paths between fitness and stress, but the omission of a parameter that involves both variables could be a specification error.

TABLE 11.1. Modification Indexes, Estimated Parameter Changes, and Actual Chi-Square Difference Statistics for a Recursive Path Model of Illness

                                       Estimated parameter change
Effect               MI      p      Unstandardized   Standardized   chiD (1)
Stress → Fitness     5.372   .021   –.061            –.111          5.424
Fitness → Stress     5.110   .024   –.207            –.114          5.170
DF ↔ DS              3.907   .048   –56.529          –.102          3.972
Hardy → Fitness      2.939   .087   .040             .082           2.950
Hardy → Illness      2.466   .116   –.125            –.077          2.477
Exercise → Stress    1.276   .259   –.029            –.057          1.278
Exercise → Illness   .578    .447   .036             .039           .578

Note. MI, modification index; DF ↔ DS, covariance between the disturbances of fitness and stress.
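A sketch of how statistics like those in Table 11.1 can be generated in lavaan, where fit is a placeholder for the fitted path model of illness:

# Modification indexes with unstandardized ("epc") and fully
# standardized ("sepc.all") expected parameter changes,
# sorted in descending order by MI
mi <- modindices(fit)
mi[order(mi$mi, decreasing = TRUE),
   c("lhs", "op", "rhs", "mi", "epc", "sepc.all")]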




This is a limitation of any MI statistic: Specification error elsewhere in the model could affect its accuracy. Exercise 3 involves (1) fitting the path model in Figure 8.1 but with the added path Fitness → Stress to the data in Table 4.3, and (2) describing global and local fit for this version of a respecified model.

COMPARING NONNESTED MODELS

Next, we consider fitting two different models, each composed of the same variables, to the same data, but now where those two models are not hierarchically related as defined earlier; instead, they are nonnested models.2 A frequent context is when researchers compare ≥ 2 models, each of which is based on a different theory. The values of chiML from two nonnested models can be informally compared, but the difference between them cannot be interpreted as a test statistic; namely, the chi-square difference test (unscaled or scaled) does not apply. This is because the difference between the test statistics from nonnested models does not follow a central chi-square distribution. Although there have been efforts to develop significance tests for comparing nonnested models, including approaches based on bootstrapped empirical distributions for the RMSEA (Raykov, 2001), Bayesian estimation model correctness probabilities (Rust et al., 1995), or arbitrary distribution theory extensions of the likelihood ratio test (Golden, 2003), such methods are not widely used or result in complications in interpretation (Levy & Hancock, 2007).

A more practical alternative is the family of predictive fit indexes, also called information-theoretic criteria (Chapter 10). These fit indexes are not significance tests because their probability distributions over different types of models and data are generally unknown. Instead, their values reflect both model fit and model complexity such that these two attributes are balanced against each other. This means that a penalty for complexity is imposed so that model fit is adjusted for the number of free parameters. For example, given two different nonnested models with similar fit to the same data, the least complex model will be favored as the one most likely to generalize over replication samples. In this example, the value of the information criterion for the simpler model will be lower because a greater penalty for complexity was imposed on the fit of the more complex model. Thus, the model with the lowest information criterion is preferred. In contrast, the chi-square difference test indicates whether the more complex of two nested models is statistically better or that there is insufficient covariance evidence to prefer either one (Merkle et al., 2016). Two widely reported predictive fit indexes are described after a research problem is introduced next.

Presented in Figure 11.1 are two recursive path models of recovery after cardiac surgery evaluated by Romney et al. (1992). The psychosomatic model in Figure 11.1(a) represents the hypotheses that patient morale transmits the effects of neurological dysfunction and diminished socioeconomic status (SES) on illness symptoms and poor social relationships. The conventional medical model of Figure 11.1(b) represents a different pattern of causal relations among the same variables. Specifically, both illness symptoms and neurological dysfunction are specified as exogenous variables with direct effects on diminished SES, low morale, and poor relationships. Among these three endogenous variables, diminished SES is hypothesized to indirectly affect poor relationships through its prior impact on low morale. There are additional indirect effects in the conventional medical model from exogenous to endogenous variables. The models in Figure 11.1 are not nested, so the chi-square difference test cannot be used to directly compare them. Exercise 4 asks you to calculate dfM for Figure 11.1(b).

AIC and BIC

One of the best known predictive fit indexes based on ML estimation is the Akaike Information Criterion (AIC), named after the statistician Hirotugu Akaike. Confusingly, at least three different formulas for the AIC are presented in the SEM literature. The first is from Akaike (1974, p. 719):

   AIC = –2 ln L0 + 2q (11.4)

where L0 is the likelihood function maximized in ML estimation for the researcher's model and q is the number of free model parameters. Note that the penalty for complexity in Equation 11.4, 2q, becomes relatively smaller as the sample size increases (Mulaik, 2009b).

2 It is also theoretically possible to compare nonnested models based on different subsets of variables measured in the same sample, but such comparisons become less and less meaningful as the common set of variables gets smaller to the point where no variables are shared.
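A sketch of how the terms in Equation 11.4 can be recovered for any lavaan model fitted with ML (fit is a placeholder for a fitted model object):

# ln L0 ("logl") and q ("npar"); the result equals lavaan's own AIC
fm <- fitMeasures(fit, c("logl", "npar", "aic"))
-2 * fm["logl"] + 2 * fm["npar"]   # same value as fm["aic"]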




FIGURE 11.1. Alternative nonnested recursive path models of adjustment after cardiac surgery: (a) the psychosomatic model; (b) the conventional medical model.

The second formula is

   AIC2 = chiML + 2q (11.5)

which increases the model chi-square by twice the number of freely estimated parameters. Another way to express AIC2 is to substitute Equation 10.2 for chiML in Equation 11.5, or

   AIC2 = –2 ln L0 + 2 ln L1 + 2q (11.6)

where L1 is the ML likelihood function maximized for any just-identified version of the researcher's model that perfectly fits the data. Thus, the AIC2 adds the constant 2 ln L1 to the AIC for any model (compare Equations 11.4 and 11.6). Although their formulas are different, the key is that the relative change in the statistic over different nonnested models fitted to the same data is the same in both versions, and this change is a function of model complexity.

A third version of the Akaike information statistic is

   AIC3 = chiML – 2 dfM (11.7)

where twice the degrees of freedom are subtracted from the model chi-square. Thus, more complex models with fewer degrees of freedom are penalized with smaller reductions in their test statistics. The formula for AIC3 can also be expressed as

   AIC3 = AIC0 – AIC1 (11.8)




where AIC0 is the value of the AIC (Equation 11.4) for the researcher's model and AIC1 is the corresponding statistic for all just-identified versions of the same model (Mulaik, 2009b). Again, the relative difference in AIC3 for the same two nonnested models is unchanged compared with AIC and AIC2. The AIC and AIC3 can assume values < 0, but AIC2 values are always > 0. Check the documentation of your SEM computer tool to determine how it calculates Akaike's information criterion.

A different information-theoretic index that takes direct account of sample size is the Bayes Information Criterion (BIC) (Raftery, 1993; Schwarz, 1978). The formula is

   BIC = –2 ln L0 + q ln N (11.9)

Compared with the AIC (Equation 11.4), the BIC imposes a greater relative penalty for model complexity. Suppose that the number of freely estimated parameters is q = 10 and that N = 300. The AIC penalty equals 2(10), or 20.000 (Equation 11.4), but the BIC penalty for the same model is 10 (ln 300), or 57.038, more than twice as large compared with the AIC. Relative values of BIC penalties increase more slowly as sample size increases; that is, its penalty is asymptotic over ever larger samples (Mulaik, 2009b). A second formula for the BIC is

   BIC2 = chiML + q ln N (11.10)

where the model chi-square is increased by the same penalty factor, or q ln N (Preacher & Merkle, 2012). The relative difference between the BIC and BIC2 for the same two nonnested models remains the same, so they will both favor the same model (if there is a difference).

Although not obvious in their respective formulas, the AIC and BIC are actually based on different standards when comparing nonnested models, and thus they may not be interchangeable in their application. For instance, the AIC estimates a constant plus the relative distance between the fitted likelihood for the researcher's model and the unknown likelihood of the true model. In contrast, the BIC is a function of the posterior probability of a model being true from a more Bayesian perspective. The difference in the BIC for two models estimates the Bayes factor (BF), or the ratio of the likelihoods of the same data under each of two competing models. In Bayesian estimation, BF values are widely used to compare alternative hypotheses or models (Dienes, 2016). Both the AIC and BIC are based on distributional assumptions and asymptotic calculations that may not hold in real-world studies (Dziak et al., 2020).

When the AIC and BIC disagree in their selection among competing models, the AIC tends to favor more complex models than the BIC. If overfitting (i.e., the model is overly complex) is the most likely error, then the BIC may be preferred. But if there is greater concern about selecting a model that is too small (underfitting), then the AIC has advantages (Dziak et al., 2020). For example, a goal in exploratory factor analysis is to correctly estimate the number of latent variables (Preacher et al., 2013). If estimating too many factors is considered a more serious error than estimating too few factors, then the BIC may be a better choice than the AIC when comparing alternative factor models that are not nested. But if estimating too few factors is the more serious specification error, then the AIC may have the advantage. The results of simulations by Lin et al. (2017) with mediation path models are generally consistent with this pattern: Under conditions with smaller misspecified parameters, the AIC was generally more accurate because the BIC tended to favor overly parsimonious models due to its greater penalty. As the magnitude of misspecification increased, the BIC was more accurate because the AIC favored overly complex models due to its relatively smaller penalty.

Listed in Table 11.2 for analysis 1 are the script and output files for fitting in lavaan the two nonnested path models in Figure 11.1 to the data summarized in Table 11.3 from a sample of N = 469 patients with cardiovascular problems.

TABLE 11.2. Analyses, Script Files, and Packages in R for Maximum Likelihood Estimation of Two Nonnested Path Models of Adjustment After Cardiac Surgery

Analysis                                                    Script file
1. Compare two nonnested path models, psychosomatic        romney-nonnested.r
   model and conventional medical model
2. Equivalent versions of the conventional medical model   romney-equivalents.r

Note. The lavaan package was used for all analyses. Output files have the same names except the extension is ".out."




TABLE 11.3. Input Data (Correlations, Standard Deviations) for Analysis


of Nonnested Recursive Path Models of Recovery after Cardiac Surgery
Variable 1 2 3 4 5
1. Low Morale —
2. Illness Symptoms .53 —
3. Neurological Dysfunction .15   .18 —
4. Poor Relationships .52   .29 –.05 —
5. Diminished SES .30   .34   .23 .09 —

SD 3.75 17.00 19.50 3.50 24.70

Note. Correlations are from Romney et al. (1992); N = 469. Standard deviations for variables 1–5 are from,
respectively, Lawton (1975), Derogatis et al. (1976), Hinson et al. (2005), Fraley et al. (2000), and Stevens
and Featherman (1981).

Romney et al. (1992) did not report standard deviations, and the analysis of a correlation matrix with default ML is not recommended. For this pedagogical example, I specified the standard deviations in Table 11.3 based on actual measures of the variables in Figure 11.1, so these values are not arbitrary—see the table note for citations. Analyses for both models converged to admissible solutions. The syntax and output files for this analysis can be downloaded from this book's website.
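A sketch of the first analysis is listed next (the full version is the script romney-nonnested.r): the psychosomatic model is specified directly from the description of Figure 11.1(a), the conventional medical model would be specified analogously from Figure 11.1(b), and the variable names and the correlation matrix R are placeholders:

library(lavaan)

# Covariance matrix from the correlations and SDs in Table 11.3;
# R is the 5 x 5 correlation matrix with matching variable names
romney.cov <- cor2cov(R, sds = c(3.75, 17.00, 19.50, 3.50, 24.70))

# Figure 11.1(a): low morale transmits the effects of neurological
# dysfunction and diminished SES on symptoms and poor relationships
psychosomatic <- '
  lowmorale ~ neuro + ses
  symptoms  ~ lowmorale
  relations ~ lowmorale
'
fit1 <- sem(psychosomatic, sample.cov = romney.cov, sample.nobs = 469)
fitMeasures(fit1, c("chisq", "df", "aic", "bic"))   # cf. Table 11.4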
Values of selected fit statistics for the two alternative Romney et al. (1992) path models are listed in Table 11.4. It is no surprise that the global fit of the more complex conventional medical model (dfM = 3) is better than that of the simpler psychosomatic model (dfM = 5). But is the fit advantage of the more complex model enough to offset the penalty for having fewer degrees of freedom imposed by predictive fit indexes? Yes: The conventional medical model has lower values on both the AIC and BIC, so it is preferred over the psychosomatic model. The same preference is indicated by the relative values of AIC2 and BIC2 in the table. Exercise 5 asks you to reproduce the values of the predictive fit indexes in Table 11.4, and Exercise 6 asks you to evaluate the local fit of the conventional medical model (see the output file for analysis 1; Table 11.2).

TABLE 11.4. Values of Selected Fit Statistics for Two Nonnested Recursive Path Models of Adjustment after Cardiac Surgery

Statistic        Psychosomatic, Figure 11.1(a)   Conventional medical, Figure 11.1(b)
dfM              5                               3
q                10                              12
ln L0            –8,572.844                      –8,554.222
chiML            40.488, p < .001                3.245, p = .355
AIC              17,165.687                      17,132.444
AIC2             60.488                          27.245
BIC              17,207.193                      17,182.251
BIC2             101.994                         77.052
RMSEA [90% CI]   .123 [.090, .159]               .013 [0, .080]
CFI              .907                            .999
SRMR             .065                            .016

Note. q, number of free parameters; CI, confidence interval; N = 469.

Results of computer simulations by Preacher and Merkle (2012) indicated that model selections based on the BIC are subject to sampling errors in unexpected ways, perhaps so much so that claims of model superiority may not hold up. (This caution also applies to the AIC and related information criteria.) One reason is that, unlike most statistics, variation in the BIC actually increases with sample size. This is because values of the chiML part of the BIC (Equation 11.10) increase with sample size for misspecified models that are not just-identified. Also, model selection uncertainty, or sampling variation in the rank order of models based on BIC values, did not generally decrease with increasing sample size. Within the range of sample sizes analyzed by Preacher and Merkle (2012) (N = 80–5,000), there was considerable variation in model rankings.




Their results suggest that caution should be exercised about declaring a particular model selected by the BIC or related indexes as the clear "winner" over rival models. More recently, Lubke and Campbell (2016) described nonparametric bootstrapping as a method to quantify model selection uncertainty by fitting alternative models in generated samples, and Lubke et al. (2017) extended the bootstrap method to analyses of alternative models over multiple groups and mixture model comparisons. Results from both studies just mentioned seem promising, and the general problem of model selection is gaining attention in the SEM literature (Levy & Hancock, 2011; Merkle et al., 2016), but so far there is no magic statistical bullet for selecting the correct model among nonnested alternatives in single samples when no replication data are available—see also Preacher and Yaremych (2023), who described additional strategies and information criteria for model selection in SEM.

EQUIVALENT MODELS

If a retained model is selected from nested or nonnested alternatives, equivalent versions should be considered. Equivalent models have the same degrees of freedom (they are equally complex) but feature different causal directionalities among the same variables. The most general form is observational equivalence, which means that one model generates every probability distribution that can be generated by another model (Hershberger & Marcoulides, 2013). Levy and Hancock (2007) used the term completely overlapping to note that equivalent models share all the same distributions. A particular form for linear models fitted to covariance matrices is covariance equivalence, which means that every covariance matrix predicted by one model can also be generated by another model. Two covariance equivalent models also generate the same residuals and conditional independencies, or sets of vanishing correlations (Pearl, 2009). The latter property refers to d-separation equivalence. A related idea is moment equivalence, which occurs when two different models, regardless of the data, generate the same (1) predicted covariances, means, and correlations; (2) residuals; and (3) values of global fit statistics (Hershberger & Marcoulides, 2013).

In the SEM literature, the Lee–Hershberger replacing rules for generating equivalent versions of recursive structural models are probably the most familiar (Lee & Hershberger, 1990). Their use assumes that any model generated by them is identified. There are two parts to the replacing rules. The first involves just-identified (saturated) blocks of variables in a part of the structural model with all possible paths:

RULE 11.1 Within a saturated block of variables with no incoming effects, any direct effect can be reversed or substituted for a covariance or an equality-constrained reciprocal effect.

The second part of the replacing rules involves pairs of endogenous variables with a path between them:

RULE 11.2 Let X and Y be two endogenous variables with uncorrelated disturbances and the same direct cause(s): Any direct effect can be reversed or substituted for a disturbance covariance or an equality-constrained reciprocal effect.

Examples of applying the replacing rules are considered next.



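Before the detailed example, a generic sketch of Rule 11.1 in lavaan model syntax, with hypothetical variables A, B, Y, and Z, where the exogenous block {A, B} is saturated and has no incoming effects:

# m1 and m2 are equivalent under Rule 11.1: the exogenous covariance
# in m1 is replaced by a direct effect in m2. Fitted to the same data
# (with fixed.x = FALSE so that the A and B variances are free), the
# two models have identical chi-squares, degrees of freedom, and
# residuals.
m1 <- '
  Y ~ A + B
  Z ~ Y
  A ~~ B    # covariance within the saturated exogenous block
'
m2 <- '
  Y ~ A + B
  Z ~ Y
  B ~ A     # the covariance replaced by a direct effect
'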

Depicted in Figure 11.2(a) is the original Romney et al. (1992) conventional medical model shown with compact symbolism for disturbances. The other three models in the figure are equivalent versions generated by the replacing rules. For example, within the just-identified block of the two exogenous variables (i.e., there are no incoming effects) in the original model, the covariance is replaced by a direct effect from illness symptoms to neurological dysfunction in Figure 11.2(b) (Rule 11.1). In the respecified model just described, neurological dysfunction is no longer exogenous—it is endogenous, or an outcome. Because the endogenous variables of neurological dysfunction and diminished SES in Figure 11.2(b) have the same cause, or illness symptoms, the direct effect between them is reversed in Figure 11.2(c) (Rule 11.2). That is, the direct effect

   Neurological Dysfunction → Diminished SES

in Figure 11.2(b) is respecified as

   Diminished SES → Neurological Dysfunction

in Figure 11.2(c). In the model just mentioned, the pair of endogenous variables diminished SES and low morale have the same cause, illness symptoms. Thus, an equivalent version can be generated by reversing the direct effect between them (Rule 11.2). That is, the direct effect

   Diminished SES → Low Morale

in Figure 11.2(c) is replaced by

   Low Morale → Diminished SES

in Figure 11.2(d). This respecification also changes the status of diminished SES from endogenous in Figure 11.2(c) to exogenous in Figure 11.2(d).

All four models in Figure 11.2 have the same global fit to the data; for example, chiML(3) = 3.245 for each model. They all have the same local fit, too—specifically, their residuals are identical for each type, including raw, standardized, normalized, and correlation residuals. To verify these claims, please examine the output file for analysis 2 in Table 11.2, in which the equivalent models in the figure are all fitted to the same data (Table 11.3). Finally, the models in the figure are d-separation equivalent because they share the same union basis set of implied conditional independencies, or 3 in total for the 3 pairs of nonadjacent variables in each model (you should verify this statement). Thus, the models in Figure 11.2 are covariance equivalent, which is the best case for the replacing rules.
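A sketch of that verification (cf. the script romney-equivalents.r), where medical and equivalent1 are placeholder syntax strings for the models in Figures 11.2(a) and 11.2(b):

fit.a <- sem(medical,     sample.cov = romney.cov, sample.nobs = 469)
fit.b <- sem(equivalent1, sample.cov = romney.cov, sample.nobs = 469)
fitMeasures(fit.a, c("chisq", "df"))      # 3.245, df = 3
fitMeasures(fit.b, c("chisq", "df"))      # identical values
residuals(fit.a, type = "standardized")   # same residuals as for fit.b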

FIGURE 11.2. Four equivalent path models of adjustment after cardiac surgery shown with compact symbolism for disturbances: (a) original; (b) equivalent 1; (c) equivalent 2; (d) equivalent 3. Paths rendered with dashed lines are reversed or changed relative to the original model.

Although equivalent models have the same global and local fit, they differ in their sets of parameter estimates and patterns of statistical significance among those estimates—see the output file for analysis 2, Table 11.2. These results are expected because two equivalent models have different configurations of paths and, from a regression perspective, the roles of predictors or outcomes change over equivalent models.




It could happen for a pair of equivalent models that, say, the coefficient for the direct effect X → Y is significant in model 1, but the coefficient for the opposite direct effect, Y → X, is not significant in model 2. Thoemmes (2015) noted that there is no basis for preferring model 1 over model 2 in this case because significance testing outcomes do not indicate which of two equivalent models is correct. The reasons are explained in Chapter 20 about mediation analysis, where equivalent models are a major validity threat, but significance testing cannot generally detect the "true" model among equivalent versions.

Reapplying the replacing rules to the models in Figure 11.2 could generate even more equivalent models. Relatively simple models may have few equivalent versions, but more complicated ones may have hundreds or even thousands (MacCallum et al., 1993). In general, more parsimonious models tend to have fewer equivalent versions. You will learn in Chapter 14 that CFA models can have infinitely many equivalent versions; thus, it is unrealistic to expect that researchers generate all possible equivalent models. As a compromise, researchers should generate at least a few substantively meaningful equivalent versions. Unfortunately, even this limited step is usually neglected.

You should know that there are some problems with the replacing rules: They are not guaranteed to be transitive (Hershberger & Marcoulides, 2013). This means that reapplying the replacing rules to equivalent versions generated by previous applications of the rules is not guaranteed to spawn even more versions that are all equivalent. The replacing rules do not cover models with constrained parameters, such as equality constraints, and they do not evaluate whether a pair of target models are equivalent only for certain values of some of their parameters and nonequivalent for others (Raykov & Penev, 1999). A greater concern is that the replacing rules can generate a structural model that predicts a different set of conditional independencies than the original model. That is, applying the replacing rules can sometimes create or destroy implied conditional independencies in the respecified model compared with the original model (Pearl, 2009)—see Topic Box 11.2 for more information and an example.

COPING WITH EQUIVALENT OR NEARLY EQUIVALENT MODELS

Pearl (2009) reminded us that the existence of equivalent models is inevitable if we agree that causal relations cannot generally be inferred from data alone, that is, without assumptions. Specifying simpler structural models is one way to eliminate some equivalent versions. Listed next are suggestions for possibly eliminating even more equivalent models (Hershberger & Marcoulides, 2013; MacCallum et al., 1993; Williams, 2012):

1. Temporal precedence in experimental or longitudinal designs precludes reversing direct effects between causes manipulated or measured before outcomes (no retrocausality, or backwards causation in time).

2. If any part of a model analyzed in a cross-sectional design has been evaluated in prior experimental or longitudinal designs, results from those studies may help to rule out some causal orderings.

3. Certain variables, such as demographic characteristics or stable personality characteristics, may be unlikely or even impossible to be endogenous. For example, specifying a direct effect from an attitudinal variable to chronological age in years is illogical.

4. Some causal orderings may be theoretically doubtful, given the nature of the variables. For example, parental IQ may be more likely to affect child IQ than the reverse.

5. Variables specified as mediators must be potentially changeable; otherwise, they are unlikely mediators. For example, variables conceptualized as stable, relatively unchanging traits could be specified as causes, but not mediators (Topic Box 7.1).

6. Fixing some parameters to nonzero values compatible with theory or results from prior studies would rule out equivalent models involving those parameters. This is because such fixed values are not suitable for arbitrary reconfigurations of the paths or variables for which they were specified (Mulaik, 2009b).

7. Adding variables that are selectively associated with just a few of the other variables in the model can help to reduce the number of equivalent versions. Suppose that a model has the path X → Y. Adding a variable presumed to directly cause X but not Y means that X has a unique parent compared to Y, so Rule 11.2 would not apply if both X and Y were endogenous. This strategy must usually be implemented before the data are collected.




TOPIC BOX 11.2

When the Replacing Rules Fail


Presented in Figure 11.3(a) is an original path model with the direct effect X → Y. The three paths in this model between variables W and Z are listed next:

   W → X → Z
   W → Y ← U → Z
   W → X → Y ← U → Z

where U represents an unmeasured common cause of Y and Z (and thus replaces their disturbance covariance in the figure), and the specification that X is a direct cause of Y is underlined. The first path just listed is the only open path between W and Z. The remaining paths are blocked by the collider Y, and X is not a collider in either path, so both paths remain closed when controlling for X. Thus, the model in Figure 11.3(a) predicts the conditional independence

   W ⊥ Z | X

Because variables X and Y in Figure 11.3(a) have the same cause, W, we can apply Rule 11.2 and reverse the direct effect between them. The model so respecified is presented in Figure 11.3(b), and listed next are all paths between W and Z in this generated model:

   W → X → Z
   W → Y ← U → Z
   W → X ← Y ← U → Z

where the reversed direct effect is underlined. Now controlling for X closes the first path just listed, but doing so opens the third path, in which X is now a collider after reversing the direct effect between X and Y. The second path just listed is blocked by the collider Y, and controlling for X changes nothing here (the path remains closed). Thus, variables W and Z cannot be d-separated in Figure 11.3(b). Figures 11.3(a) and 11.3(b) are not d-separation equivalent, even though one model generates the other using the replacing rules.

FIGURE 11.3. An original path model (a) and a version generated by the replacing rules that is not d-separation equivalent (b).
Pearl (2009, pp. 145–149) described graphical criteria and rules for generating d-separation equivalent models, but they vary depending on whether the structural model is recursive or nonrecursive and also on the pattern of disturbance covariances, if any. Thus, they can be relatively complex to apply. Another alternative is to use a computer tool such as dagitty in its web browser or R package versions (Textor et al., 2016, 2021) to analyze both an original structural model and a second model generated by applying the replacing rules to the original: If the two models have different sets of implied conditional independencies, then they are not d-separation equivalent (e.g., Figure 11.3).
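A sketch of that check with the dagitty R package for the two models in Figure 11.3, where U is declared as latent:

library(dagitty)

g1 <- dagitty('dag {
  U [latent]
  W -> X
  X -> Z
  W -> Y
  X -> Y
  U -> Y
  U -> Z
}')
g2 <- dagitty('dag {
  U [latent]
  W -> X
  X -> Z
  W -> Y
  Y -> X
  U -> Y
  U -> Z
}')
impliedConditionalIndependencies(g1)  # W _||_ Z | X
impliedConditionalIndependencies(g2)  # none among the observed variables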

8. Although equivalent models have identical residuals at the variable level, residuals at the case level can vary over such models. Raykov and Penev (2001) suggested that models with lower standardized average individual case residuals would be preferred over equivalent versions with higher averages. A complication is that structural models with latent variables do not generate unique predictions for individual cases due to factor score indeterminacy, a concept explained in Chapter 14. Applying the Raykov–Penev method is more straightforward for manifest-variable path models, where case residuals are more directly analogous to regression residuals.

Apart from equivalent models, there may also be nearly equivalent models that do not generate the exact same predicted covariances or conditional independencies, but they are similar enough in fit that they are basically within the same equivalence class of models (Breckler, 1990). Nearly equivalent models may arise as alternative respecifications at a particular point in model building or trimming (e.g., Table 11.1) or with models each based on different theories but with very similar fits to the data. Thus, there is no specific rule for generating nearly equivalent models because their contexts can vary. In some cases, nearly equivalent models could be more numerous than truly equivalent models and thus a more serious potential threat than equivalent models. A method for estimating the relative fits of nearly equivalent models over a range of data values is described in the chapter appendix.

So far we have addressed comparing models based on the same variables and fitted to the same data in three main contexts: (1) parameter-nested models that are hierarchically related and based on the same theory; (2) nonnested models that are each based on different theories; and (3) equivalent models generated by replacing rules, where not all such models may be substantively meaningful or even plausible. Some additional, but less frequently encountered, contexts for comparing models with different relations to one another are summarized as an advanced topic in Appendix 11.A.

SUMMARY

Researchers often seek to select a structural equation model from a set of alternative models all fitted to the same data. The most frequent context occurs when hierarchically related models are compared, where the restricted model is nested within the less restricted model. The chi-square difference test in this case evaluates the equal-fit hypothesis. It is critical to apply the chi-square difference test and related diagnostic statistics, such as modification indexes, in ways that are guided by theory. Relying too much on empirical criteria, such as statistical significance, risks excessive capitalization on chance. The chi-square difference test cannot be used to compare alternative nonnested models, but predictive fit (information-theoretic) indexes can be applied to evaluate such models. If a model is retained, it is important to consider at least a few substantively meaningful equivalent models and to present arguments about why the researcher's model would be preferred over those equivalent versions. There are relatively new methods for estimating the relative global fit of alternative models that are partially overlapping (neither nested nor equivalent).




These methods may be helpful in understanding when two different models may have nearly equivalent or dissimilar fit over data or parameter spaces specified by the researcher. The next chapter is about analyzing structural equation models over multiple samples, each selected from a different population.

LEARN MORE

Dziak et al. (2020) consider relative advantages and drawbacks of the AIC, BIC, and related predictive fit indexes for comparing nonnested models; Hershberger and Marcoulides (2013) outline ways to deal with equivalent models; and Lai et al. (2017) describe graphical methods that estimate the relative fit of two models over various data or parameter values.

Dziak, J. J., Coffman, D. L., Lanza, S. T., Li, R., & Jermiin, L. S. (2020). Sensitivity and specificity of information criteria. Briefings in Bioinformatics, 21(2), 553–565.

Hershberger, S. L., & Marcoulides, G. A. (2013). The problem of equivalent structural models. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 3–39). IAP.

Lai, K., Green, S. B., & Levy, R. (2017). Graphical displays for understanding SEM model similarity. Structural Equation Modeling, 24(6), 803–818.

EXERCISES

1. Calculate the scaled chi-square difference statistic given the results listed next for two nested models:

   Model 1, dfM1 = 17, chiML1 = 57.50, chiSB1 = 28.35
   Model 2, dfM2 = 12, chiML2 = 18.10, chiSB2 = 11.55

2. In Table 11.1, interpret the results in the row for the respecification of correlated disturbances for fitness and stress.

3. Fit the path model of illness in Figure 8.1 to the data in Table 4.3 but with the additional path Fitness → Stress. Describe the global fit and local fit of this model (see also Table 11.1).

4. In Figure 11.1(b), calculate dfM.

5. Reproduce the values of the predictive fit indexes in Table 11.4 for both models.

6. Describe global and local fit for the conventional medical model in Figure 11.1(b) to the data in Table 11.3 based on output from analysis 1, Table 11.2.




Appendix 11.A

Other Types of Model Relations and Tests

A more complex type of nested relation between pairs of models is covariance matrix nesting, in which two models can take completely different forms, such as a path model and a CFA model for the same variables, but the set of possible covariance matrices for the constrained model is a subset of those for the unconstrained model. Moment matrix nesting has the same basic definition for models with both covariance and mean structures. If a pair of models is covariance-matrix nested, then the chi-square difference test could be conducted to directly compare their relative fits to the data (Bentler & Satorra, 2010).
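With lavaan, such a pair can be compared directly because the anova() method computes the chi-square difference test for nested fits. The sketch below is illustrative only; the three variables, the data frame dat, and the model syntax are hypothetical, not taken from the book's examples:

library(lavaan)

# A path model (dfM = 1) and a single-factor CFA model (dfM = 0)
# specified for the same three observed variables
path.model <- '
  X2 ~ X1
  X3 ~ X2
'
cfa.model <- '
  F =~ X1 + X2 + X3
'
fit.path <- sem(path.model, data = dat)
fit.cfa  <- cfa(cfa.model, data = dat)

# Chi-square difference test for the covariance-matrix-nested pair
anova(fit.path, fit.cfa)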
An example of covariance matrix nesting for different model types based on the same variables described by Bentler and Satorra (2010) is presented in Figure 11.4(a). The path model on the left, where dfM = 1, is actually nested under the single-factor CFA model on the right composed of the same variables and where dfM = 0. (Principles for counting free parameters in CFA models are covered in Chapter 14.) Although the two models in Figure 11.4(a) look completely different, the set of all possible predicted covariance matrices under the more restricted path model is a subset of all the possible predicted matrices under the less restricted CFA model. Because the two models in Figure 11.4(a) are nested, the chi-square difference test could be applied to directly compare their relative fits to the same data.

Bentler and Satorra (2010) described a method for nesting and equivalence testing (NET) based on bootstrapping or Monte Carlo simulation methods that evaluates whether two models are equivalent or nested (parameter or covariance matrix-based). The method could be especially useful when comparing models that are very different (e.g., Figure 11.4(a)) or when models are so large that it is difficult to visually analyze them or inspect their equations. Note that the NET method is data-dependent in that results can change if a different covariance matrix is specified at the beginning (Lai et al., 2017), but the method can be repeated with different matrices. Data dependence is not a unique limitation of the NET procedure because the very concept of model similarity has the same characteristic: Relative similarity in fit between two models varies over the range of data that could be collected and analyzed, so both model and data features must be considered together (Lai et al., 2016). Asparouhov and Muthén (2019) described applications of the NET procedure to measurement models and outlined its implementation for both continuous and categorical outcomes in Mplus.

Direct comparison of very different types of structural equation models is relatively rare, but we will consider in Chapter 14 how the test for a single-factor CFA model is relevant for manifest-variable path models: The failure to reject such a single-factor model in this context means that the variables measure only one domain (they do not show discriminant validity), so it may make little sense to proceed with the path analysis. For example, if a 1-factor CFA model were consistent with the data in Table 11.3, then neither 5-variable path model in Figure 11.1 would actually reflect five distinct target concepts or domains.

FIGURE 11.4. Two hierarchically related models based on covariance matrix nesting (a) and two nonnested but partially overlapping models (b). (Panel (a), covariance matrix nested models, involves variables X1–X3; panel (b), partially overlapping models, involves variables X1–X4.)

Partially overlapping models share some, but not all, probability distributions. Their common distributions can be obtained by constraining both models—not just one of the two models, as is true for pairs of nested models (Levy & Hancock, 2007). Likewise, there are no sets of constraints such that nonoverlapping models would share any distributions. An example of partially overlapping models is presented in Figure 11.4(b). Both 4-variable path models are equally complex (dfM = 2) and have a single disturbance covariance but for different pairs of variables, respectively, X2 and X3 (model on the left) versus X3 and X4 (model on the right). The two path models are clearly not nested, but constraining the covariance parameter in both models to zero would generate common probability distributions.

For any structural equation model, its best fitting probability distribution (BFPD) is the one that minimizes the difference from the true data generating process that creates values for observed variables. That true process is defined by the model and a set of parameter values that correspond to possible distributions where the model is correct (Levy & Hancock, 2007). In ML estimation, the BFPD is estimated as the distribution that maximizes the likelihood of the data. There are three general cases for the best fitting probability distributions of two partially overlapping models:

1. If the best fitting probability distributions of two partially overlapping models are equal—that is, they correspond to one of their shared distributions—then the two models are indistinguishable with regard to the data generating process. This means that they will have similar fit across a wide range of data; that is, they are not discriminable from one another (Lai et al., 2017).

2. The BFPD of the first model of the pair corresponds to a shared distribution, but the BFPD of the second model belongs to a unique distribution that is not shared with the first model. The models are distinguishable in this case, and the model with the unique BFPD—the one outside of the range of the other model—is expected to fit better.

3. The best fitting probability distributions of both models belong to unique distributions, and it is an open question whether one model will have better fit to the data generating process. That is, fit could be different or similar for the two models.

Levy and Hancock (2007) described a series of steps and significance tests to distinguish the three cases just listed for partially overlapping models. They involve fitting each of the two models to the data and then fitting to the same data the constrained version of the models that generates the same predicted covariances or means, such as either model in Figure 11.4(b) but with no disturbance covariance (a minimal code sketch follows the list below). Next, the chi-square difference test is applied to compare the relative fits of both original models to that of the constrained version just described. There are three possible outcomes:

1. If the equal-fit hypothesis is retained for both tests just described, the inference is that the models are indistinguishable and do not differ in fit.

2. If the equal-fit hypothesis is retained for just one of the two models, the inference is that the models are distinguishable in that the model that passed the chi-square difference test fits better and its BFPD is unique.

3. If both models fail the chi-square difference test, then it is unknown whether the models differ in their fit to the data generating process. The next step would be to test the relative difference in fit of the models to the data using Bayesian methods for nonnested models—see Merkle et al. (2016) for examples of how to apply such methods.
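A minimal lavaan sketch of these fitting steps, where mA and mB stand for syntax strings for the two partially overlapping models and m0 for their shared constrained version (e.g., either model in Figure 11.4(b) with the disturbance covariance fixed to zero); all object names here are hypothetical:

library(lavaan)

fitA <- sem(mA, data = dat)   # first partially overlapping model
fitB <- sem(mB, data = dat)   # second partially overlapping model
fit0 <- sem(m0, data = dat)   # shared constrained version

# Equal-fit tests of each original model against the constrained version
anova(fit0, fitA)
anova(fit0, fitB)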
Levy and Hancock (2011) extended the model comparison framework just outlined to analyses of models over multiple samples from known populations or over mixtures from an unknown number of populations.

Lai et al. (2017) described a quantitative and graphical approach to understanding model similarity over a particular data space that complements the categorical description of model relations as equivalent, nested, partially overlapping, or nonoverlapping. Their method builds on the insight by Preacher (2006) that fitting propensity, or a model's average ability to fit a range of different data patterns, has relatively little relation to the number of free model parameters. That is, two models of similar complexity in this regard can have very different potentials to closely explain data of varying characteristics.
The Lai et al. (2017) method also relies on quantitative measures of model similarity evaluated by Lai et al. (2016) that included ML fit functions (FML) and statistical criteria based on covariance residuals, or differences between observed and predicted covariances.

Steps in the Lai et al. (2017) method are summarized next (a minimal simulation sketch appears at the end of this appendix):

1. In Monte Carlo simulations, the two models to be compared are each fitted to a large number (e.g., 100,000) of generated correlation or covariance matrices. The matrices can be generated over a range of values specified by the researcher, such as r = .40–.60 for a population correlation. Only admissible solutions are retained.

2. The relative fit of each model to a data matrix is measured with a statistical criterion specified by the researcher, such as FML, RMSEA, CFI, or SRMR. Recall that the RMSEA and CFI are based in part on FML through the presence of chiML in their formulas (Equations 10.11, 10.15), but the SRMR is related to the correlation residuals.

3. Next, differences in fit statistics between the two models (e.g., ΔFML) can be plotted over a range of values for data, parameters, or residuals, including combinations of the characteristics just listed. These plots depict patterns of relative similarity or dissimilarity in fit between two models over the variations just mentioned.

There are critical limitations of the Lai et al. (2017) method. It deals solely with global fit; that is, it does not indicate whether local fit at the level of the residuals is satisfactory for the compared models. Global fit statistics such as the FML, RMSEA, and CFI are for normally distributed, continuous outcomes, so whether their method yields meaningful results for nonnormal outcomes, including categorical endogenous variables, is unknown. Some examples described by Lai et al. (2017) rely on fixed thresholds for approximate fit indexes that are dubious (e.g., RMSEA < .05 means "good" fit). On the plus side, the method is potentially useful for understanding when two models are nearly equivalent in fit and when they diverge in relative fit over data and parameter spaces of interest. It also provides a systematic way to better understand the roles of specific model parameters when considering respecifications based on modification indexes or related statistics. As I write, the Lai et al. (2017) method was not yet implemented in a freely available computer tool or R package, but I expect that situation to change.
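In the meantime, the basic logic of steps 1–3 can be scripted directly. The following is a rough sketch under simplifying assumptions (two arbitrary three-variable models, a single population correlation varied over r = .40–.60, and ΔFML as the criterion); it illustrates the logic of the method, not the authors' implementation:

library(lavaan)
set.seed(1)

m1 <- 'X2 ~ X1
       X3 ~ X2'
m2 <- 'X2 ~ X1
       X3 ~ X1'

nrep <- 1000    # use a much larger number in practice
diffs <- rep(NA, nrep)
for (i in 1:nrep) {
  r <- runif(1, .40, .60)    # population correlation in the data space
  P <- matrix(r, 3, 3)       # generated correlation matrix
  diag(P) <- 1
  dimnames(P) <- list(c("X1", "X2", "X3"), c("X1", "X2", "X3"))
  f1 <- try(sem(m1, sample.cov = P, sample.nobs = 1000), silent = TRUE)
  f2 <- try(sem(m2, sample.cov = P, sample.nobs = 1000), silent = TRUE)
  if (inherits(f1, "try-error") || inherits(f2, "try-error")) next
  # retain only admissible solutions
  if (!lavInspect(f1, "post.check") || !lavInspect(f2, "post.check")) next
  # lavaan's "fmin" equals FML/2; the scaling does not affect comparisons
  diffs[i] <- fitMeasures(f1, "fmin") - fitMeasures(f2, "fmin")
}
hist(na.omit(diffs))   # relative fit of the two models over the data space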



12

Comparing Groups

The capability to analyze a structural equation model across multiple groups extends even further the range
of hypotheses that can be tested in SEM. The main question in the analysis is whether values for target
parameters, or those of substantive interest, are appreciably different over groups; that is, does group mem-
bership moderate the directions or magnitudes of effects represented in the model? Perhaps the simplest
way to address these questions is to conduct a series of single-group analyses. This means that the model is
estimated within each of two or more different samples. Next, compare the unstandardized estimates across
the samples. Recall that unstandardized instead of standardized estimates should generally be compared
when the groups differ in their variabilities. For the same reason, covariance matrices (or the raw scores) for
each group should be analyzed when the model has only a covariance structure, and covariance matrices
and means (or the raw scores) should be analyzed when the model has both a covariance structure and a
mean structure. If the unstandardized estimates for the same parameter are meaningfully different, then the
populations from which the groups were sampled may not be equal on that parameter.
More sophisticated comparisons are available by using an SEM computer program that performs a
multiple-group analysis where the same model is simultaneously fitted to the data from all groups. Through
specification of cross-group equality constraints, group differences on any parameter can be tested. Cross-
group equality constraints instruct the computer to derive equal unstandardized estimates of that parameter.
The fit of the constrained model can be compared with that of the unconstrained model with the chi-square
difference test. If the fit of the constrained model is much worse than that of the unconstrained model, we
can conclude that the parameters may not be equal in the populations from which the samples were drawn.
This comparison assumes that the unconstrained model fits the data well in all groups, including both global
and local fit (i.e., the residuals); otherwise, there is little point in specifying a constrained model that would
fit the data even worse. Also, remember that estimates constrained to be equal in the unstandardized solu-
tion are typically unequal in the standardized solution. In general, standardized estimates should be directly
compared only across different variables within each sample.
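In lavaan, for example, both analytic approaches just described can be specified with a grouping variable. The following sketch is illustrative only; the data frame dat, the model variables, and the grouping variable region are hypothetical:

library(lavaan)

model <- 'Y ~ X + W'   # any structural model; names are hypothetical

# Single-group logic done simultaneously: all parameters free per group
fit.free <- sem(model, data = dat, group = "region")

# Cross-group equality constraints on the regression coefficients
fit.eq <- sem(model, data = dat, group = "region",
              group.equal = "regressions")

# Chi-square difference test: constrained vs. unconstrained model
anova(fit.eq, fit.free)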

The specific technique described in this chapter is multiple-group path analysis, which treats group membership (gender, ethnicity, marital status, etc.) as a categorical moderator variable. It is a special case of moderated path analysis (Edwards & Lambert, 2007) or conditional process analysis (Hayes & Rockwood, 2020), where moderators can be either categorical or continuous variables. We will deal with the more general conditional process analysis framework in Chapter 20, which covers enhanced mediation analysis, but I can say now that it provides a very flexible approach to estimating moderation and mediation in the same analysis. In multiple-group path analysis, the researcher can directly test whether presumed causal effects are conditional. A conditional direct effect is indicated when the unstandardized coefficient for, say, the direct effect X → Y varies in magnitude or direction over groups, such as basically zero for women versus an appreciable positive or negative coefficient for men. Likewise, a conditional indirect effect refers to the observation that the unstandardized coefficient for an indirect effect, such as

X → W → Y

varies appreciably over groups. A related term is moderated mediation (Edwards & Lambert, 2007; Preacher et al., 2007), which in this context implies that group membership moderates an indirect effect. In this presentation, though, I use the more neutral term "conditional indirect effect" because the hypothesis of mediation assumes a great deal more than just indirect causation (Topic Boxes 6.1, 7.1, and 8.1).

There are many examples of multiple-group path analysis in the research literature. For example, in a national study of Latino and Asian adults living in the United States, Molina et al. (2013) reported that (1) indirect effects of experiences of discrimination on self-rated health through psychological distress were generally more consistent across Latino subgroups, and (2) additional indirect effects through subjective social status were apparent for Puerto Rican men, who reported the highest levels of discrimination among all participants. Wang et al. (2022) found that both maternal and paternal "phubbing"—also called phone snubbing, or the act of using one's cell phone during face-to-face interactions—predicted levels of depression in both male and female adolescents (yes, more phubbing, more depression). They also reported that negative effects of phubbing by either parent on communication quality were greater for female adolescents, who may have been more likely to interpret parental phubbing as a sign of exclusion or rejection than male adolescents.

The basic logic of multiple-group path analysis, including the specification of cross-group equality constraints to test hypotheses about differences over populations, extends to the analysis of other kinds of structural equation models over multiple samples. There is also the advantage that manifest-variable path models are generally simpler than latent variable models, so path analysis is a good place to start. Reviewed next are topics in multiple-group analyses for any kind of structural equation model.

ISSUES IN MULTIPLE-GROUP SEM

Groups should occur naturally, such as when membership is based on geographic region, ethnicity, marital status, or diagnosis, among other examples of distinctions observed in the real world and based on more-or-less objective classification criteria. This definition does not include pseudo-groups, which are formed through arbitrary categorization (discretization) of continuous variables, such as "high" versus "low" groups based on a mean split (Kline, 2020a). Reasons why this practice is often a bad idea are briefly summarized next: Numerical information about individual differences is lost when a continuous variable is categorized. Specifically, all distinctions on the original variable among cases assigned to the same category, such as "low," are ignored after categorization. It can be difficult to detect true curvilinear or interactive effects after discarding so much quantitative information. Categorization can induce spurious effects due to the specific method of partitioning cases. For instance, apparent main or interactive effects after a mean split may not be observed following a median split for the same data (Altman & Sauerbrei, 2006; MacCallum et al., 2002; Rucker et al., 2015). It is generally better to analyze continuous variables in their original, quantitative form.

Some SEM computer tools, as an option, can print special standardized solutions in multiple-group analyses. For example, the LISREL program prints up to four different standardized solutions in this context (Jöreskog & Sörbom, 2018). The within-group standardized solution and the within-group completely standardized solution are both derived by standardizing each separate within-group covariance matrix, except that only factors are standardized in the former solution versus all variables in the latter solution. In the LISREL common metric standardized solution, factors are rescaled such that the sum of the average of their variances weighted by group size is 1.0, but all variables are so rescaled in the common metric completely standardized solution. Common metric standardized results may be more directly comparable over samples than within-group standardized results, but the unstandardized estimates are still preferred for this purpose. Check the documentation of your SEM computer tool to see how it calculates the standardized solution in a multiple-group analysis.

It can happen in multiple-group analysis that some of the standardized disturbance variances in a structural model or error variances in a measurement model are > 1.0 in the common metric completely standardized solution, but this outcome is not necessarily a problem (i.e., the solution may still be admissible). This is because common metric error terms in each group are essentially products of the corresponding within-group standardized error terms and the common metric variance (Pilgrim et al., 2006). So even if all standardized error variances are ≤ 1.0 in the within-group standardized solution, some of these terms could exceed 1.0 in the common metric standardized solution—see also the classic short work on standardized estimates in SEM by Jöreskog (1999).

DETAILED EXAMPLE FOR A PATH MODEL OF ACHIEVEMENT AND DELINQUENCY

Lynam et al. (1993) measured family socioeconomic status (SES), verbal IQ, test effort (i.e., examinee motivation during IQ testing), school achievement, and delinquency within samples of Black (n = 214) and White (n = 181) male adolescents ages 12–13 years. They were grade 4 students enrolled in urban public schools in the United States. They were also participants in a high-risk longitudinal study of early forms of delinquency, ranging from vandalism or theft within the home at less serious levels to shoplifting, vandalism, theft of a car, arson, or gang fighting at more serious levels of misconduct. The data for these groups are summarized in Table 12.1.

Depicted in Figure 12.1 with full McArdle–McDonald RAM graphical symbolism is a recursive path model with a mean structure (shown in dashed lines) based on the variables in Table 12.1. The covariance part of the figure is one of the path models tested by Lynam et al. (1993, p. 195). They did not analyze means, but next we do so in this example. Figure 12.1 represents the hypotheses that SES, effort, and verbal IQ are correlated causes that affect delinquency both directly and indirectly through school achievement. For example, adolescent boys with poor verbal ability may be more likely to drop out of school, which could contribute to delinquency because of reduced employment prospects or more unsupervised time on the streets.

The design of Lynam et al.'s (1993) study was cross-sectional with no temporal precedence in measurement, so the only basis for directionality specification is argument. Fortunately, these authors gave a detailed account of their hypotheses, especially about the assumption that verbal IQ causes delinquency instead of the reverse or that these two variables have a purely spurious association due to common causes. Briefly summarized, Lynam et al. (1993) argued that their participants were relatively young, which may rule out adverse effects of delinquency that could lower verbal IQ such as drug use, school dropout, or head injuries from fights, among other possibilities.

TABLE 12.1. Input Data (Correlations, Standard Deviations, Means) for a Multiple-Group Analysis of a Recursive Path Model of Achievement and Delinquency

                                                        Black
Variable           1      2      3      4      5       M        SD
1. SES             —      .08    .28    .05   –.11     31.96    10.58
2. Test Effort     .25    —      .30    .21   –.17      –.01     1.35
3. Verbal IQ       .37    .40    —      .50   –.26     93.76    13.62
4. Achievement     .27    .28    .61    —     –.33      2.51      .79
5. Delinquency    –.11   –.20   –.31   –.21    —        1.40     1.63

White
M                 34.64    .05  104.18   2.88   1.22
SD                11.53   1.32   16.32    .96   1.45

Note. Data are from Lynam et al. (1993). Black (above diagonal), n = 214; White (below diagonal), n = 181. The rightmost M and SD columns are for the Black sample; the bottom rows give M and SD for the White sample.
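The correlations, standard deviations, and means in Table 12.1 can be assembled into the summary-data input that lavaan expects for a multiple-group analysis. A sketch for the White sample follows; the object names are mine, and getCov() and cor2cov() are lavaan utility functions:

library(lavaan)

vars <- c("ses", "effort", "viq", "achieve", "delinq")

# White sample: below-diagonal correlations in Table 12.1
lower.white <- '
   1
    .25  1
    .37  .40  1
    .27  .28  .61  1
   -.11 -.20 -.31 -.21  1 '
R.white <- getCov(lower.white, names = vars)
S.white <- cor2cov(R.white, sds = c(11.53, 1.32, 16.32, .96, 1.45))
m.white <- c(34.64, .05, 104.18, 2.88, 1.22)

# After building S.black and m.black the same way (above-diagonal
# correlations), the multiple-group input would be
#   sample.cov  = list(black = S.black, white = S.white),
#   sample.mean = list(black = m.black, white = m.white),
#   sample.nobs = c(214, 181)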


They also cited results from prior longitudinal research that verbal reasoning is a relatively stable trait that predates serious antisocial acts by adolescent boys from lower SES parts of urban areas.

FIGURE 12.1. A recursive path model of achievement and delinquency evaluated across samples of Black and White male adolescents. (Path diagram with variables SES, Effort, VIQ, Achieve, and Delinq and disturbances DA and DD.)

The rationale just summarized is subject to criticism. For example, Block (1995) argued that Lynam et al. (1993) overlooked the potential role of impulsivity as a mediator for effects of verbal ability on delinquency. That is, Block (1995) argued that it is not compromised verbal ability per se that is causally related to delinquency; instead, reduced capacity to delay gratification, think before acting, or remain focused on a goal are more important factors in delinquency. Behavioral impulsivity was measured by Lynam et al. (1993) based on adolescent, teacher, and parent reports, but the score reliability for their composite measure was low, only .57. Block (1995) also suggested that the range of verbal ability among delinquent youth is so wide that low verbal IQ is not a necessary precondition. Thus, Lynam et al. (1993) gave a detailed-yet-not-indisputable account of their hypotheses. But to their credit, this depth of explanation about directionality specifications is missing from too many published SEM studies of cross-sectional data (Chapter 3).

Without constraints, the path model in Figure 12.1 is just-identified and would perfectly fit the data in both groups. Cross-group equality constraints were imposed on key parameters in this example to test the hypothesis that remaining in school is a relatively stronger protective factor against delinquency among young Black males compared with young White males, especially in lower SES urban areas with relatively more single-parent households where Black youth are disproportionately represented (Lynam et al., 1993; Wilson, 1987). The analysis strategy is outlined next:

1. For model 1, all 7 unstandardized direct effects are constrained to equality over the two groups. It was expected that model 1 would be inconsistent with the data if the effect of achievement on delinquency is different over the groups.

2. For model 2, the equality constraint for the direct effect of achievement on delinquency is released, which was expected to appreciably improve fit over that of model 1.

3. Assuming satisfactory global and local fit of model 2 in both groups, two additional models are tested. For model 3, the intercept for regressing achievement on SES, test effort, and verbal IQ is constrained to equality. For model 4, the intercept for regressing delinquency on SES, test effort, verbal IQ, and achievement is constrained to equality. Comparing the relative fits of models 3 and 4 to model 2 tests the equality of regression intercepts for the endogenous variables over groups—see Figure 12.1.

For model 1, the total number of observations is the sum across both groups. With v = 5 observed variables, there are 5(8)/2, or 20 observations in each group (5 variances, 10 unique covariances, 5 means; Rule 9.3), which makes 20 × 2, or 40 altogether. Free parameters include:

1. 3 variances and 3 covariances among measured exogenous variables (SES, effort, and verbal IQ) over both groups, or 6 × 2 = 12;

2. 2 disturbance variances for endogenous variables (achievement and delinquency) over groups, or 2 × 2 = 4;

3. 7 direct effects on the endogenous variables (achievement and delinquency), each constrained to pairwise equality over groups; and

4. 3 means of measured exogenous variables and 2 intercepts for endogenous variables over groups, or 5 × 2 = 10,

which altogether is 12 + 4 + 7 + 10 = 33. Thus, dfM = 40 – 33 = 7, which also equals the number of equality constraints for direct effects in model 1.

Model Comparisons

Listed in Table 12.2 for analysis 1 is the script file for fitting models 1–4 to the data in Table 12.1 for default ML estimation in lavaan. All script and output files in the table can be downloaded from this book's website, and all analyses converged to admissible solutions. The computer was unable to print standardized residuals for all models and pairs of variables, but normalized residuals are also included in the output.

TABLE 12.2. Script Files and Packages in R for Maximum Likelihood Estimation of a Recursive Path Model of Achievement and Delinquency Analyzed Over Multiple Groups

Analysis                                            Script file
1. Multiple-group path analysis, direct effects     lynam-multiple-grp.r
   or intercepts constrained to equality over
   groups
2. Generate conditional indirect effects in         lynam-indirect.r
   each group

Note. The lavaan package was used for all analyses. Output files have the same names except the extension is ".out."
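Although Table 12.2 points to the complete scripts, the core of the specification is how equality constraints are expressed. In lavaan, giving a parameter the same label in both groups forces equal unstandardized estimates. The following sketch uses my own labels and assumes summary-data objects covs and means (lists with one element per group), like those assembled in the sketch after Table 12.1; it is not a copy of the book's script:

library(lavaan)

# Model 1: identical labels equate each direct effect over groups
model1 <- '
  achieve ~ c(a1, a1)*ses + c(a2, a2)*effort + c(a3, a3)*viq
  delinq  ~ c(b1, b1)*ses + c(b2, b2)*effort + c(b3, b3)*viq +
            c(b4, b4)*achieve
'
# Model 2: distinct labels free the Achieve -> Delinq path per group
model2 <- '
  achieve ~ c(a1, a1)*ses + c(a2, a2)*effort + c(a3, a3)*viq
  delinq  ~ c(b1, b1)*ses + c(b2, b2)*effort + c(b3, b3)*viq +
            c(b4a, b4b)*achieve
'
fit1 <- sem(model1, sample.cov = covs, sample.mean = means,
            sample.nobs = c(214, 181), meanstructure = TRUE)
fit2 <- sem(model2, sample.cov = covs, sample.mean = means,
            sample.nobs = c(214, 181), meanstructure = TRUE)
anova(fit2, fit1)   # chi-square difference test, model 1 vs. model 2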
Reported in Table 12.3 are the values of selected global fit statistics for all models. Model 1 with 7 equality-constrained direct effects passes the chi-square test, or chiML(7) = 11.736, p = .110. The contributions to the overall model chi-square from each group are roughly comparable, or 6.456 in the Black sample and 5.279 in the White sample (see the output for analysis 1). Thus, model–data discrepancy as measured by chiML is not grossly dissimilar over groups. The upper bound of the 90% confidence interval based on the RMSEA for model 1 is .115, which is unfavorable, but results for other approximate fit indexes do not seem problematic (CFI = .975, SRMR = .036; see Table 12.3). Next, we examine local fit.

TABLE 12.3. Values of Selected Global Fit Statistics for Multiple-Group Recursive Path Models of Achievement and Delinquency

                                                       Model
Model  Equality constraints  chiML   dfM  p     comparison  chiD   dfD  p     RMSEA [90% CI]  CFI    SRMR
1      7 direct effects      11.736  7    .110  —           —      —    —     .059 [0, .115]  .975   .036
2      6 direct effectsa     6.107   6    .411  1 vs. 2     5.629  1    .018  .010 [0, .093]  .999   .029
3      6 direct effects,a    6.409   7    .493  2 vs. 3     .302   1    .583  0 [0, .083]     1.000  .030
       Achieve intercept
4      6 direct effects,a    10.237  7    .176  2 vs. 4     4.130  1    .042  .048 [0, .107]  .983   .033
       Delinq intercept

Note. Model 3 is retained.
a Achieve → Delinq free.

Reported in the top part of Table 12.4 are correlation residuals for model 1 in both groups, and values of standardized residuals are reported in the lower part of the table. These results suggest local fit problems even though model 1 passed the chi-square test. For example, an absolute correlation residual, shown in boldface in the table, exceeds .10 in the Black sample for the variables achievement and delinquency. The value is –.102, so model 1 underpredicts the sample correlation between these two variables by this amount. No absolute correlation residuals exceed .10 in the White sample, but there are a total of 6 standardized residuals over both samples (3 in each group) where p < .05. These results are generally associated with the relatively larger absolute correlation residuals. For example, the greatest absolute correlation residual in the White sample is .083 for achievement and delinquency (i.e., overprediction), and the corresponding standardized residual, 2.516, is significant at the .05 level. Based on all results described to this point, model 1 with all 7 equality-constrained direct effects is rejected.

TABLE 12.4. Correlation and Standardized Residuals for a Multiple-Group Recursive Path Model of Achievement and Delinquency with All Equality-Constrained Direct Effects

Variable          1       2      3       4        5
Correlation residuals
1. SES            —       0      0      –.073    –.034
2. Test Effort    0       —      0      –.009    –.012
3. Verbal IQ      0       0      —      –.042    –.017
4. Achievement    .080    .011   .042    —       –.102
5. Delinquency    .027    .011   .009    .083     —

Standardized residuals
1. SES            —       0      0      –2.917    –.984
2. Test Effort    0       —      0       –.682    –.472
3. Verbal IQ      0       0      —      –2.009    –.542
4. Achievement   2.611    .550  1.976    —       –2.541
5. Delinquency    .954    .446   .551   2.516     —

Note. These results are for model 1 in Table 12.3. Black (above diagonal), White (below diagonal). Values in boldface exceed .10 in absolute value for correlation residuals and exceed 1.96 in absolute value for standardized residuals.

In model 2, the equality constraint for the direct effect of achievement on delinquency is released—this coefficient is now freely estimated in both groups—while still imposing equality on the remaining direct effects, or 6 in total. Model 2 passes the chi-square test, or chiML(6) = 6.107, p = .411 (Table 12.3). Contributions to the overall chi-square from both groups are roughly comparable (Black, 2.989; White, 3.118). The reduction in the model chi-square for model 2 compared with model 1 is chiD(1) = 5.629, which is significant at the .05 level. Other results are favorable, too: Values of approximate fit indexes for model 2 are not generally problematic, and the largest correlation residual is .073 in the Black sample and .080 in the White sample (see the output file). There are no significant standardized residuals in the White group, but the computer was unable to generate all standardized residuals in the Black group. One is significant for the pair achievement and delinquency, but the corresponding correlation residual is only .012. None of the normalized residuals is significant in this group, but these tests are more conservative.

Results described so far are consistent with the hypothesis that the unstandardized direct effect of achievement on delinquency varies with group membership, but we next consider two additional models before we settle on a final choice for this example. These analyses are more exploratory because Lynam et al.'s (1993) path models had no mean structures. Model 3, relative to model 2, features equality-constrained intercepts for achievement. It was expected that constraining this parameter would marginally degrade global fit compared with model 2, for which the same intercept is freely estimated in both groups. This is because model 2 with equal direct effects for causes of achievement—SES, test effort, and verbal IQ—over the two groups is generally consistent with the data. Results for model 3 listed in Table 12.3 support this prediction. For example, when comparing models 2 and 3, chiD(1) = .302, which indicates practically no difference in the chi-square values over these models. They also have essentially the same correlation, standardized, and normalized residuals. But one more feature of model 3 needs comment.

Because the intercept for achievement is equality constrained in model 3, its whole mean structure is no longer just-identified; specifically, it has df = 1, so not all predicted means for the endogenous variables, achievement and delinquency, will exactly equal their observed counterparts.

This is because means on endogenous variables are functions of both their intercepts and means on the exogenous variables (Rule 9.5). In contrast, mean residuals for the exogenous variables—SES, test effort, and verbal IQ—will all equal zero because no constraints affected their predicted means. Reported in the top part of Table 12.5 are the observed and predicted means for the endogenous variables in the Black sample, and the corresponding results in the White sample are listed in the bottom part of the table. None of the mean residuals is zero, but they are relatively small. The standardized mean residuals are significance tests in the form of normal deviates (z) for the mean residuals, and these values are relatively small, too. Thus, model 3 closely predicts covariances and means in both groups, so it is retained.

In model 4, the intercept for regressing delinquency on SES, test effort, verbal IQ, and achievement is constrained to equality over groups plus all direct effects except that of achievement on delinquency, which is freely estimated in both samples. Because the direct effect just mentioned contributes to the intercept for delinquency, it was expected that constraining the intercept just mentioned to equality would appreciably worsen fit compared with model 2. This prediction is consistent with the results; specifically, although model 4 passes the chi-square test—chiML(7) = 10.237, p = .176—its fit is relatively worse than that of model 2—chiD(1) = 4.130, p = .042 (Table 12.3). Also, residuals for model 4 are similar to those for model 1; that is, unsatisfactory—see the output for this analysis (Table 12.2). Thus, model 4 is rejected.

Parameter Estimates and Conditional Effects

Listed in the top part of Table 12.6 for model 3 are values for parameters freely estimated in both groups. Results for the conditional direct effect of achievement on delinquency are shown in boldface: In the Black sample, the unstandardized coefficient for the direct effect of achievement on delinquency is –.493, and the corresponding estimate in the White sample is –.087. Exercise 1 asks you to interpret the path coefficients just mentioned. Thus, the unstandardized direct effect of achievement on delinquency, controlling for SES, effort, and verbal IQ, is about 5½ times greater among Black versus White male adolescents. This difference in effect size over groups was considered meaningful by Lynam et al. (1993, p. 193). When scores on all predictors (SES, effort, verbal IQ, and achievement) equal zero, the difference in intercepts for delinquency suggests a higher relative standing on this variable among Black male youth (4.380) than among White male youth (3.411), under the same conditions. (This statement assumes that the raw scores are not centered, or mean-deviated.) Exercise 2 concerns the results in Table 12.6 for the disturbance variances.

TABLE 12.5. Mean Residuals and Standardized Mean Residuals for Endogenous Variables in the Final Multiple-Group Recursive Path Model of Achievement and Delinquency

              Observed   Predicted   Mean       Standardized
Variable      mean       mean        residual   mean residual
Black
Achievement   2.510      2.525       –.015      –.641
Delinquency   1.400      1.392        .008       .068a

White
Achievement   2.880      2.857        .023       .510
Delinquency   1.220      1.222       –.002      –.146

Note. These results are for model 3 in Table 12.3.
a Normalized mean residual.
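In lavaan, residual summaries like those in Tables 12.4 and 12.5 can be requested with lavResiduals(). A sketch follows; fit3 stands for a hypothetical fitted multiple-group model object, and the exact element names may vary by lavaan version:

# Correlation residuals plus standardized versions, per group
res <- lavResiduals(fit3, type = "cor.bollen")
res[[1]]$cov      # correlation residuals, group 1
res[[1]]$cov.z    # standardized residuals, group 1
res[[1]]$mean     # mean residuals, group 1 (mean structure models)
res[[1]]$mean.z   # standardized mean residuals, group 1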

Listed in the bottom part of Table 12.6 are results for equality-constrained direct effects on achievement and delinquency. For example, the unstandardized coefficient for the direct effect of verbal IQ on achievement while controlling for SES and effort is .032 for both groups. Note also that the standard errors of the coefficient just stated, .003, are also equal across the groups even though the two samples are not the same size. This outcome for the standard errors of equality-constrained parameters is expected. Standardized estimates for the direct effect of verbal IQ on achievement vary over samples, though: Black, .538; White, .564. The two standardized path coefficients just listed are not the same because equality constraints are imposed in the unstandardized solution in this analysis, not in the standardized solution. The equality-constrained intercept for achievement, –.426 for both groups, is also reported at the bottom of Table 12.6.

Indirect effects of SES, test effort, and verbal IQ on delinquency through the intervening variable of achievement are computed in analysis 2 of Table 12.2. The first stages for all three indirect effects are constrained to equality. This is because the unstandardized direct effects of SES, test effort, and verbal IQ on achievement are forced to be equal over groups (see Table 12.6). But the second stage for all indirect effects, or the unstandardized direct effect of achievement on delinquency, is freely estimated in each group. Thus, each indirect effect is conditional, too. Reported in Table 12.7 are the coefficients (i.e., product estimators) for indirect effects in each group. Exercise 3 asks you to verify that (1) none of the unstandardized indirect effects are significant at the .05 level in the White sample, but (2) the unstandardized coefficient in the Black sample for the indirect effect of verbal IQ on delinquency through achievement, or –.016, is statistically significant at the same level. Exercise 4 asks you to interpret the result just mentioned in the Black sample.

TABLE 12.6. Maximum Likelihood Parameter Estimates for a Multiple-Group Recursive Path Model of Achievement and Delinquency

                               Black                     White
Parameter               Unst.   SE     St.       Unst.   SE     St.

Unconstrained estimates
Direct effects
Achieve → Delinq        –.493   .138   –.245     –.087   .124   –.057

Intercepts
Delinq                  4.380   .566   —         3.411   .586   —

Disturbance variances
Achieve                  .463   .045    .697      .580   .061    .668
Delinq                  2.306   .223    .855     1.881   .198    .913

Equality-constrained estimates
Direct effects
SES → Achieve           –.002   .003   –.031     –.002   .003   –.029
Effort → Achieve         .036   .029    .059      .036   .029    .051
VIQ → Achieve            .032   .003    .538      .032   .003    .564
SES → Delinq            –.003   .007   –.022     –.003   .007   –.029
Effort → Delinq         –.099   .059   –.081     –.099   .059   –.091
VIQ → Delinq            –.017   .006   –.144     –.017   .006   –.198

Intercepts
Achieve                 –.426   .245   —         –.426   .245   —

Note. These estimates are for model 3 (Table 12.3). Unst., unstandardized; St., standardized. Standardized estimates for error variances are proportions of unexplained variance. Variances, covariances, and means for the exogenous variables (SES, Effort, VIQ) are the sample values in each group (Table 12.1).


TABLE 12.7. Indirect Effects of SES, Test Effort, and Verbal IQ on Delinquency through Achievement in a Multiple-Group Recursive Path Model of Achievement and Delinquency

                           Black                       White
Causal variable    Unst.    SE     St.        Unst.     SE       St.
SES                 .001    .002    .007      < .001    < .001    .002
Effort             –.018    .015   –.014      –.003      .005    –.003
Verbal             –.016    .005   –.132      –.003      .004    –.030

Note. These estimates are for model 3 (see Table 12.3). Unst., unstandardized; St., standardized. Standard errors are Sobel standard errors.

TESTS FOR CONDITIONAL INDIRECT EFFECTS OVER GROUPS

A somewhat unusual feature of the analysis just described is that the unstandardized coefficients for the first stages of indirect effects are constrained to pairwise equality over groups, but the second stages are freely estimated in each sample. Perhaps a more typical scenario occurs when indirect effects are estimated but with no constraints imposed on constituent direct effects. A strategy in this case is to constrain just the unstandardized product estimator for an indirect effect to equality over groups (Ryu, 2015). Suppose that a1 and b1 are the unstandardized coefficients in population 1 for, respectively, the direct effects in the same model

X → W and W → Y

and that a2 and b2 are their counterparts in population 2. Constraining the product estimator for the indirect effect of X on Y through W to equality over groups tests the null hypothesis

H0: a1b1 = a2b2    (12.1)

without also explicitly constraining the direct effects. Some SEM tools, including lavaan and Mplus (Muthén & Muthén, 1998–2017; Rosseel et al., 2023), support the imposition of constraints on researcher-defined parameters such as for indirect or total effects.

One method to formally test the null hypothesis in Equation 12.1 is to fit two nested models to the same data: One model is constrained, where product estimators for the same indirect effect are constrained to equality over groups, but the other model is unconstrained. The test statistic is chiD(1) assuming normality or a scaled chi-square difference statistic adjusted for nonnormality (Topic Box 11.1). This is the likelihood ratio (LR) test method. A second method is the Wald test: Only the unconstrained model is analyzed, and the increase in the model chi-square due to imposing the equality constraint is approximated by the W statistic. The null hypothesis in Equation 12.1 describes a single constraint applied over two groups for an indirect pathway made up of three variables, but both methods can be extended to simultaneously test equality hypotheses over ≥ 2 groups for ≥ 2 indirect effects that each consist of ≥ 3 variables.

Two additional methods to test Equation 12.1 require no constraints (Ryu, 2015): The third is nonparametric bootstrapping, where large numbers of generated samples are randomly selected from the original groups, and the sizes of generated samples are the same as those in the original sample. A bootstrapped sampling distribution for group differences in product estimators for indirect effects is constructed by the computer, and a bootstrapped significance test is applied, including the derivation of bootstrapped confidence intervals for parameter estimates in each group. The fourth method is Monte Carlo simulation assuming that the joint distribution of estimates for unstandardized direct effects is multivariate normal. Empirical sampling distributions for group differences in estimated indirect effects are generated, and confidence intervals within these distributions are derived by the computer.
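A minimal lavaan sketch of the LR and Wald test methods for Equation 12.1 follows; the data frame dat, the grouping variable g, and the path labels are hypothetical:

library(lavaan)

model <- '
  W ~ c(a1, a2)*X
  Y ~ c(b1, b2)*W + X
  ind1 := a1*b1    # indirect effect, group 1
  ind2 := a2*b2    # indirect effect, group 2
'
fit.free <- sem(model, data = dat, group = "g")

# LR test method: refit with the product estimators constrained equal
model.eq <- paste(model, 'a1*b1 == a2*b2', sep = '\n')
fit.eq <- sem(model.eq, data = dat, group = "g")
anova(fit.eq, fit.free)

# Wald test method: only the unconstrained model is analyzed
lavTestWald(fit.free, constraints = 'a1*b1 == a2*b2')

# For the bootstrap method, refit fit.free with se = "bootstrap" and
# inspect the confidence interval for the difference ind1 - ind2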

Ryu (2015) conducted a computer simulation study of the four basic methods just described to determine their relative power and accuracy to detect population differences in indirect effects. Two bootstrap confidence interval methods were studied, percentile and bias-adjusted, so the total number of conditions for methods in Ryu (2015) is five. Other study conditions included magnitudes of differences in indirect effects (including zero), group size (n = 50, 100, 150), whether group sizes are balanced or unbalanced (i.e., equal or unequal), and whether a larger indirect effect is associated with the larger or smaller group in unbalanced conditions. The LR test method performed well over most conditions regarding power and Type I error rate. The Wald test method was generally less powerful than the LR test when group sizes were equal and when a smaller indirect effect was associated with the largest group in unbalanced conditions. Both methods just mentioned generally outperformed the two bootstrap methods, and the Monte Carlo method had the lowest power. An advantage of the LR test and Wald test methods is that no resampling or Monte Carlo simulation is needed for either one (i.e., they are simpler to apply), and their results are deterministic in that they each yield a single estimate, unlike methods based on simulated random sampling (bootstrapping, Monte Carlo).

A recent example of a multiple-group path analysis where direct or indirect effects were compared over independent samples is van Veelen et al. (2019), who reported that the combination of working almost exclusively with male colleagues and working in technical sectors where women are negatively stereotyped predicted higher levels of perceived gender identity threat and less career confidence and work engagement among women with STEM (science, technology, engineering, or mathematics) degrees than among men with similar credentials.

In a multiple-group approach, group membership is not represented in the model (e.g., Figure 12.1). Instead, it is treated in the analysis as a categorical moderator variable when estimates for parameters in the same model, such as for indirect effects, are compared over groups (e.g., Table 12.7). Ryu and Cheong (2017) described an alternative for comparing indirect effects over two groups called the single-group approach. In this method, group membership is represented in the model as a binary exogenous variable (i.e., a single coding variable) that covaries with a continuous causal variable X. A coding variable × X product term is created, and a presumed mediator M is regressed on group membership, cause X, and the product term (e.g., Figure 7.8(a)). The coefficient for the product term estimates whether the relation between cause X and mediator M changes over the groups (i.e., there is interaction). If so, then the indirect effect of cause X on outcome Y through mediator M is conditional because its first stage, or X → M, depends on group. Thus, a single coefficient (for the product term) is used to test for group differences in the indirect effect. There are other variations of a single-group approach that are special cases of conditional process analysis, and these are described in Chapter 20.

A drawback of the single-group approach is that the sole coefficient for estimating differences in indirect effects assumes equal variances over groups. In contrast, there is no comparable requirement in a multiple-group approach because variances can be estimated separately in each group (e.g., Table 12.6). In computer simulations, Ryu and Cheong (2017) compared the accuracy of single-group and multiple-group approaches for estimating differences in indirect effects over two populations. Study conditions included ones where the homoscedasticity assumption was violated. In this case, single-group analysis can generate incorrect results, especially in the LR test and Wald test methods. Performance in single-group analysis under heteroscedasticity was better in bootstrapped confidence interval methods and also in the Monte Carlo confidence interval method, but at the cost of higher Type I error rates. The best combination was the LR test in multiple-group analysis, which generally yielded optimal levels of power and Type I error rates. Given these results, Ryu and Cheong (2017) recommended that researchers first examine estimates for variance parameters in a multiple-group analysis with no constraints to verify homoscedasticity before opting for a single-group analysis.
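To make the single-group approach concrete, here is a minimal lavaan sketch; the variables g (a 0–1 coding variable), x, m, and y in the data frame dat are hypothetical, and the product term is computed before fitting:

library(lavaan)

dat$gx <- dat$g * dat$x    # coding variable x cause product term

model.sg <- '
  m ~ a0*x + a1*gx + g     # a1 estimates the group-by-cause interaction
  y ~ b*m + x + g
  ind0 := a0*b             # conditional indirect effect when g = 0
  ind1 := (a0 + a1)*b      # conditional indirect effect when g = 1
  diff := ind1 - ind0      # group difference in the indirect effect
'
fit.sg <- sem(model.sg, data = dat)
parameterEstimates(fit.sg)   # includes the defined parameters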

SUMMARY

When simultaneously fitting a structural equation model to the data from multiple groups, it is common to impose cross-group equality constraints on certain unstandardized parameter estimates. These parameters could be for causal effects, including direct, indirect, or total effects, or for variances, covariances, means, or intercepts. The choice among them should be related to specific hypotheses about group differences. If the fit of the constrained model is much worse than that of the unconstrained model—and the unconstrained model fits the data reasonably well—we may conclude that the populations from which the groups were selected may differ on the equality-constrained parameter(s). An alternative to multiple-group analysis is to represent group membership in a single-group model that is fitted to the data from all groups combined. Certain endogenous variables in the model are regressed on group and product terms that involve group and other presumed moderator variables. Coefficients for the product terms estimate interactive effects of group with other causal variables, and their values can be converted to estimates of conditional indirect effects. A drawback is that single-group models generally assume homoscedasticity over groups, and estimates can be inaccurate if such assumptions are untenable. In contrast, it is straightforward to test homogeneity assumptions in a multiple-group approach.

LEARN MORE

Results of computer simulations by Ryu (2015) and Ryu and Cheong (2017) about methods to compare indirect effects over groups were briefly summarized, but it is worthwhile to read the whole original works.

Ryu, E. (2015). Multiple-group analysis approach to testing group difference in indirect effects. Multivariate Behavioral Research, 47(2), 484–493.

Ryu, E., & Cheong, J. (2017). Comparing indirect effects in different groups in single-group and multi-group structural equation models. Frontiers in Psychology, 8, Article 747.

EXERCISES

1. Interpret the results in Table 12.6 for the unstandardized direct effects of achievement on delinquency in both groups.

2. Calculate R2 for the endogenous variables in Table 12.6 for both samples. Are these results directly comparable over the groups? Explain your answer.

3. Test the unstandardized coefficients for indirect effects in Table 12.7 for statistical significance in both groups.

4. Interpret in Table 12.7 the unstandardized indirect effect of verbal IQ on delinquency through achievement in the Black sample.

Part III

Multiple‑Indicator Approximation
of Concepts

13

Multiple-Indicator Measurement

Models considered up to this point featured single-indicator measurement where each theoretical variable is
approximated by just one observed variable (i.e., manifest-variable path models). Introduced in this chapter
is the logic of multiple-indicator measurement, where two or more observed variables are used to estimate
the same concept. Multiple indicators may assess a wider range of attributes that correspond to the theoreti‑
cal definition of a concept than a single indicator, and precision (reliability) of empirical approximations for
concepts is generally higher with multiple indicators, if all indicators have adequate psychometric properties.
But estimating theoretical concepts with multiple indicators requires the researcher to face a wider range of
questions and decisions compared with analyzing models with single indicators. Broadly stated, these ques‑
tions involve hypothesized connections between the data from multiple indicators (i.e., their scores) and the
concepts those indicators should approximate. They are summarized next:

1. What are the directions of the relations between indicators and concepts? Specifically, are the
observed scores viewed as the outcomes of those concepts versus the causes of those concepts?
2. How is measurement error handled? That is, will it be taken into account at the level of the indicator,
at the level of the concept, or ignored (i.e., nothing is done)?
3. Given decisions on (1) and (2), how will numerical information from multiple indicators be summa‑
rized or assembled to form a proxy, or a statistical entity built up from data that approximates a
concept?

There are different kinds of measurement models, and they assume different patterns of relations
between concepts and indicators, represent measurement error in different ways, and are associated with
different kinds of proxies. To meaningfully test a theory when multiple indicators are available for some
concepts, the researcher should make intelligent decisions about measurement models on which to base
the analysis. The opposite involves making uninformed choices or, even worse, blindly relying on defaults
in a particular computer tool to implement these decisions for the researcher. Thus, the main goal of this
chapter is to make you more aware of the rationale, assumptions, and contexts for specifying measurement
models in SEM. I hope this presentation better prepares you for learning about specific SEM techniques for
multiple-indicator measurement described in upcoming chapters in Part III of this book. Next, we consider
two frameworks for linking data with concepts through the specification of measurement models.

I thank Edward Rigdon for his invaluable comments on an earlier version of this chapter. Any remaining limitations in this presentation
are clearly my own.


CONCEPTS, INDICATORS, AND PROXIES

Bagozzi and colleagues described holistic construal, which is an eclectic, iterative approach to theory development, testing, and refinement that is neither strictly deductive nor confirmatory (Bagozzi, 2011; Bagozzi & Phillips, 1982; Bagozzi & Yi, 2012). The model connects theory and hypotheses at higher levels of abstraction with entities at lower levels of abstraction, ending with empirical observations (data), also called the observational sector (Schaffner, 1969). At the highest level is a theoretical concept, or a verbal (text-based) definition (Maraun & Halpin, 2008). It specifies expected features or attributes of the target phenomenon, and this definition could be embedded within a larger nomological network in which relations among multiple concepts are outlined.

A derived concept in holistic construal falls at a lower level of abstraction than a theoretical concept. Although unobserved (i.e., latent), a derived concept is directly linked with empirical concepts, or observables (i.e., data) (Bagozzi & Phillips, 1982). Henseler (2021, p. 3) defines constructs in a similar way: They are "statistical variables that are not by themselves observable, but can be mathematically inferred from observable variables." Constructs also involve hypotheses about how indicators should covary in studies of individual differences or how they will be similarly affected by experimental manipulations (Nunnally & Bernstein, 1994).1

Rigdon's (2012) concept proxy framework represents two basic elaborations beyond holistic construal: First, there is greater emphasis on the difference between a theoretical concept and an empirical approximation of that concept. The approximation is a proxy, which essentially replaces the idea of an intermediate latent variable in holistic construal, or a derived concept, that lies between theoretical definitions and data. This substitution emphasizes that proxies merely approximate concepts; that is, they are empirical stand-ins for concepts, but proxies and concepts are not identical. At most, a proxy is a replacement variate generated by combining or weighting scores from multiple indicators in optimal ways that essentially replace the original indicators as the focus of the analysis and interpretation of the results (Maraun & Halpin, 2008).

Second, Rigdon's (2012) emphasis on proxies also reminds us to be more careful about language, especially about the term "latent variable model," which is widely used in the SEM literature (including also by me earlier in this book). Specifically, our models fitted to sample data do not actually have latent variables in them, only proxies for target concepts. It is also important to avoid a logical error about names assigned to proxies in our models. The error is the naming fallacy or nominalist fallacy (Cliff, 1983): Just because a proxy is named does not mean that the corresponding concept is understood or even correctly labeled. Proxies require some type of designation, though, if for no other reason than communication of the results. Although abstract symbols such as η or ξ from LISREL notation are an option, verbal labels are more user friendly and represent the effort to interpret statistical results for proxies. But meaningful proxy names should be viewed as conveniences or hypotheses and not as substitutes for critical thinking.

Presented in Figure 13.1 is a graphical illustration of the concept proxy framework for a single concept. Symbols in the figure are arranged along a dimension that connects data at the bottom to increasingly abstract entities ending with the definition of a focal concept at the top of the figure. That definition may refer to other concepts that are part of the same theory, some of which may be described as antecedent to the focal concept and others assumed to be consequent concepts. Related concepts are not shown in the figure, but each would have a similar graphical representation in the concept proxy framework as depicted in the figure.

FIGURE 13.1. Rigdon's (2012) concept proxy framework. (Levels from top to bottom: Concept definition; Concept; Proxy; Indicators [Data].)

1 Michell (2013) reminded us that theory can be inferred from observation; that is, it is a myth that data must always follow theory.

The second symbol from the top in Figure 13.1—the oval rendered in double lines—represents the concept as it exists in the real world.2 There are variations on the status of this existence. One corresponds to a scientific realist perspective that concepts exist independent of observation and have causal agency on the manifest variables that are supposed to measure them (Rigdon, 2016). For example, intelligence is expected to drive behavior; specifically, persons with varying levels of intelligence should behave in different ways in certain situations, including attaining different scores on measures of reasoning, memory, fund of factual information, or other cognitive abilities. In this way, intelligence is viewed as a causal entity.

2 E. Rigdon (personal communication, April 27, 2020).

Concepts can also exist purely in the minds of researchers as organizing principles for understanding a particular problem, or what Einstein (1916/1997) described as a “necessity of thought” or an “a priori given” among scientists. For example, Davidson and White (2007) described recovery as the organizing principle for an integrated model of treatment delivery for persons with mental health or addiction problems. Recovery occurs or does not occur as a function of measured variables that include treatment efficacy or duration, patient background, or level of community services posttreatment, among others, but recovery per se is an outcome, not a cause. This perspective underlies the scientific empiricist view that (1) reality corresponds to what is directly observable, and (2) concepts are nothing more than labels for particular kinds of regularities among measured variables (Rigdon, 2016). That is, concepts have no causal agency from this perspective.

In Figure 13.1, the lines that connect indicators with their proxy are nondirectional. They represent correspondence rules, also called C-rules or epistemic correlations (Schaffner, 1969), that associate concepts with their indicators. But correspondence rules also specify the presumed nature and direction of the causal relations between a concept and its indicators. The two aspects together, concept–indicator associations and hypotheses about directionality, describe a formal measurement model. There are three basic kinds: reflective, causal–formative, and composite (Henseler, 2017). Of the three, reflective measurement is probably the most familiar to researchers in psychology and education, especially those trained in classical measurement theory. Types of measurement models are described next.

REFLECTIVE MEASUREMENT AND EFFECT INDICATORS

A reflective model for a set of three indicators is depicted in Figure 13.2(a). It represents the hypothesis that each indicator has two unrelated causes, the concept the indicator is supposed to measure and an error component that represents unique variance that is not explained by the concept. Unique variance has two parts: random measurement error and specific variance, which is systematic but not explained by the concept. The rest of indicator variance, called common variance, is systematic and shared with the concept.
FIGURE 13.2. Measurement models for a set of three indicators: reflective measurement with effect indicators (a), causal–formative measurement with causal indicators (b), and composite (composite–formative) measurement with composite indicators (c). L, latent; M, manifest; C, composite. (Panel diagrams: a, L → M; b, M → L; c, M → C.)
Grace and Bollen (2008) used the term L → M block (latent to manifest) to describe the “flow” of causation from concept to indicators in Figure 13.2(a). Observed variables in reflective measurement models are called effect indicators or reflective indicators because they are specified as outcomes of concepts (and also of their own error terms). Other assumptions of reflective measurement are listed next (Bollen & Bauldry, 2011; Rhemtulla et al., 2020): Effect indicators for the same concept

1. have a rationale or a conceptual unity that all correspond to the same domain (i.e., there is a theory that links the indicators as measures of a common quantity);

2. their levels rise or fall in value together, given a corresponding change in the underlying concept;

3. are internally consistent as a set, which means that their intercorrelations are positive and at least moderately high in magnitude (e.g., > .50 for continuous variables);

4. can be substituted for one another without appreciably affecting the interpretation of the concept;

5. are locally independent, or uncorrelated controlling for their common cause (the concept), if their errors are independent; and

6. contribute only what they share (i.e., common variance) to approximation of the concept.

The representation of a concept as a latent variable with effect indicators only is called the common factor model. The model’s name also describes the proxy (i.e., a common factor) in factor analytic techniques that approximate latent variables from whatever is common among their indicators while allocating anything not shared among those indicators to the error terms (Rhemtulla et al., 2020). A benefit of the common factor model is that indicator measurement error variance is removed from common factors. As a result, estimates of relations among concepts are generally unbiased in the common factor model, if the population measurement model is reflective. Another benefit is that assumptions of common factor models are generally testable in the data. For example, the local independence assumption is testable through the derivation of vanishing partial correlations (i.e., conditional independencies) between indicators that depend on the same factor, among other ways, to evaluate the adequacy of reflective measurement models described in the next chapter.

A drawback is that assumptions of the common factor model are at odds with the motivation to select indicators each of which measures a distinct facet or aspect of the target concept. Suppose that leadership is defined as the combination of the abilities to teach, listen, challenge, inspire, and solve problems while not complaining (Tobak, 2015). Measures of each facet just listed are selected. A common leadership factor would represent only what is shared among these indicators, not anything uniquely measured by any one of them. Thus, if a concept is viewed as having distinct facets, such as leadership in this example, the common factor model is not an optimal choice.

Here is a second example: Adolescents and their parents and teachers each complete a questionnaire about adolescent social skills. The goal is to capture differences in perspectives over the three informants, all reporting on the same thing. If adolescent, parent, and teacher reports are specified as effect indicators, then any unique variance due to a particular type of informant would not contribute to a common social skills factor. Reflective measurement is again not the best choice, given the goal just stated. Fortunately, there are alternatives to the specification of reflective measurement, as outlined next.
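For readers who fit such models in R, here is a minimal sketch of a single-factor reflective specification, assuming the lavaan package and hypothetical indicator names based loosely on the leadership example; the chapter itself does not prescribe this code:

```r
# A minimal sketch, assuming the R package lavaan and a data frame
# named d with one column per indicator (all names are hypothetical).
library(lavaan)

model <- '
  # "=~" specifies effect (reflective) indicators: the common factor
  # is modeled as a cause of each measure, so only shared variance
  # contributes to the proxy for leadership
  Leader =~ teach + listen + challenge + inspire + solve
'
fit <- cfa(model, data = d)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```

Because only common variance defines the factor in this specification, exactly the facet-specific variance highlighted in the two preceding examples would be relegated to the error terms.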

CAUSAL–FORMATIVE MEASUREMENT AND CAUSAL INDICATORS

A causal–formative model is represented in Figure 13.2(b). It is described as an M → L block (manifest to latent) (Grace & Bollen, 2008) because the latent variable is represented as the outcome of its causal indicators plus a disturbance term that represents unexplained variation in the proxy for the concept. There are times when it makes sense to view a latent variable as caused by its indicators. For example, income, education, and occupational prestige, among other variables, determine a person’s standing on a latent socioeconomic status (SES) dimension, not the other way around. That is, SES is the outcome of the variables just listed, not their cause. For this reason, some authors, such as MacKenzie et al. (2011), described the construct represented in Figure 13.2(b) as a composite latent construct, which emphasizes that the construct in a causal–formative model basically summarizes its indicators (plus unexplained variation) instead of determining them as in a reflective model.

Other features of causal–formative measurement models are listed next (Henseler, 2017); see also Figures 13.2(a)–13.2(b):

1. Causal indicators for the same concept are exogenous variables that are free to vary and covary. In contrast, effect indicators are endogenous, so their variances and covariances are not free parameters. Thus, a reflective model explains the variances and covariances of effect indicators, but a causal–formative model explains neither variances nor covariances among a set of causal indicators.

2. Measurement error in effect indicators is represented at the level of measured variables (i.e., their error terms), but measurement error in causal indicators is manifested at the construct level. This means that as scores on causal indicators become less precise, unexplained variation in estimates for the target latent variable, or disturbance variances, increase. Dropping a causal indicator may also increase measurement error at the construct level.

3. There is no requirement for internal consistency among causal indicators of the same construct. This means that they can covary in any pattern whatsoever, positive, negative, or even zero. What unifies a set of causal indicators is the researcher’s hypothesis that they affect the same domain or concept.

4. But omitting a causal indicator that covaries with measured causal indicators may lead to bias. This can happen because the omission of such an indicator changes the empirical definition of the construct. So unlike effect indicators, causal indicators are not generally interchangeable.

A stumbling block to estimating causal–formative measurement models in traditional (i.e., covariance-based) SEM is identification. For example, Figure 13.2(b) with causal indicators only is not identified. Specifically, its parameters could be estimated only if (1) Figure 13.2(b) were embedded within a larger model where (2) the latent variable with causal indicators only is specified as directly affecting at least two other endogenous variables. In contrast, the reflective model with effect indicators only in Figure 13.2(a) is identified, so it could potentially be analyzed as a “stand-alone” model. Specific requirements are described in the next three chapters for identifying models in which conceptual variables are approximated with sets of multiple indicators. But I can say now that specification of a measurement model has implications for identification in SEM just as specification does for a structural model (e.g., recursive or nonrecursive, Rule 7.4).

COMPOSITE MEASUREMENT AND COMPOSITE INDICATORS

A composite model, also called a composite–formative model (Henseler, 2021), with composite indicators is represented in Figure 13.2(c). It differs from a causal–formative model in two ways: (1) the proxy for the theoretical variable in a composite model is a composite, or a linear combination of its indicators, and (2) because a composite is not latent, it has no disturbance. In fact, constraining the disturbance variance to zero in the causal–formative model in Figure 13.2(b) generates the composite model in Figure 13.2(c) (i.e., the two models are hierarchically related). Grace and Bollen (2008) described a composite model as an M → C block (manifest to composite). These authors also represent composites in diagrams as hexagons, which are also used in Figure 13.2(c). A hexagon is not a standard symbol in the literature, but it conveys the fact that a composite is just an observed variable, albeit “assembled” from ≥ 2 observed variables in the model (the composite indicators).

Listed next are the features of composite models (Henseler, 2017); see also Figures 13.2(a)–13.2(c):

1. The indicators make up the concept; that is, the meaning of the concept is derived entirely from its components, if it can be assumed that a set of composite indicators adequately represents the target domain.

2. Just as for causal indicators, composite indicators are exogenous variables that are free to vary and covary.

3. Relatively high correlations within a set of composite indicators may be expected but are not required.

4. Dropping or adding an indicator can change the meaning of the composite, so composite indicators are not generally interchangeable.
5. Measurement error in composite indicators is not represented in the model, either at the construct level or at the level of measured variables. This means that composites constructed from error-prone observed variables contain measurement error, too.

The last point (5) implies that if the true population model is reflective but proxies for latent variables are composites with measurement error instead of common factors, then correlations between composites will inconsistently estimate correlations between factors. This means that, in all likelihood, the estimates fail to progressively converge toward the true values with increases in sample size, which can increase both Type I and Type II errors in model testing (Henseler, 2021). The degree of bias in composite-based results seems to decrease when population reflective models have at least four indicators per factor, all with relatively high standardized factor loadings, such as > .70, or as correlations among indicators increase (Hair et al., 2022; Rhemtulla et al., 2020). There are also relatively new methods for consistently estimating population reflective models with composites that are described in Chapter 16.

Henseler and Schuberth (2020) described the role of composites as proxies for emergent variables, which refer to concepts that are forged, or come into being through invention to solve problems. Emergent variables do not typically refer to natural phenomena—they are instead abstractions of objects, either actual or conceptual, with a purpose or intentional design. That is, emergent variables are built up from artifacts, or things that can be assembled into manufactured objects or abstract processes, such as software, indexes or guides, policies, or design principles. In this way, an emergent variable is a composite of variables that act as a whole, not as a mere “heap of parts” (Henseler, 2021, p. 36). This means that (1) a set of composite indicators should be independent of all other variables in the model, given their common proxy, and (2) the ratio between indicators’ correlations with other variables in the model should remain constant. These expectations are testable implications in the data.

The ideas about emergent variables as proxies for forged concepts in composite measurement models just reviewed may be relatively novel for researchers trained in classical measurement theory, but there are specific rhymes, reasons, and rationales to approximating emergent variables instead of, or perhaps in addition to, latent variables. There are also potential advantages for classically trained researchers (like myself). For example, in contrast to common factor models, for which anything uniquely measured by individual effect indicators is discarded (counted as error), such features are retained in the analysis of composite indicators. Thus, the goal of assembling a set of indicators, each of which represents a disparate facet or element of a concept, is directly supported in composite-based approaches to the analysis. Of course, composite models are not panaceas in that they present their own challenges. One is identification: The composite model in Figure 13.2(c) is not identified unless it is included as part of a larger model where its composite has direct effects on other variables. (Recall that the same is true for Figure 13.2(b).) The identification and analysis of composite models is described in Chapter 16.
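As a rough sketch of the identification requirement just noted, a composite can be written in lavaan syntax for R with the “<~” operator; this is an illustrative assumption on my part (the variable names are hypothetical, and the dedicated composite methods of Chapter 16 involve different workflows):

```r
# A sketch only: lavaan's "<~" operator forms a composite from its
# indicators (no disturbance). Data frame d and all variable names
# are hypothetical.
library(lavaan)

model <- '
  SES <~ income + education + prestige

  # the composite must emit direct effects on other variables
  # (here, two outcomes); otherwise the model is not identified
  health  ~ SES
  lifesat ~ SES
'
fit <- sem(model, data = d)
summary(fit)
```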
MIXED-MODEL MEASUREMENT

Depicted in Figure 13.3 is mixed-model measurement, also called a multiple-indicators and multiple-causes (MIMIC) model (Jöreskog & Goldberger, 1975). Its factor has both effect indicators and causal indicators, so the whole model is actually a combination of a reflective model and a causal–formative model (see Figures 13.2(a)–13.2(b)). The basic hypothesis for an MIMIC model is that the latent variable is a cause of its effect indicators while simultaneously it is an outcome of its causal indicators. For example, Hershberger (1994) described how the measurement of depression could be viewed from a mixed-model perspective: Some indicators, such as “feeling sad,” “loss of appetite,” and “insomnia,” can be understood as symptoms; that is, they are outcomes of being depressed. But a different indicator, “feeling lonely,” might cause depression rather than vice versa, such as when social isolation or withdrawal precedes the onset of depression.

FIGURE 13.3. Mixed-measurement model with both causal indicators and effect indicators, also called a MIMIC (multiple indicators and multiple causes) model.
An MIMIC factor like the one in Figure 13.3 is always endogenous, so it has a disturbance. Random measurement error in its two effect indicators, also endogenous, is manifested in their error terms. The bivariate correlation between the effect indicators should be positive and at least moderately high in value. The partial correlation between the two effect indicators should vanish when controlling for their common cause, the MIMIC factor, and only their common variance contributes to approximation of the factor. The same two indicators are also viewed as interchangeable with other effect indicators for the same factor, just as in reflective measurement models. In contrast, measurement error in the two causal indicators in Figure 13.3 is manifested in the factor disturbance, and both their common and unique variances contribute to approximation of the factor. As in causal–formative models, cause indicators in an MIMIC model are not viewed as interchangeable, and they can have any pattern of bivariate correlations whatsoever.

An MIMIC model can be intuitively described as regressing the common variance of the effect indicators (i.e., the parts they contribute to the proxy) on the causal indicators, and the factor disturbance can be seen as shared variance over the effect indicators that is not explained by the causal indicators (Wilcox et al., 2008). The key to specifying an MIMIC model is that all its indicators, both effect and causal, are understood as, respectively, measuring the same thing or causing the same thing. Such a rationale provides the justification for viewing Figure 13.3 as a complete measurement model for a single latent construct for which all its indicators, causal or effect, belong to the same domain—see MacKenzie et al. (2011) and Wilcox et al. (2008) for more information about the logic of MIMIC models. See also Bollen and Bauldry (2011), who differentiated covariates from causal indicators and composite indicators. Briefly summarized, covariates are often included in the model to control for omitted variable bias (Chapter 6), but their effects are viewed as distal compared with those of cause or composite indicators on the same construct. This is because covariates do not generally belong to the same domain as construct indicators.

An advantage of an MIMIC model like the one in Figure 13.3 is that it is identified as a stand-alone model. Thus, both the reflective model in Figure 13.2(a) and the MIMIC model in Figure 13.3 require no additional variables for identification. In contrast, the two formative measurement models in Figures 13.2(b)–13.2(c) are not identified without adding variables (or embedding them in larger models, which does the same thing). Also, the MIMIC model in Figure 13.3 represents one method for identifying the causal–formative model in Figure 13.2(b) with a composite latent construct: If two effect indicators were added to the causal–formative model, it would resemble the MIMIC model in Figure 13.3, and thus Figure 13.2(b) would be identified with this change.
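In lavaan syntax for R, such a MIMIC specification might be sketched as follows; the software choice is my assumption, and the variable names are hypothetical stand-ins for the depression example:

```r
# A sketch, assuming the R package lavaan and a hypothetical
# data frame d.
library(lavaan)

model <- '
  # reflective part: two effect indicators (symptoms) of depression
  Dep =~ sad + appetite

  # formative part: the same factor is regressed on its causal
  # indicators, so Dep is endogenous and retains a disturbance
  Dep ~ lonely + isolation
'
fit <- sem(model, data = d)
summary(fit, standardized = TRUE)
```

The same factor appears on the left side of both operators, which mirrors the dual role of a MIMIC factor as cause (of its effect indicators) and outcome (of its causal indicators).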
CONSIDERATIONS IN SELECTING A MEASUREMENT MODEL

The considerations in specifying reflective or formative measurement (causal–formative, composite) for sets of multiple indicators are discussed next. Rhemtulla et al. (2020) reviewed SEM studies published in six major psychology journals around August 2016. They found that common factor models were specified in most analyses without explicit justification or even when the assumptions of reflective measurement were implausible. Jarvis et al. (2003) reviewed 178 articles published in four marketing research journals over the years 1982–2000, and they found that 28% of measurement models were incorrectly specified as common factor models, given descriptions of concepts in article text. Part of the problem may be the misunderstanding that a common factor represents a combination of its effect indicators, when just the opposite is true (i.e., common, not unique, variance is analyzed). Another possibility is that psychology researchers trained in classical measurement theory were (1) relatively less familiar with formative measurement but, to their credit, (2) probably understood that a common factor model adjusts for measurement error in the indicators. But Rhemtulla et al. (2020) noted that the simple presence of measurement error is insufficient to justify the default specification of reflective measurement, as explained next.

Through both rational examples and reanalyses of extant data, Jarvis et al. (2003) and Rhemtulla et al. (2020) described how specification of reflective measurement, when the true model is formative measurement, can result in severely biased estimates even when the model closely fits the data. To summarize:
1. Estimates of absolute bivariate correlations between variables in the structural part of a larger model are generally too high. This happens because common factor models assume that unique indicator variance is not part of the corresponding concept, so excluding that unique variance has the effect of falsely inflating estimates of shared variance among other variables in the model. Such positive bias can be propagated through the model in complex patterns that result in both over- and underestimation of direct effects.

2. The degree of bias increases as the proportion of unique indicator variance increases, which has the effect of lowering correlations among indicators. But bias decreases as correlations among indicators increase (i.e., there is less unique variance) so that the choice between a reflective model and a composite model becomes less important.

3. Both the degree and direction of bias depend on which other variables are also in the model. For example, parameter estimates for common factors with predictors and outcomes in the rest of the model were very different than when only outcomes for the same factors are included in the model.

4. Biased parameters were not always detected by global model fit statistics. In extreme cases, fit can be perfect yet bias can be severe. In general, degree of bias and degree of misfit were unrelated. Thus, statistical analysis is generally unable to detect the “correct” measurement model, reflective versus formative.

Sarstedt et al. (2016) found in computer simulation studies that when a composite model is true in the population, estimates that assumed reflective measurement were markedly biased in all simulation conditions. Results were more nuanced when a common factor model is true: For models with just two indicators or when standardized factor loadings were relatively low, estimates that assumed a composite model were appreciably biased. But in larger samples (e.g., N ≥ 500) for models with more indicators (e.g., ≥ 4), mean absolute errors were generally comparable across common factor models versus composite models. In small samples of N = 100, though, estimates from both types of measurement models were equally poor, again when a common factor model is true. On average, the amount of bias was about 11 times greater when fitting a common factor model to data generated from a composite model compared with fitting a composite model to data generated from a common factor model.

Chang et al. (2016) conducted simulation studies where (1) both reflective models and formative models for the same indicators have perfect fit in the population and (2) results were compared between them, which does not give a privileged position to either type of model. Their simulation scenarios represented the existence of multiple viable specifications where treating measures as reflective is just as likely to be successful as treating measures as formative. This balance includes the possibility that researchers might appropriately model a set of indicators as either reflective or formative. For example, indicators that satisfy the more restrictive assumptions of reflective measurement can also be compatible with the less stringent assumptions of formative measurement. The results indicated that reflective models generated somewhat less biased estimates with greater statistical power than formative models, especially in smaller sample sizes (N = 250). Chang et al. (2016) also reported that standardized estimates were generally more accurate than unstandardized estimates for both types of models. Overall, when indicators are strongly correlated with each other and model fit is reasonably good, the difference between reflective models and composite models shrinks, so the choice of model becomes much less critical.

In some research areas, such as managerial accounting, archival data play important roles (Nitzl & Chin, 2017). For two reasons, though, the analysis of such data might be challenging from a reflective measurement perspective: First, the development of common factor-based measurement models can require multiple rounds as questionnaires or measures are revised and updated, but a similar opportunity to refine or adjust indicators may not exist for archival data (Rigdon, 2013). Second, a variable defined in theory as a key effect indicator for a particular theoretical concept may not exist in archival data. There could be post hoc alternatives, such as financial statements or other institutional documents about business practices or policies, but the statistical properties of such data could be inadequate for the common factor model. The use of composite methods that analyze formative measurement models is a more flexible option in this context.

CAUTIONS ON FORMATIVE MEASUREMENT

Specification of formative measurement is not a magical alternative to reflective measurement. Indeed, the merits of formative measurement have been intensely debated in the literature.

mentioned include the fact that some concepts are as part of a goodness-of-fit test in the method of con-
really forged, or they are the outcomes of their indica- firmatory composite analysis (CCA), a composite ana-
tors, not the reverse. Another is that somewhat smaller log to the technique of CFA for reflective measurement
samples may be needed for adequate statistical power models, and Bentler and Huang (2014) described chi-
when analyzing composite models even when the true square goodness-of-fit tests for composite methods.
model is reflective (Reinartz et al., 2009). But there are Other options for global fit testing of composite models
possible drawbacks, too. Summarized next are some are described in Chapter 16.
major criticisms of formative measurement—see the
works cited next for more detailed presentations:
ALTERNATIVE MEASUREMENT MODELS
1. Because causal indicators and composite indica- AND APPROACHES
tors are exogenous, their variances and covariances are
not explained by a measurement model. This charac- Specification of measurement as reflective, forma-
teristic makes it more difficult to assess the construct tive, or mixed with multiple indicators as proxies for
validity of such indicators (Edwards, 2011), but Bollen hypothetical constructs are the not the only options in
(2011), Hair et al. (2022, chap. 5), and Markus (2018) SEM. Described next are additional types of indicators
offered suggestions. and models, some of which do not assume or deal with
latent variables or composite latent constructs at all.
2. The absence of the requirement for internal con-
This overview is necessarily brief, but see the works
sistency in formative measurement implies the risk that
cited next for more information.
multiple dimensions are aggregated into a single com-
posite, which can be difficult, if not impossible, to inter-
pret (Wilcox et al., 2008). This is because a proxy that Reactive Measurement
links heterogeneous causes (its indicators) to distinct and Reactive Indicators
outcomes can become a “conceptual polyglot with no
In reactive measurement, the act of measurement
clear interpretation of its own” (Edwards, 2011, p. 379).
itself changes the characteristic being assessed. The
That causal indicators should have a strong rationale as
Hawthorne effect, where people modify their behavior
belonging to the same domain might allay some con-
in response to being observed, is an example of reactive
cern about this issue (Bollen & Diamantopoulos, 2017).
measurement. Another is pretest sensitization, where
3. The lack of the expectation for internal consis- administration of a pretest alerts participants in an
tency can also lead to misunderstanding. Suppose that experiment to the target variable such that their post-
a researcher observes low correlations among a set of test scores are affected. There is evidence over a wide
observed variables and concludes that they must be range of behavioral, emotional, and cognitive variables
formative indicators (causal or composite). This deci- that people asked to complete psychological measures
sion is not justified because low intercorrelations in this are altered by the experience, and effect sizes can range
case could merely signal poorly constructed measures from trivial to more substantial in ways that are not
(Edwards, 2011). always well understood (French & Sutton, 2010).
Hayduk et al. (2007) described reactive indicators,
4. Measurement error in composites can result in
which are assumed to simultaneously cause and affect
biased estimates when the true population model is
latent variables. Thus, a reactive indicator is both reflec-
reflective, but there are special methods for analyzing
tive and formative. In model diagrams, the symbol for
composites described in Chapter 16 that can adjust esti-
a direct feedback loop () connects a reactive indica-
mates of structural correlations or other parameters for
tor with its proxy, and both variables would have error
measurement error.
terms because each variable in a causal loop is endog-
5. Until recently, there were few inferential tests enous. Conceptually, an external input on a latent vari-
of global fit for composite models, which made it dif- able with a reactive indicator cycles repeatedly through
ficult to simultaneously test overidentifying constraints the feedback loop just described. With each cycle, the
for the whole model (Antonakis et al., 2010; Rönkkö magnitude of the effect arriving back at the latent vari-
& Evermann, 2013). But this situation is changing. For able from a reactive indicator diminishes. Beyond a
example, Henseler et al. (2014) introduced the SRMR few cycles, such as 4–5 for smaller reciprocal effects

Beyond a few cycles, such as 4–5 for smaller reciprocal effects or more cycles for larger effects, an asymptotic limit is reached, and this limiting effect would be transmitted to outcomes (if any) of the latent variable associated with a reactive indicator.

Hayduk et al. (2007) fitted models with a single reactive indicator to data from a study where participants had been deceived into believing a task was either easier or harder than it really was. The task involved holding an isometric sitting position above a stool without touching it. For participants prompted to believe falsely that the task was very difficult, self-reported endurance increased true (latent) endurance. Perhaps something about the experience of unexpected success, such as a momentary boost in self-confidence, improved endurance. The opposite effect was found for participants falsely told the task was very easy: Self-reported endurance decreased true endurance after discovering that the task was harder than participants anticipated. Perhaps negative self-evaluations stemming from the perceived failure to easily perform the task hindered endurance.

Because reactive measurement models are nonrecursive, they are more challenging to identify than common factor models with effect indicators only. Requirements for identifying nonrecursive models with causal loops are outlined in Chapter 19, but it can be said now that one strategy involves specification of causally-prior variables as instruments with direct effects on just one of two variables in a direct feedback loop. Another is to impose constraints on certain parameters for a causal loop. The reactive measurement models described by Hayduk et al. (2007) were identified using both methods just mentioned.

C-OAR-SE Method

The acronym for Rossiter’s (2011) C-OAR-SE method represents its six basic steps, which are Construct definition, Object representation, Attribute classification, Rater-entity identification, Scale (item type and answer format) selection, and Enumeration (scoring). It abandons reflective measurement and related statistical techniques, coefficients, or methods, including factor analysis (both EFA and CFA), Cronbach’s alpha, and multitrait–multimethod (MTMM) analysis. Instead, methods for content validation, which concern whether item content comes close to semantic identity with the target domain, are emphasized. Recall that content validity is established through rational analysis of item content, given a priori definitions or arguments about relevant component behaviors, skills, or knowledge (Psychometrics Primer). Content validity in C-OAR-SE also includes answer–scale validity, or whether the response format is free from semantic confusion when participants give answers. Predictive validity, convergent validity, and discriminant validity are all de-emphasized in favor of developing content-valid measures.

Other features of C-OAR-SE are summarized next (Rossiter, 2011):

1. Single-item measures are highlighted over multiple-item measures in the evaluation of beliefs or perceptions, which are considered as basic or concrete concepts with a single meaning. Abstract concepts, which have more than one meaning, are generally viewed as aggregations of concrete beliefs or perceptions that follow formative measurement.

2. In questionnaire development, adding or dropping items based on increasing coefficient alpha is rejected, especially if emphasizing internal consistency jeopardizes content validity.

3. Score stability is assessed through short-term test–retest analysis, and score precision is based on confidence intervals that reflect the sample size. Also, the evaluation of score reliability follows the establishment of content validity in C-OAR-SE, not the reverse.

The focus on content validity in C-OAR-SE is a strength: If a measure’s items are not representative of the target domain, no other psychometrics can salvage it (Rossiter, 2011). But there are potential shortcomings with other features of the method. For example, use of single-item measures may be a reasonable choice when the target domain is very concrete, there is near-unanimous consensus among participants about what is being measured, sample sizes are small (e.g., N < 50), multiple items are very homogeneous (e.g., inter-item correlations > .80) and semantically redundant, and weaker effects are expected (e.g., correlations with criterion < .30); otherwise, multiple-item measures tend to perform better (Diamantopoulos et al., 2012). Measures “approved” by experts as being content valid are not automatically gold standards without further evidence; that is, psychometrics still matter in test evaluation (Salzberger et al., 2016). There are research problems for which the selection of a measurement model other than C-OAR-SE is justified (Rigdon et al., 2011).
Indeed, a point of this discussion is that researchers should not make default, one-size-fits-all choices about measurement models.

Network Models (Causally-Linked Observed Variables)

In a network model, there are no proxies for latent variables, latent composites, or emergent variables; indeed, there are no proxies whatsoever that combine multiple indicators into a single approximation of a concept. Instead, observed variables are viewed as autonomous causal agents that affect each other in a network that comprises a system of interacting variables. Unlike reflective measurement, where concepts cause manifest variables, or formative measurement, where concepts are constructed from observables, studying a concept in a network perspective means evaluating the function of its indicators in a dynamical system (Schmittmann et al., 2013). Networks are understood as receiving inputs from external variables, such as stressful life events, that can shift a system from equilibrium to a perturbed state, such as from a healthy emotional state to a depressed state.

A network perspective on psychopathology is quite different compared with more conventional measurement models. For example, depression in reflective measurement is considered a unidimensional latent variable that causes its symptoms, such as lack of sleep, fatigue, poor concentration, or sadness, among others. Apart from their common cause, these symptoms have no causal relations between them (e.g., Figure 13.2(a)). In formative measurement, depression is constructed from its symptoms, which vary and covary, but again there is no causal effect between any pair of indicators (e.g., Figures 13.2(b)–13.2(c)). In a network model, depression is represented as a system of causal effects among the symptoms themselves (Schmittmann et al., 2013). For example, a stressful life event could cause lack of sleep, which in turn leads to fatigue and then next to poor concentration or negative mood states. Thus, the experience of depression arises from external shocks to a network of causally-linked symptoms in this perspective (Borsboom, 2017).

Diagrams of network models resemble those for manifest-variable path models. Patterns of direct or indirect effects between nodes, or variables in a network, can be either recursive or nonrecursive (respectively, without or with causal loops). Note that network diagrams are sometimes presented with nondirectional lines that connect nodes, and such lines represent variable associations instead of directional causal effects (e.g., Borsboom, 2017, p. 8). There are special statistical methods and software tools for analyzing network models. For example, centrality coefficients estimate the relative importance of individual nodes based on their influence on other nodes. These coefficients can reflect either local influence over just a part of the network or global influence over the whole network. Borsboom et al. (2021) summarized recent criticisms of centrality coefficients as estimates for causal effects, such as when interactions occur at different timescales or when peripheral network nodes are important for determining system behavior. Thus, the interpretation of centrality coefficients as measures of causal dynamics may be questionable, and more work is needed in this area. Clustering methods identify subsets of nodes with stronger internal associations with one another and weaker associations with other, more external nodes. In social network analysis, people with higher centrality values are seen as relatively more influential, and clusters correspond to subgroups of people with close-knit relationships—see Clifton and Webster (2017).
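For readers who want to experiment, the sketch below assumes the R packages bootnet and qgraph (not discussed in this chapter) and a hypothetical data frame of symptom ratings:

```r
# A sketch, assuming the R packages bootnet and qgraph; the data
# frame symptoms (one column per symptom rating) is hypothetical.
library(bootnet)
library(qgraph)

# estimate a regularized partial correlation network; its edges are
# nondirectional variable associations, not directional causal effects
net <- estimateNetwork(symptoms, default = "EBICglasso")
plot(net)

# centrality indices for the relative importance of individual nodes;
# interpret as causal dynamics only with the cautions noted above
centralityPlot(net)
```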
SUMMARY

Researchers familiar with traditional SEM techniques generally specify reflective measurement for multiple indicators of the same concept. But that choice should not be routine, or blindly selected without considering whether the assumptions of reflective measurement are plausible. For example, only variation shared among multiple indicators contributes to their proxy, a common factor. This poses a problem when each indicator is intended to measure a distinct aspect or facet of a concept: Any variance that is unique (not shared) is shuffled off to indicator error terms in a reflective measurement model. Formative measurement is an alternative where potentially all indicator variance (except random measurement error) contributes to approximation of hypothetical concepts. But formative measurement models are not cure-alls without potential pitfalls of their own. One is that it is more challenging to establish indicator validity in formative measurement models than in reflective measurement models. Another is that the consequences of omitting a formative indicator are generally more serious than the same specification error for reflective indicators. Specification of a measurement model should be guided by substantive considerations, not by fitting alternative models to the data to find the one with the best fit.
One reason is that a reflective model versus a composite model could explain the same data nearly as well (i.e., they could be nearly equivalent models). Analysis of reflective measurement models in the technique of CFA is the subject of the next chapter.

LEARN MORE

Bollen and Diamantopoulos (2017) address criticisms of formative measurement that fail to distinguish between causal indicators and composite indicators, Rhemtulla et al. (2020) describe misuses of the common factor model, and Rigdon (2012) outlines the concept proxy framework.

Bollen, K. A., & Diamantopoulos, A. (2017). In defense of causal–formative indicators: A minority report. Psychological Methods, 22(3), 581–596.

Rhemtulla, M., van Bork, R., & Borsboom, D. (2020). Worse than measurement error: Consequences of inappropriate latent variable measurement models. Psychological Methods, 25(1), 30–45.

Rigdon, E. E. (2012). Rethinking partial least squares path modeling: In praise of simple methods. Long Range Planning, 45(5–6), 341–358.
14

Confirmatory Factor Analysis

This chapter covers the analysis of measurement models with common factors and continuous indicators
using the SEM technique of confirmatory factor analysis (CFA). In contrast to exploratory factor analysis
(EFA), restricted measurement models are analyzed in CFA. This means that the researcher specifies
(1) the exact number of factors (e.g., 3); (2) the pattern of factor loadings, or the specific correspondence
between factors and indicators; and (3) the presence of correlated errors, if any. The second feature just
mentioned means that an indicator loads only on the factor(s) it is supposed to measure (as specified by the
researcher), but all cross-loadings of that indicator on other factors are fixed to zero. Although it is pos‑
sible (but not required) to specify an exact number of factors in EFA, the technique analyzes unrestricted
measurement models, where each indicator loads on all factors, and the researcher cannot disable this
feature of the method (i.e., all cross-loadings are freely estimated). Another difference is that EFA models
with multiple factors are identified only after specifying a method of factor rotation, such as oblique (fac‑
tors are allowed to covary) versus orthogonal (factors are uncorrelated). Because standard CFA requires an
identified model, there is no rotation phase, and factors are typically allowed to covary. Within requirements
for identification, correlated errors can be estimated in CFA, but doing so in EFA is not so straightforward.
Thus, the technique of CFA better supports the analysis of error covariance structures than EFA. Additional
considerations in the choice between EFA and CFA are outlined next.
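To preview what a restricted measurement model looks like in software, here is a minimal sketch in lavaan syntax for R (an assumed choice of program; the variable names are placeholders) for two factors and six indicators:

```r
# A sketch, assuming the R package lavaan and a hypothetical data
# frame d with indicators x1-x6.
library(lavaan)

model <- '
  # each indicator loads only on its own factor; all cross-loadings
  # are omitted and thereby fixed to zero (a restricted model)
  A =~ x1 + x2 + x3
  B =~ x4 + x5 + x6
'
fit <- cfa(model, data = d)   # cfa() lets the factors covary by default
summary(fit, fit.measures = TRUE)
```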

EFA VERSUS CFA

Both CFA and EFA are based on the common factor model, which assumes reflective measurement from the perspective of classical test theory.1 The two methods share mathematical bases such that the restricted models analyzed in CFA are generated by constraining to zero cross-loadings in the corresponding EFA model for the same indicators and number of factors (Jöreskog, 1969). These constraints identify the restricted measurement models that are analyzed in CFA.

1 An exception is the EFA technique of principal component analysis (PCA), which analyzes composites instead of common factors.

Both techniques partition observed indicator variance into common variance and unique variance. Common variance is shared among the indicators and is a basis for observed covariances between them that departs appreciably from zero. Factors that approximate latent variables are “constructed” from common variance, and these proxies are called common factors. The number of factors of substantive interest is usually less than the number of indicators. It is impossible to estimate more common factors than indicators, but for parsimony’s sake, there is no point in retaining a model with just as many explanatory entities (factors) as there are entities to be explained (indicators; Mulaik, 2009a). Unique variance consists of specific variance and random measurement error.
Specific variance is systematic variation that is not explained by any factor in the model. It may be due to characteristics of individual indicators, such as the particular stimuli that make up a task. Another source is method variance, or the use of a particular measurement method (e.g., self-report) or informant (e.g., parents) to obtain the scores.

A fundamental issue in both EFA and CFA is that latent variables can be approximated from observed variables, but not exactly. This is because a total of v observed variables (the indicators) cannot be uniquely transformed to an even larger number of variables, or v + m + u in total, where m is the number of common factors and u is the number of unique variances for the indicators (of which there are v such terms). This general problem is called factor indeterminacy, which means that common factors are not uniquely determined by their respective sets of indicators. For example, there are infinitely many ways to generate factor scores for individual cases, an issue called factor score indeterminacy, and not all of these sets of factor scores will rank the cases the same way (Grice, 2001). Indeed, correlations between sets of factor scores equally compatible with the same data can be negative (Guttman, 1955). The problem of rotational indeterminacy in EFA refers to the existence of infinite sets of factor loadings and factor correlations, all of which fit the data equally well in models with multiple factors (Steiger & Schönemann, 1978). Also, indicator scores are rarely perfectly precise (i.e., rXX < 1.0). With infinitely many indicators in an infinitely large sample, there is no indeterminacy in factor analysis, but this goal is impractical; thus, latent variables are inherently estimated with uncertainty—see Rigdon et al. (2019) for more information.
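A worked count with illustrative numbers (mine, not the chapter’s) shows the problem: with v = 6 indicators and m = 2 common factors, there are u = 6 unique terms, so the model involves 6 + 2 + 6 = 14 variables in total. Only 6 of the 14 are observed, so the 8 unobserved variables (2 factors plus 6 unique terms) cannot be uniquely recovered from the 6 indicators.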
Because EFA is less demanding than CFA, it may be preferred in areas of more recent research where theory about measurement is weak in terms of the number of theoretical concepts that should be approximated or the correspondence between common factors and indicators. Early stages in constructing psychological tests or questionnaires, especially when there is relatively little guidance from theory, are a context for which EFA is well suited. For example, if there could be unexpected, but substantively meaningful, factors that strongly relate to subsets of indicators, then EFA would be preferred over CFA, which is less flexible here (Flora & Flake, 2017).

In contrast to the data-driven character of EFA, the CFA technique is used to evaluate how well a predefined factor solution fits the data in a particular sample. An example is later stages of test construction that follow more preliminary analyses with EFA in which the basic dimensionality of the indicators has been established (Brown, 2015). That is, results from earlier applications of EFA can inform the specification of CFA models in later test validation studies with data collected in new samples. The technique of CFA can also be applied to evaluate the effect of revising an established test by adding or removing indicators, such as whether the original factor structure in the modified test is preserved in the revision (Flora & Flake, 2017).

It was mentioned that hypotheses about correlated measurement error can be readily tested in CFA. Another advantage over EFA is that there is no need to save estimated factor scores in the raw data file in CFA if common factors are later analyzed as predictors or outcomes of other variables. This is because a CFA model can be respecified as a structural regression (SR) model, where some factors are specified as exogenous, or causes, but others are endogenous, or outcomes. In this way, an endogenous factor can be regressed on other factors or on observed variables, such as covariates. Chapter 15 covers SR models. In general, factor scores for cases are not needed in SEM analyses.

A much more controversial role for CFA is its application immediately after retaining a factor model in EFA with the goal of somehow “verifying” or “confirming” the EFA results. In this case, the CFA model has the same number of factors as the EFA model, but some or all cross-loadings are constrained to zero in the CFA model. No new data are collected, so both the EFA and CFA models are analyzed in the same sample. There are two problems here: One is that the two techniques, EFA and CFA, could capitalize on the same chance variation in a particular sample, especially if the same estimator (e.g., maximum likelihood, ML) is used in both analyses. Whether any CFA model retained in this situation would replicate in a new sample is unknown.

A second problem is that specification of a CFA model based on EFA outcomes and analyzed with the same data could lead to rejection of the CFA model. This is because indicators in EFA often have relatively high secondary loadings on factors other than the one for which they have their primary loading. These secondary loadings may account for relatively high proportions of variance, so constraining them to zero in CFA may be too conservative. Consequently, the more restrictive CFA model may be inconsistent with the data (van Prooijen & van der Kloot, 2001).
Misspecification of zero cross-loadings can lead to overestimation of factor correlations (Asparouhov & Muthén, 2009). Replicating factor analysis results, whether EFA or CFA, in new samples is a better solution—see Osborne and Fitzpatrick (2012) for guidance. To be clear, applying CFA right after EFA in the same sample in no way verifies, confirms, or replicates the EFA results (Flora & Flake, 2017).

The labels “exploratory” versus “confirmatory” (i.e., EFA vs. CFA) should not be reified. It is true that EFA requires no a priori hypotheses about factor–indicator correspondence or even the number of factors. But there are also more confirmatory modes in EFA, such as instructing the computer to extract a specific number of factors based on theory. The technique of CFA is not strictly confirmatory. It happens in many, if not most, analyses that the initial restricted factor model fails to fit the data. In this case, the researcher typically modifies the hypotheses on which the initial model was based and specifies a new model, and the respecified model is then tested with the same data. This process should be guided by theory, but relatively few applications of CFA are strictly confirmatory.

Two alternative factor analysis methods blend features of EFA and CFA and thus are intermediate options. Both allow for some of the flexibility in EFA while adding capabilities for global fit testing or specification of error covariances. In exploratory structural equation modeling (ESEM) (Asparouhov & Muthén, 2009), the researcher specifies the exact number of factors just as in CFA, but the measurement model is unrestricted in that all possible cross-loadings are free parameters. The model is identified through specification of a rotation option, also as in EFA. Standard errors are available for freely estimated parameters, as are statistics about both global fit (e.g., model chi-square) and local fit (e.g., standardized residuals). The technique was originally exclusive to Mplus, but now there are R packages for ESEM such as psych (Revelle, 2022)—see Marsh et al. (2014) and Morin (2023) for more information.

Unrestricted measurement models are also analyzed in the EFA in CFA framework (E/CFA) (Jöreskog, 1969), except that a reference variable indicator is selected for each factor and all cross-loadings of that indicator are fixed to zero; otherwise, cross-loadings for all other indicators are freely estimated, as are all factor covariances. Relatively large residuals between pairs of indicators could indicate the need to freely estimate error covariances—see Brown (2015, chap. 5) for examples.
Two alternative factor analysis methods blend fea- above and beyond any influence of substantive latent
tures of EFA and CFA and thus are intermediate options. variables. Special CFA models for estimating method
Both allow for some of the flexibility in EFA while add- effects are described later in this chapter.
ing capabilities for global fit testing or specification of 3. Suppose there is strong guidance from theory or
error covariances. In exploratory structural equation results from empirical studies about selecting indica-
modeling (ESEM) (Asparouhov & Muthén, 2009), the tors a priori. In this case, homogeneous indictors are
researcher specifies the exact number of factors just preferred. This is because estimates based on indica-
as in CFA, but the measurement model is unrestricted tors that are highly correlated (internally consistent)
in that all possible cross-loadings are free parameters. may be less biased and more efficient in analyses that
The model is identified through specification of a rota- are more confirmatory than exploratory.2
tion option also as in EFA. Standard errors are available
for freely estimated parameters as are statistics about 4. But analyzing sets of less homogenous indicators
both global fit (e.g., model chi-square) and local fit (e.g., that cover a wider range of the target domain may be a
standardized residuals). The technique was originally safer bet when there is relatively little guidance from
exclusive to Mplus, but now there are R packages for either theory or research about indicator selection. A
ESEM such as psych (Revelle, 2022)—see Marsh et al. risk is that sharply defined approximations based on
(2014) and Morin (2023) for more information. highly correlated indicators can be off-target, or fail to
Unrestricted measurement models are also analyzed reflect essential aspects of a concept. Use of more con-
in the EFA in CFA framework (E/CFA) (Jöreskog, firmatory methods can be a benefit in this context, too.
1969) except that a reference variable indicator is 5. Analyzing a set of indicators with less-than-stel-
selected for each factor and all cross-loadings of that lar psychometrics (i.e., lower score reliabilities or inter-
indicator are fixed to zero; otherwise, cross-loadings nal consistencies) can yield estimates that are approxi-
for all other indicators are freely estimated as are all mately accurate if they (a) sample a sufficiently wide
factor covariances. Relatively large residuals between part of the concept, (b) yield scores that are sufficiently
pairs of indicators could indicate the need to freely esti-
mate error covariances—see Brown (2015, chap. 5) for 2 Indicatorintercorrelations should not be so high (e.g., r > .95)
examples. that extreme collinearity is a problem.

variable (e.g., capture wider rather than narrow ranges of individual differences), and (c) are analyzed by more confirmatory than exploratory methods.

6. Technical problems in the analysis, such as Heywood cases or convergence failure in iterative estimation, are more likely to occur if some factors have too few indicators, especially in small samples for factors with just two indicators. A safer minimum is about 3–5 indicators for each anticipated factor. For example, if a total of four leadership dimensions are hypothesized, then the minimum number of candidate indicators would be about 12–20. Possible exceptions to this ≥ 3 indicators/factor pragmatism are explained next.

Hayduk and Littvay (2012) described situations where having more indicators per factor is not necessarily better than having fewer, including the use of single indicators. For example, if just 1 of 3 candidate indicators for the same factor has good psychometric characteristics, it may be better to omit the 2 weaker indicators, which may dilute or contaminate estimates for the best single indicator. If multiple indicators are highly redundant, there is little extra information to be gained beyond analyzing a single indicator of the same construct. Default specification of a minimum number of multiple indicators, such as 3–5 per factor, as a "golden rule" limits the number of estimated factors. In contrast, relying on fewer indicators per factor, including single indicators, permits the specification of additional latent variables, given the same number of indicators. These added concepts could allow stronger statistical control for confounding or estimation of indirect effects—see Hayduk and Littvay (2012) for examples. The point is that there is no magic number of indicators per factor. Instead, the researcher's hypotheses should guide the selection of indicators, not an arbitrary rule for a minimum number (e.g., ≥ 3 indicators/factor).

BASIC CFA MODELS

Characteristics of basic CFA models with multiple factors are summarized next:

1. Each indicator is continuous with two causes—a common factor that approximates the latent variable the indicator is supposed to measure and all sources of unique variance—random measurement error and specific variance not explained by the factor—represented by the error term.

2. The error terms are independent of each other and of the factors; that is, there are no unmeasured confounders for any pair of indicators, and all omitted causes are unrelated to the factors.

3. All relations are linear and the factors covary (i.e., there are no causal effects between any pair of factors).

The first two features just listed specify unidimensional measurement, or the hypothesis that each indicator measures a single dimension and shares nothing with other indicators after controlling for the common factors. Later in the chapter we will deal with CFA models for multidimensional measurement, in which some indicators load on more than one factor or pairs of error terms are specified as correlated. There are special factor analytic methods for estimating curvilinear relations between factors and continuous indicators or between factors themselves—see Amemiya and Yalcin (2001). Relations between categorical indicators and factors are inherently nonlinear, and the technique of categorical CFA is described in Chapter 18.

Presented in Figure 14.1 is a basic CFA model with two factors and six indicators represented in full McArdle–MacDonald RAM graphical symbolism. All cross-loadings are fixed to zero. For example, there is no direct causal effect from factor B to indicator X1, which is specified as measuring the other factor (i.e., A → X1). But this specification does not imply that X1 and factor B are unrelated. To the contrary, the open path in the model, or

X1 ← A ↔ B

predicts that indicator X1 and factor B should covary because B is correlated with A, a cause of X1 (the other cause is E1, its error term), but this association is not causal. Likewise, indicators X1 and X4 in the figure are expected to covary because their respective causes, factors A and B, are correlated, or

X1 ← A ↔ B → X4

and this is true even though X1 and X4 are presumed to measure different things.
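As a numeric sketch of these predictions, suppose (hypothetically) that the standardized loadings of X1 on A and of X4 on B are .70 and .60, respectively, and that the factor correlation is .50; none of these values come from the chapter. Multiplying the coefficients along each open path gives the model-implied correlations:

l.X1A <- .70  # hypothetical standardized loading, A -> X1
l.X4B <- .60  # hypothetical standardized loading, B -> X4
r.AB  <- .50  # hypothetical factor correlation

l.X1A * r.AB          # implied corr(X1, B)  = .35
l.X1A * r.AB * l.X4B  # implied corr(X1, X4) = .21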
The numerals (1) in Figure 14.1 are scaling constants that specify metrics for unmeasured variables,
which in CFA models include common factors and indicator error terms. For example, the specification in the figure that

E1 → X1 = 1.0

is a unit loading identification (ULI) constraint that scales E1 in the metric of the unique (unexplained) variance of its indicator, X1. The same basic method was described in Chapter 7 for scaling disturbances in path models.

Each factor in Figure 14.1 is scaled through the reference variable method—also called the marker variable method or the referent loading identification approach (Newsom, 2015)—where a ULI constraint is imposed on the loading of one indicator per factor. For example, the specification

A → X1 = 1.0

in the figure scales the variance of factor A in the metric of the common (explained) variance of indicator X1, the reference variable for factor A. By the same logic, the computer estimates the variance of factor B in the metric of the common variance in X4, the marker variable for this factor. Most computer tools for CFA that automatically scale common factors use the reference variable method. For example, a common default for syntax is that the first indicator listed for a factor is automatically selected as the reference variable—check the documentation for your computer tool about factor scaling.
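In lavaan, for instance, the cfa() function applies exactly this default, and the same ULI constraints can also be written out explicitly. The indicator names and data frame in this sketch are hypothetical:

library(lavaan)

# Default behavior: the first indicator listed for each factor is the
# reference (marker) variable, so its loading is fixed to 1.0
model <- '
  A =~ X1 + X2 + X3
  B =~ X4 + X5 + X6
'
fit <- cfa(model, data = mydata)

# The same ULI constraints written explicitly:
model.uli <- '
  A =~ 1*X1 + X2 + X3
  B =~ 1*X4 + X5 + X6
'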
If multiple indicators for the same factor have equally precise scores and if none is deemed critically representative of the underlying concept (Hayduk & Littvay, 2012), then it is generally arbitrary in single-sample analyses which indicator is selected as the reference variable. This is because the choice does not typically affect (1) global model fit, (2) the standardized solution, or (3) estimates of indicator error variances in the unstandardized solution. Reference variable loadings fixed to 1.0 remain so in the unstandardized solution and have no standard errors because they are constants, not variables. This means that there is no significance test—a normal deviate (z), or the ratio of a statistic over its standard error—for fixed reference variable loadings, which is a possible drawback of the method if a significance test for all unstandardized loadings is desired. Other options for scaling factors that do not require the selection of reference variables are described in the next section.

All factors in basic CFA models are exogenous variables that are free to vary and covary. It is possible to include covariates in measurement models, such as age or other characteristics of research participants, that control for potential confounds or bias. Covariates are usually specified as causes of common factors, which also respecifies such factors as endogenous variables with disturbance terms (i.e., endogenous factors are neither free to vary nor to covary). A CFA model with covariates is actually an SR model with both a measurement component (the factor model) and a structural component (measured causes of factors, or the covariates).

FIGURE 14.1. Basic confirmatory factor analysis model with 2 factors and 6 indicators presented in full McArdle–MacDonald RAM graphical symbolism.


Model Parameters

Free parameters of CFA models with continuous indicators when means are not analyzed include the variances and covariances of exogenous variables and direct effects on endogenous variables (Rule 7.1). For the basic model in Figure 14.1, free parameters include

1. 8 variances (of 2 factors and 6 indicator error terms);
2. a single covariance for the 2 factors; and
3. 4 direct effects of factors on indicators (i.e., loadings) that are not fixed to equal scaling constants (e.g., the loadings for X2–X3 and X5–X6).

The grand total of free parameters is thus 13. With v = 6 observed variables, the total number of observations when means are not analyzed is 6(7)/2, or 21 (Rule 7.2), so dfM = 21 – 13 = 8 for Figure 14.1.
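These counts are easily verified with a few lines of arithmetic, shown next for Figure 14.1:

v <- 6                  # number of observed variables
obs <- v * (v + 1) / 2  # number of observations: 21
free <- 8 + 1 + 4       # variances + factor covariance + free loadings
obs - free              # model degrees of freedom: 8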
Identification Requirements

All CFA models must satisfy the same necessary-but-insufficient identification requirements as for other parametric structural equation models, or (1) dfM ≥ 0 and (2) each unmeasured variable must be scaled (Rule 7.3). We just proved that the basic model in Figure 14.1 meets both of the requirements just stated. Additional sufficient requirements for basic CFA models that concern minimum numbers of indicators are summarized next:

RULE 14.1 If a basic CFA model

1. with a single factor has at least 3 indicators, or
2. has 2 or more factors where each factor has 2 or more indicators,

then the model is identified without additional constraints when means are not analyzed.

That's it. The first part of Rule 14.1 for single-factor models is the three-indicator rule. Exercise 1 asks you to verify that dfM = 0 for a basic CFA model with 1 factor and 3 indicators. In practice, such a model would be uninteresting because it would perfectly fit the data. Thus, dfM > 0 only for basic single-factor models with ≥ 4 indicators. Exercise 2 involves proving that a basic single-factor model with just 2 indicators is not identified without imposing constraints, such as specifying that the 2 unstandardized loadings are equal.³ The second part of Rule 14.1 for basic models with multiple factors is the two-indicator rule. As mentioned, though, analyses of models with factors that have just two indicators in small samples are susceptible to technical problems in the analysis.

³ Kenny (1979) noted that although a single-factor model with 2 indicators and equality-constrained loadings is identified with dfM = 0, such a model does not perfectly fit the data if the correlation between the indicators is negative.

OTHER METHODS FOR SCALING FACTORS

A drawback of the reference variable method is that the estimate of factor variance depends on the particular indicator selected as the marker variable. That estimate will change if the loading for a different indicator for the same factor is constrained to unity, especially if raw score metrics vary over indicators. It is also true that ratios of factor variance estimates over their standard errors (i.e., z) depend on the reference variable. This means that outcomes of significance testing (i.e., p values) for factor variances can be affected by replacing one indicator with another as the reference variable. The same is also true for factor covariances: Their estimates, standard errors, and z statistics all depend on the metrics of the particular indicators selected for scaling the two factors (Gonzalez & Griffin, 2001).

Two other options for scaling factors in CFA that do not require the selection of reference variables are described next. The choice among the three methods has no effect on global model fit, the standardized solution, or unstandardized estimates for error terms, but unstandardized estimates for other model parameters depend on the method. The two alternative methods are compared against the reference variable method for the basic 2-factor, 6-indicator model depicted in Figure 14.2(a), but now using more compact graphical symbolism. Recall that dfM = 8 for Figure 14.2(a), which is the same model as depicted in Figure 14.1 in full RAM graphical symbolism.


FIGURE 14.2. Scaling factors in the reference variable method with unit loading identification (ULI) constraints (a), variance standardization method with unit variance identification (UVI) constraints (b), and effects coding method with effects coding identification (ECI) constraints (a + b + c)/3 = (d + e + f)/3 = 1.0 (c). Models presented in compact graphical symbolism, and all indicator error terms are scaled through ULI constraints.

Represented in Figure 14.2(b) is the variance standardization method or factor variance identification approach (Newsom, 2015), where scaling constants are applied to the factor variances. That is, the method standardizes factors by fixing their variances to 1.0,
which is called a unit variance identification (UVI) constraint. Because the factors are standardized, their variances are not free parameters, but factor covariances are freely estimated as Pearson correlations. Neither the indicators nor their error variances are standardized in this method, and loadings for all indicators are free parameters. This means that all loadings have standard errors in the unstandardized solution, so each can be tested for statistical significance (if desired). Exercise 3 asks you to verify that dfM = 8 for Figure 14.2(b); that is, scaling factors through ULI constraints versus UVI constraints affects neither the number of free parameters nor the model degrees of freedom.
free parameters nor the model degrees of freedom. indicators and a cross-factor equality constraint is imposed on
The choice between scaling factors through ULI the loadings for indicators of different factors. In some cases, the
constraints versus UVI constraints is usually based on value of chiD (1) for the test of the equality constraint depends
the merits of analyzing factors in, respectively, unstan- on how the factors are scaled. The same author also described
dardized versus standardized form. When a CFA checks for constraint interaction and ways to handle the issue.

free parameter, variances of endogenous factors are not free parameters, and only some programs, such as LISREL and SEPATH, allow the predicted variances of endogenous factors to be constrained to 1.0. This is not an issue for basic CFA models, wherein all factors are exogenous, but it can be for SR models where some factors are endogenous.

⁴ Steiger (2002) described an exception called constraint interaction that can occur when some common factors have just two indicators and a cross-factor equality constraint is imposed on the loadings for indicators of different factors. In some cases, the value of chiD(1) for the test of the equality constraint depends on how the factors are scaled. The same author also described checks for constraint interaction and ways to handle the issue.

There are times when standardizing factors is not appropriate—for example, (1) the analysis of a structural equation model across independent samples that differ in their variabilities and (2) longitudinal measurement of variables that manifest changing variances over time. In both cases, important information may be lost when factors are standardized. How to appropriately scale factors in multiple-group CFA is addressed in Chapter 22.

Little et al. (2006) described the effects coding method for scaling factors when (1) the indicators for the same factor all have the same raw score metric and (2) most of those indicators are specified to measure just one factor. Their method does not require the selection of reference variables, nor does it standardize factors. Instead, it relies on the capability of modern SEM computer tools to simultaneously impose linear constraints on two or more parameter estimates, in this case the unstandardized loadings for indicators of a common factor.

The method works by specifying an effects coding identification (ECI) constraint, which means that the average loading for indicators of the same factor equals 1.0 in the unstandardized solution. This specification instructs the computer to derive optimal loadings for a set of indicators that average to 1.0. So scaled, the factor variance is estimated as the average common variance across all the indicators in their original metric, weighted by the extent each indicator contributes to factor measurement. In this way, all indicators contribute to the scale of their common factor. For example, the ECI constraint for indicators X1–X3 of factor A in Figure 14.2(c) is

(a + b + c)/3 = 1.0    (14.1)

which is algebraically equivalent to

3 – a – b – c = 0    (14.2)

The researcher specifies the linear constraint defined in Equation 14.2 in the syntax of an SEM computer tool. Exercise 4 asks you to derive the corresponding ECI constraint that scales factor B in Figure 14.2(c), assuming that scores on X4–X6 are all based on the same metric. There is an option in lavaan to automatically generate ECI constraints when models with common factors are analyzed (Rosseel et al., 2023).
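A sketch of both approaches in lavaan is listed next. The indicators, labels, and data frame are hypothetical; the automatic option is the effect.coding argument available in recent versions of the package:

# 1. Automatic generation of ECI constraints:
fit.eci <- cfa(model, data = mydata, effect.coding = TRUE)

# 2. Explicit labeled loadings with the Equation 14.1 constraint for A:
model.eci <- '
  A =~ NA*X1 + a*X1 + b*X2 + c*X3
  a + b + c == 3   # i.e., the average loading equals 1.0
'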
The effects coding method does not standardize the factors, and all indicators for the same factor contribute to scaling that factor. The feature just mentioned is a potential benefit if the researcher seeks a stable estimate of factor variance that does not rely on a single indicator (i.e., a reference variable). This may be especially useful in studies where factor variances provide substantive information. One example is in longitudinal studies of change over time when the focus is on latent growth factors (Chapter 21 in this book), and another is when samples from different populations are compared on factors that underlie observed outcomes (Chapter 22), among other situations described by Schweizer et al. (2019) in which interpretation of factor variances is meaningful.

DETAILED EXAMPLE FOR A BASIC CFA MODEL OF COGNITIVE ABILITIES

The first edition of the Kaufman Assessment Battery for Children (KABC-I) (Kaufman & Kaufman, 1983) is an individually administered cognitive ability test for children 2½–12½ years old. The test's authors claimed that the eight subtests represented in Figure 14.3 measure two factors. The three tasks in the figure believed to reflect sequential processing all require the correct recall of auditory stimuli (Number Recall, Word Order) or visual stimuli (Hand Movements) in a particular order. The other five tasks—Gestalt Closure through Photo Series in the figure—are presumed to measure more holistic, less order-dependent reasoning, or simultaneous processing. Keith (1985) suggested alternative factor names for the KABC-I, including "short-term memory" instead of "sequential processing" and "visual–spatial reasoning" instead of "simultaneous processing" (i.e., factor names are hypotheses, not truths), but the original terms by Kaufman and Kaufman (1983) are retained in this example. Exercise 5 asks you to demonstrate that dfM = 19 for Figure 14.3.

Listed in Table 14.1 are the annotated script files used to conduct the analyses for this example in lavaan (Rosseel et al., 2023) and semTools (Jorgensen et al., 2022). All input and output files can be downloaded from this book's website.


FIGURE 14.3. Basic confirmatory factor analysis model of the first edition of the Kaufman Assessment Battery for Children presented in full McArdle–MacDonald RAM graphical symbolism.

Summarized in Table 14.2 are the input data, which are from the KABC-I's normative sample for 10-year-old children (N = 200). Variances in lavaan analyses are automatically calculated as S², with N in the denominator, not as s², with N – 1 in the denominator. (The standard deviations in Table 14.2 are the square roots of s².) Remember that this default in lavaan can be disabled—see the script file for analysis 1, Table 9.1.

Single-Factor Model

If the target model has two or more factors, the first model analyzed in CFA is often a single-factor model. If a single-factor model cannot be rejected, there is little point in evaluating models with more factors. Listed in Table 14.1 for analysis 1 is the lavaan input file for fitting a single-factor model with eight indicators (all KABC-I subtests) to the data in Table 14.2. The unstandardized loading for the Hand Movements task was automatically fixed to 1.0 to scale the single factor. You should verify that dfM = 20 for this single-factor model. Estimation in lavaan with default ML converged to an admissible solution. Values of selected global fit statistics for the single-factor model are reported next:

chiML(20) = 105.427, p < .001
RMSEA = .146, 90% CI [.119, .174]
CFI = .818, SRMR = .084

TABLE 14.1. Analyses, Script Files, and Packages in R for Confirmatory Factor Analysis Examples

Analysis                                                  Script files          R packages
1. Single-factor model for the KABC-I                     kabc-1-factor.r       lavaan
2. Two-factor model for the KABC-I, power analysis
   for the exact-fit test, and reliability coefficients
   for factor measurement                                 kabc-2-factor.r       lavaan, semTools
3. Example of a Heywood case for a two-factor
   model analyzed in a small sample                       sabatelli-heywood.r   lavaan

Note. Output files have the same names except the extension is ".out." KABC-I, Kaufman Assessment Battery for Children, first edition.


TABLE 14.2. Input Data (Correlations, Standard Deviations) for Analysis of a Two-Factor Model of the KABC-I

Indicator              1     2     3     4     5     6     7     8
Sequential scale
1. Hand Movements      —
2. Number Recall      .39    —
3. Word Order         .35   .67    —
Simultaneous scale
4. Gestalt Closure    .21   .11   .16    —
5. Triangles          .32   .27   .29   .38    —
6. Spatial Memory     .40   .29   .28   .30   .47    —
7. Matrix Analogies   .39   .32   .30   .31   .42   .41    —
8. Photo Series       .39   .29   .37   .42   .58   .51   .42    —

SD                   3.40  2.40  2.90  2.70  2.70  4.20  2.80  3.00

Note. KABC-I, Kaufman Assessment Battery for Children, first edition. Input data are from Kaufman and Kaufman (1983), N = 200.
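The next sketch follows the general logic of the script for analysis 1. The summary data come from Table 14.2, but the variable abbreviations are mine, so they may not match the actual script files; by default, lavaan rescales the covariance matrix with N rather than N – 1 in the denominator, as noted earlier:

library(lavaan)

kabc.cor <- getCov('
  1
  .39  1
  .35  .67  1
  .21  .11  .16  1
  .32  .27  .29  .38  1
  .40  .29  .28  .30  .47  1
  .39  .32  .30  .31  .42  .41  1
  .39  .29  .37  .42  .58  .51  .42  1',
  names = c("HM", "NR", "WO", "GC", "Tr", "SM", "MA", "PS"))
kabc.sd  <- c(3.40, 2.40, 2.90, 2.70, 2.70, 4.20, 2.80, 3.00)
kabc.cov <- cor2cov(kabc.cor, sds = kabc.sd)

model1 <- 'g =~ HM + NR + WO + GC + Tr + SM + MA + PS'
fit1 <- cfa(model1, sample.cov = kabc.cov, sample.nobs = 200)
fitMeasures(fit1, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))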

The single-factor model fails the chi-square test at a conventional level of statistical significance, and the lower bound of the RMSEA's 90% confidence interval, or .119, exceeds .10, which is a poor result. Results for the CFI and SRMR are also unfavorable. Exercise 6 asks you to inspect the residuals for this analysis, but I can tell you that local fit is poor, too. Thus, the single-factor model for the KABC-I is rejected.

A single-factor CFA model is nested under any other CFA model with two or more factors and the same pattern of error covariances (if any) for the same indicators. This is because a model with one factor is just a restricted version of any model with multiple factors where, conceptually, all absolute pairwise factor correlations are constrained to unity.⁵ So constrained, the factors are identical, which is the same as collapsing multiple factors down to just one. This also means that the chi-square difference test can be conducted to directly compare the relative fit of single- versus multiple-factor CFA models. Kenny (1979) noted that the test for a single factor is also relevant for manifest-variable path models. This is because the failure to reject a single-factor model in this context means that the variables measure only one domain (they do not have discriminant validity).

⁵ Constraints are generally imposed in the unstandardized solution, not the standardized one. This means that single-factor models must be explicitly specified and fitted to the data instead of fixing factor correlations to 1.0 in models with two or more factors.

Two-Factor Model

In analysis 2 (see Table 14.1), the two-factor model in Figure 14.3 was fitted to the data in Table 14.2.⁶ The estimator is default ML, and the analysis in lavaan converged normally to an admissible solution. The power of the chi-square test in this analysis, where dfM = 19, estimated with the MacCallum–RMSEA method for the actual sample size, N = 200, is low, only .39.⁷ Thus, if the model does not have perfect fit in the population, there is only about a 40% chance of detecting this result over random samples. A minimum sample size for power of the exact-fit test to be at least .90 is N = 542, or over twice as large as the actual sample (N = 200). Despite low power, the two-factor model fails the chi-square test, or

chiML(19) = 38.325, p = .005

⁶ Kline (2013b) and Flora and Flake (2017) compared EFA results for an unrestricted version of Figure 14.3 fitted to the same data with those for the two-factor CFA model for the KABC-I.
⁷ Parameters: ε0 = 0, ε1 = .05, dfM = 19, N = 200, α = .05.
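A sketch along the lines of analysis 2 is listed next. It reuses kabc.cov from the earlier sketch, and the power calculations use the semTools functions for the MacCallum–RMSEA method with the parameters in footnote 7:

model2 <- '
  sequential   =~ HM + NR + WO
  simultaneous =~ GC + Tr + SM + MA + PS
'
fit2 <- cfa(model2, sample.cov = kabc.cov, sample.nobs = 200)

library(semTools)
findRMSEApower(rmsea0 = 0, rmseaA = .05, df = 19, n = 200, alpha = .05)
findRMSEAsamplesize(rmsea0 = 0, rmseaA = .05, df = 19, power = .90)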


Thus, we tentatively reject the two-factor model, and we will examine other details about both global and local fit after demonstrating the chi-square difference test for this example.

Given the chiML values for the one- and two-factor models just reported, the chi-square difference statistic is calculated as follows:

dfD = 20 – 19 = 1
chiD(1) = 105.427 – 38.325 = 67.102, p < .001

which says that the fit of the model with two factors is statistically better than that of the model with a single factor. The implication of this result is not yet clear because the two-factor model is tentatively rejected. In general, the outcome of the chi-square difference test is most meaningful when the more complex of the two models, or the two-factor model in this case, has acceptable fit. As we will see, there are appreciable problems with the fit of the two-factor model, too.
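In lavaan, the same comparison can be obtained with the anova() function for nested models fitted to the same data, or computed directly from the two model chi-squares (fit1 and fit2 come from the earlier sketches):

anova(fit2, fit1)

# Or directly from the two model chi-squares:
pchisq(105.427 - 38.325, df = 20 - 19, lower.tail = FALSE)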
Selected values of approximate fit indexes for the two-factor CFA model of the KABC-I are listed next:

RMSEA = .071, 90% CI [.038, .104]
CFI = .959, SRMR = .072

Although values of the CFI and SRMR are not obviously problematic, the upper bound of the RMSEA's confidence interval, .104, is unfavorable.⁸ Thus, results over the global fit statistics considered to this point are mixed: The exact-fit test is failed, and values of some, but not all, approximate fit indexes indicate potential problems in global fit. The residuals for this model are examined after we next consider parameter estimates in this pedagogical example for the two-factor model.

⁸ Dynamic thresholds at the .05 level computed in the Monte Carlo simulation method by Niemand and Mai (2018) for this analysis are RMSEA < .058, SRMR < .056, and CFI > .941. Observed results for the RMSEA (.071) and SRMR (.072) are problematic based on these dynamic thresholds, but not for the CFI (.959) in this method.

Presented in Table 14.3 are parameter estimates for the two-factor model with standard errors for both solutions. Note in the table that the unstandardized loadings for reference variables (Hand Movements, Gestalt Closure) equal 1.0 and have no standard errors. The unstandardized loading for the Number Recall task is 1.147, so scores on this task are expected to increase by this amount for every 1-point increase in the sequential factor. The standard error is .181, so z = 1.147/.181 = 6.34, which is significant at the .01 level. All other unstandardized loadings in the table are interpreted in the same way. Standardized loadings in basic CFA models are estimated Pearson correlations between indicators and their common factor. For example, although the Hand Movements task is a reference variable, it has a standardized loading that is not 1.0; instead, it is .497 (see the table). Thus, the sequential factor explains .497², or about R² = .247 (25%), of this indicator's variance. All the other standardized loadings in Table 14.3 are interpreted in comparable ways.

Ideally, a factor would explain at least 50% of the variance in a continuous indicator (Bagozzi & Yi, 2012). Such a result for all indicators of the same factor would support the hypothesis of convergent validity. A somewhat less demanding standard is that average variance extracted (AVE), which is just the average of the squared standardized loadings for indicators that depend on the same factor, should exceed .50. If AVE > .50, then, on average, more variance is explained by the common factor than remains in indicator error terms (Hair et al., 2022). By the first standard just described, the results in Table 14.3 are problematic: The two-factor model fails to explain the majority of the variance (R² > .50) for a total of four out of eight indicators, or half their number. Results are somewhat more favorable using the second standard, at least for the sequential factor, which explains an average of about 52% of the variance among its three indicators (AVE = .517). In contrast, the simultaneous factor explains, on average, only about 43% of the variance among its five indicators (AVE = .434). More formal measures of precision for factor measurement are described in Topic Box 14.1.
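For reference, both sets of quantities just described can be obtained from the fitted model in the earlier sketch:

# R-squares for the indicators and average variance extracted per factor
lavInspect(fit2, "rsquare")
semTools::AVE(fit2)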
In the real world, though, lower values of R2 are
Presented in Table 14.3 are parameter estimates for
generally acceptable in factor analysis. For exam-
the two-factor model with standard errors for both solu-
ple, Comrey and Lee (1992) suggested the graded
tions. Note in the table that the unstandardized loadings
descriptors listed next: R2 > .50 is excellent, and R2
for reference variables (Hand Movements, Gestalt Clo-
values approximately equal to .40, .30, .20, and .10
sure) equal 1.0 and have no standard errors. The unstan-
are, respectively, very good, good, fair, and poor. By
dardized loading for the Number Recall task is 1.147,
these less strict guidelines, results for indicators of the
so scores on this task are expected to increase by this
two-factor CFA model of the KABC-I are “excellent”
amount for every 1-point increase in the sequential fac-
(R2 > .50) for three out of eight indicators, none are
8 Dynamic “poor” (R2 around .10), and the rest (for five indica-
thresholds at the .05 level computed in the Monte
Carlo simulation method by Niemand and Mai (2018) for this tors) are somewhere in between—see Table 14.3. But
analysis are RMSEA < .058, SRMR < .056, and CFI > .941. note that these guidelines should not be blindly applied
Observed results for the RMSEA (.071) and SRMR (.072) are over all applications of CFA or over all kinds of indica-
problematic based on these dynamic thresholds, but not for the tors. For example, indicators that are continuous, such
CFI (.959) in this method. as total scores in the present example, tend to obtain


TABLE 14.3. Maximum Likelihood Estimates for a Two-Factor Model of the KABC-I

                              Unstandardized         Standardized
Parameter                     Estimate     SE        Estimate    SE

Factor loadings
Sequential
  Hand Movements               1.000       —           .497     .062
  Number Recall                1.147      .181         .807     .046
  Word Order                   1.388      .219         .808     .046
Simultaneous
  Gestalt Closure              1.000       —           .503     .061
  Triangles                    1.445      .227         .726     .044
  Spatial Memory               2.029      .335         .656     .050
  Matrix Analogies             1.212      .212         .588     .055
  Photo Series                 1.727      .265         .782     .040

Error (unique) variances
  Hand Movements               8.664      .938         .753     .061
  Number Recall                1.998      .414         .349     .075
  Word Order                   2.902      .604         .347     .075
  Gestalt Closure              5.419      .585         .747     .061
  Triangles                    3.426      .458         .472     .064
  Spatial Memory               9.997     1.202         .570     .065
  Matrix Analogies             5.105      .578         .654     .065
  Photo Series                 3.482      .537         .389     .063

Factor variances and covariance
  Sequential                   2.838      .838        1.000      —
  Simultaneous                 1.834      .530        1.000      —
  Sequential ↔ Simultaneous    1.271      .324         .557     .067

Note. KABC-I, Kaufman Assessment Battery for Children, first edition. Standardized estimates for error variances are proportions of unexplained variance. All variables are standardized in the standardized solution.

Given the results in Table 14.3 for the standardized loadings, Exercise 7 asks you to calculate the predicted correlations between each factor and the indicators of the other factor, such as the model-implied correlation between the simultaneous factor and the Hand Movements indicator of the sequential factor. Note that these results are not cross-loadings because each indicator is specified to measure a single factor in basic CFA models. Graham et al. (2003) used the term structure coefficient to describe estimated Pearson correlations between measured variables and common factors or composites. So defined, structure coefficients for basic CFA models include the (1) standardized loadings for indicators of the same factor and (2) model-implied correlations between indicators and all other factors in the model (i.e., the ones they are not specified to directly measure).
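Under the definition just given, all structure coefficients can be computed at once by multiplying the standardized loading matrix by the factor correlation matrix. A sketch for fit2 from the earlier analysis is listed next:

std <- lavInspect(fit2, "std")
std$lambda %*% std$psi   # 8 x 2 matrix of structure coefficients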


TOPIC BOX 14.1

Reliability of Factor Measurement

Recall that the alpha coefficient, also known as tau-equivalent reliability, assumes that a set of items (i.e., multiple indicators) are unidimensional and measure their common factor in the same way (i.e., their factor loadings are equal), but their error variances are allowed to vary. Also, computation of alpha does not require a factor analysis—see the Psychometrics Primer. Coefficient omega is a model-based alternative that (1) requires the analysis of a CFA model and (2) assumes only that a set of indicators is congeneric, or caused by a common factor to varying degrees (i.e., loadings vary over indicators) and with error variances that may be unequal. The version of omega for the unstandardized solution in unidimensional models with continuous indicators and independent errors is listed next (Raykov, 2004):

ω = (Σλ̂i)²φ̂ / [(Σλ̂i)²φ̂ + Σq̂ii]    (14.3)

where Σλ̂i is the sum of the loadings among indicators for the same factor, φ̂ is the estimated factor variance, and Σq̂ii is the sum of the error variances over all indicators. A different formula is needed for when indicators share at least one error covariance:

ω = (Σλ̂i)²φ̂ / [(Σλ̂i)²φ̂ + Σq̂ii + 2Σq̂ij]    (14.4)

where 2Σq̂ij is the sum of the error covariances. Flora (2020) describes other forms of omega, including ones for categorical indicators or multidimensional measurement models.

In the analysis of the two-factor model of the KABC-I in Figure 14.3, I used the semTools package to compute omega and related coefficients for factor reliability. These results are listed in the output file for analysis 2 in Table 14.1, but the calculation of omega for the three indicators of the sequential factor is demonstrated next. There are no error covariances in the model, so we need Equation 14.3 and the unstandardized parameter estimates for these indicators and their common factor in Table 14.3. You should verify these calculations:

Σλ̂i = 1.000 + 1.147 + 1.388 = 3.535
φ̂ = 2.838
Σq̂ii = 8.664 + 1.998 + 2.902 = 13.564
ω = 3.535²(2.838) / [3.535²(2.838) + 13.564] = .723

which is not a terrible result, but still the evidence for convergent validity among the three indicators of this factor is mixed at best (see Table 14.3). It is not surprising that the value for the alpha coefficient, which assumes tau-equivalence instead of congenerity, is lower for the same three indicators, or .701—see the output file for this analysis.
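The calculation in the topic box is easy to reproduce in R; the following sketch applies Equation 14.3 to the Table 14.3 estimates (the semTools reliability functions used for analysis 2 report the same value):

lam   <- c(1.000, 1.147, 1.388)  # unstandardized loadings
phi   <- 2.838                   # factor variance
theta <- c(8.664, 1.998, 2.902)  # error variances
sum(lam)^2 * phi / (sum(lam)^2 * phi + sum(theta))  # omega = .723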


You should know that standardized loadings for indicators that depend on multiple correlated factors are not generally interpreted as Pearson correlations. This is because they take the form of standardized regression coefficients (beta weights), which control for nonzero factor correlations and are scaled in a standard deviation metric, not in a correlation metric. Suppose that indicator X1 depends on both factors A and B and that the factor correlation is not zero. Its two standardized loadings on factors A and B are, respectively, .75 and .80. The first coefficient says that the score on X1 is expected to increase by .75 standard deviations, given a change in factor A of 1 standard deviation while controlling for factor B. Its other standardized loading, .80, has a similar interpretation, except that now factor A is controlled. The squares of standardized loadings that are not correlation coefficients in form generally have no meaningful interpretation—specifically, such values are not proportions of indicator variance explained by the corresponding factor—see Graham et al. (2003) for more examples.

Listed in the middle part of Table 14.3 are estimated error (unique) variances for the indicators. For example, the unstandardized error variance for the Triangles task is 3.426. The sample variance for this indicator is s² = 2.70² = 7.290 (Table 14.2), and the rescaled variance for N = 200 is

S² = (199/200) 7.290, or 7.254

Thus, the ratio 3.426/7.254, or .472, is the proportion of variance in the Triangles task not explained by the simultaneous factor (i.e., 1 – R²). The same proportion, .472, is the estimated proportion of error variance in the standardized solution—see the table. It also equals the complement of the squared standardized loading for the Triangles task, or R² = .726², or .527 at three-decimal accuracy (i.e., .472 + .527 = 1.0).

Estimates of the factor variances and covariance are reported in the bottom part of Table 14.3. For example, the unstandardized estimate for the variance of the sequential factor is 2.838. Exercise 8 asks you to verify that this factor variance corresponds to the common (explained) variance of its reference variable, Hand Movements. Likewise, the unstandardized variance for the simultaneous factor corresponds to the common variance of its reference variable, Gestalt Closure. The estimate of the covariance between the sequential and simultaneous factors is 1.271 (see the table). The corresponding result in the standardized solution is the estimated factor correlation, which is .557. As expected, the two cognitive ability factors are positively related and, more specifically, share .557², or about .310 (31%), of their variance with each other. The value for the factor correlation is only moderate in size, which suggests reasonable discriminant validity (i.e., the sequential and simultaneous factors are not identical), but see Rönkkö and Cho (2022) for alternative definitions of discriminant validity for factor models.

Local fit of the two-factor CFA model is considered next. Correlation residuals and standardized residuals are reported in Table 14.4. Absolute correlation residuals ≥ .10 and absolute standardized residuals that are significant at the .05 level (> 1.96) are shown in boldface. Many of the results just mentioned concern Hand Movements, an indicator of the sequential factor, and most of the indicators of the other factor, simultaneous. All of these results are positive, which says that the model underestimates the corresponding sample associations (correlations or covariances). Overall, these residuals are problematic and suggest poor local fit. Given all the results considered so far, the two-factor model in Figure 14.3 is rejected. Next, we consider options for respecifying this model.
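In lavaan, both kinds of residuals for fit2 can be obtained with the lavResiduals() function; the results correspond to Table 14.4:

res <- lavResiduals(fit2, type = "cor")
res$cov    # correlation residuals
res$cov.z  # standardized residuals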
RESPECIFICATION OF CFA MODELS

Respecification of CFA models is more challenging than for manifest-variable path models. This is because there are more possibilities for revising a CFA model, including (1) factor–indicator correspondence, (2) whether measurement is unidimensional or multidimensional, or (3) the number of factors in the model. What is no different in CFA, though, is that just as specification of the initial model should be guided by substantive considerations so, too, should its respecification. Recall MacCallum's (1995) admonition about the shift from more confirmatory analysis of the initial model to a more exploratory mode as respecified models are fitted to the same data: If any respecified model is retained, it should not be viewed as confirmed without fitting it to new data.


TABLE 14.4. Correlation Residuals and Standardized Residuals for a Two-Factor Model of the KABC-I

Indicator              1       2       3       4       5       6       7       8
Sequential scale
1. Hand Movements      —     –.591  –3.790   1.126   2.046   3.464   3.505   2.991
2. Number Recall     –.011     —     1.539  –2.329  –1.558   –.112   1.129  –2.002
3. Word Order        –.052    .018     —    –1.315  –1.001   –.355    .727    .524
Simultaneous scale
4. Gestalt Closure    .071   –.116   –.066     —      .429   –.784    .323    .910
5. Triangles          .119   –.057   –.037    .015     —     –.267   –.245    .677
6. Spatial Memory     .219   –.005   –.015   –.030   –.007     —      .664   –.144
7. Matrix Analogies   .227    .056    .035    .014   –.007    .024     —    –1.978
8. Photo Series       .174   –.061    .018    .027    .012   –.003   –.040     —

Note. Correlation residuals are listed below the diagonal, and standardized residuals are reported above the diagonal. KABC-I, Kaufman Assessment Battery for Children, first edition.

the same: Their values should prompt the researcher to free a fixed or constrained parameter in the initial CFA model only when (1) there is a substantive basis for doing so and (2) the value of the corresponding expected parameter change (EPC) has a meaningful interpretation (Brown, 2015). Respecifications made solely to improve model fit—that is, where there are no theoretical, empirical, or practical bases for the change—can capitalize so much on sampling error that any revised model may not replicate. With these cautions in mind, let's review basic respecification options in CFA.

Sometimes an indicator fails to have a substantial loading on the factor it is supposed to measure. One option is to specify that the indicator depends on a different factor. Inspection of the residuals can help to identify the other factor to which the indicator's correspondence may be switched. Suppose that the residuals between an indicator of factor A and those for indicators of factor B are positive and relatively large. This pattern suggests that the indicator may measure factor B better than it does factor A. An indicator can also have a relatively high loading on its own factor in the initial model but also have high residuals between it and the indicators of another factor. This pattern suggests that the indicator may measure both factors, which is one form of multidimensional measurement.

Sometimes indicators are known a priori to measure more than one domain. Suppose that the items of an engineering aptitude test are either text-based or require the interpretation of data graphics. The test yields a single total score, which may reflect both verbal reasoning and visual–spatial ability. If just the total score is available (i.e., responses to individual items are not saved in the data file), the corresponding indicator might be specified as measuring two factors. But if item responses are also available, another option is to derive separate total scores, one for the verbal items and another for the visual–spatial items, and specify that each subtotal depends on a different factor. Although specification of unidimensional measurement provides stronger tests of hypotheses about convergent validity and discriminant validity (Ziegler & Hagemann, 2015), there can be a priori or practical reasons to specify that an indicator assesses multiple factors.

Freeing parameters for one or more error covariances—also called a correlated uniqueness in factor analysis—is the second way to specify multidimensional measurement. Here, a pair of indicators is specified as sharing systematic variance that is not due to any factor in the model. Summarized next are substantive reasons for including an error covariance in an initial CFA model (Brown, 2015; Cole et al., 2007), with a syntax sketch after the list: The corresponding pair of indicators

1. share comparable stimuli, content, or item wording or come from the same task (e.g., Figure 3.1);
2. are based on the same source of information, such as parent informants, or based on the same measurement method, such as self-report;


3. are susceptible to the same response sets, or systematic differences in how participants answer questions regardless of their content (e.g., social desirability response set); or
4. are repeated measures variables with dependent (autocorrelated) error variances.
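In lavaan syntax, a correlated uniqueness is specified with the ~~ operator between two indicators; the variables in this sketch are hypothetical:

model.cu <- '
  A =~ X1 + X2 + X3
  B =~ X4 + X5 + X6
  X1 ~~ X2   # error covariance, e.g., for same-method indicators
'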
The failure to specify theoretically justifiable error covariances may not in some cases harm fit, but their omission can lead to undetected misspecifications that distort the interpretation of common factors as proxies for latent variables—see Cole et al. (2007) for examples. A downside is that post hoc specification of multidimensional measurement poses a risk: Each error covariance or factor loading added to a model "costs" only a single degree of freedom, which increases model complexity and probably improves its fit, too. But without cross-validating models that are so modified, it is unknown whether any results based on them generalize beyond the original sample, if a respecified model is retained.

The specification of multidimensional measurement also has consequences for model identification: Neither correlated errors nor indicator loadings on additional factors can be added to basic CFA models in any arbitrary pattern; that is, special identification rules are needed. For example, O'Brien (1994) described a set of rules for CFA models where every indicator loads on a single factor but some error covariances are freely estimated. These rules are applied "backwards," starting from patterns of independent (uncorrelated) pairs of error terms to prove the identification of loadings, then of factor covariances in models with two or more factors, and finally of measurement error covariances. The O'Brien (1994) rules work well for relatively simple factor models, but they can be awkward to apply to more complex models. A different set of identification rules by Kenny et al. (1998) that may be easier to apply to a wider range of models is described in Appendix 14.A.

More drastic respecification changes the number of factors in the initial CFA model. For example, poor discriminant validity is evidenced by estimated absolute factor correlations that are practically 1.0. This means that the two factors are basically identical and, thus, could be merged into a single factor with all the indicators from the two factors in the original model. That is, there are too many factors in the initial model. Poor convergent validity, where loadings for indicators of the same factor are not uniformly positive and relatively high in the standardized solution, suggests that the model has too few factors. Anyhow, respecifying the number of factors in the initial model indicates that (1) the original hypotheses were very wrong and (2) subsequent analyses are far more exploratory than confirmatory.

It is generally dubious practice in factor analysis, EFA or CFA, to automatically prune indicators from the model that fail to have statistically significant factor loadings. This is especially true in small samples, when the power of significance tests to detect population loadings that are not zero is low. A better criterion is to establish a priori a minimum standardized factor loading for a substantively meaningful association between indicator and factor. This threshold could vary over different research areas or levels of measurement for the indicators (e.g., higher target values for continuous than for categorical indicators). But avoid respecifying a measurement model—or any kind of structural equation model—based solely on results of statistical significance tests, which can greatly capitalize on chance.

Next, we consider respecification of the two-factor model for the KABC-I in Figure 14.3. Earlier we examined the residuals in Table 14.4 for this model. Most of the larger and positive residuals are between Hand Movements and other tasks specified to measure the other factor. Because the magnitude of the standardized loading for Hand Movements is at least moderate (.497; Table 14.3), it is possible that this task may measure both factors, among other possibilities for respecification. Listed in Table 14.5 are the six largest modification indexes and corresponding unstandardized and standardized EPC values for parameters fixed to zero in the original model (see the output file for analysis 2, Table 14.1). Note in the table that the MI statistics for the parameters

Simultaneous → HM and ENR ↔ EWO

are identical, 20.097. This means that allowing Hand Movements (HM) to also depend on the simultaneous factor or adding a covariance between the error terms of the Number Recall (NR) and Word Order (WO) tasks would reduce chiML by about 20 points. Let's consider each of these results in more detail.
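In lavaan, MI and EPC statistics come from the modindices() function. The sketch below sorts them for fit2 from the earlier analysis; the six largest values correspond to Table 14.5:

mi <- modindices(fit2)
mi <- mi[order(mi$mi, decreasing = TRUE), ]
head(mi[, c("lhs", "op", "rhs", "mi", "epc", "sepc.all")], 6)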


TABLE 14.5. Respecifications for the Six Largest Modification Indexes for a Two-Factor Model of the KABC-I

                                      Expected parameter change
Effect              MI       p        Unstandardized   Standardized
Simultaneous → HM   20.097   < .001      1.054            .421
ENR ↔ EWO           20.097   < .001      4.741           1.969
EHM ↔ EWO            7.013    .008      –1.746           –.348
Simultaneous → NR    7.013    .008       –.501           –.289
EHM ↔ ESM            4.847    .028       1.609            .173
EHM ↔ EMA            3.799    .051        .995            .150

Note. The standardized estimate of 1.969 is invalid (out of range). KABC-I, Kaufman Assessment Battery for Children, first edition; MI, modification index; HM, Hand Movements; NR, Number Recall; WO, Word Order; SM, Spatial Memory; MA, Matrix Analogies; PS, Photo Series.

In Table 14.5, it is estimated that the standardized loading of Hand Movements on the simultaneous factor would increase from zero in the original basic model (i.e., Figure 14.3) to .421 in a respecified model where Hand Movements loads on both factors (i.e., the standardized EPC). This value is nearly as large as the actual standardized loading of this task on the sequential factor in the original model, or .497 (Table 14.3). Note in Table 14.5 that the standardized EPC for allowing the errors of the Number Recall and Word Order tasks to covary, or 1.969, is invalid because it is out of range in a correlation metric. Remember that both MI and EPC values are merely estimates, and sometimes these estimates are simply wrong: The actual result is .531 in the standardized solution for the CFA model where the error correlation between the two tasks just mentioned is freely estimated.⁹

⁹ Just as for correlated disturbances, a positive error correlation in a measurement model indicates that increases in one or more unmeasured common causes increase scores on both indicators. A negative correlation says that scores on one indicator increase while scores on the other indicator decrease, given an increase in an unmeasured common cause.

Among other changes suggested by the results in Table 14.5, the next two have exactly the same MI value (7.013): allow the errors of the Hand Movements and Word Order tasks to covary, or respecify that Number Recall loads on both the sequential and simultaneous factors. The last two results in Table 14.5 correspond to freeing the parameters for two different error covariances, but the estimated reductions in chiML and the magnitudes of the standardized EPC values are relatively smaller for these respecifications. Given all these estimates and based on my knowledge of the KABC-I (Kline et al., 1996) and results of other factor-analytic studies (Keith, 1985), specifying that Hand Movements measures both factors is plausible. Although the task requires exact reproduction of a sequence of hand movements performed by the examiner, there is an obvious visual–spatial component to the task that could also reflect simultaneous processing. For practice, you could fit the respecified model just described to the data in Table 14.2 and check the results. (Hint: Major fit problems in the original model are mainly cleared up in the respecified model just described.)
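A sketch of that respecified model, reusing kabc.cov from the earlier analysis, is listed next:

model3 <- '
  sequential   =~ HM + NR + WO
  simultaneous =~ GC + Tr + SM + MA + PS + HM  # HM loads on both factors
'
fit3 <- cfa(model3, sample.cov = kabc.cov, sample.nobs = 200)
fitMeasures(fit3, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))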
ESTIMATION PROBLEMS

The issues, complications, and possible remedies in the analysis phase of CFA are considered next.

Empirical Underidentification

Recall that empirical underidentification occurs when the data provide too little information to estimate all free parameters in a structural equation model (Chapter
9). This can happen if sample estimates for certain key parameters are very close to either zero or one, among other possibilities. Consequently, the analysis may fail due to nonconvergence or inadmissible solutions even when the model is theoretically identified. Three examples of empirical underidentification in CFA are listed next:

1. A basic single-factor CFA model requires three indicators to be just-identified (Rule 14.1). But if one of the factor loadings has a value that is close to zero, the analysis may fail because the model has basically just two indicators, which is insufficient for identification with no additional constraints.

2. Suppose that the estimated factor covariance is close to zero for a basic two-factor model where each factor has two indicators, the minimum required for identification (Rule 14.1). The virtual absence of the covariance parameter transforms the original model into two separate single-factor, two-indicator models, each of which is underidentified.

3. In Figure 14.8(f), indicator X3 depends on two factors, but if the estimated factor correlation is close to 1.0, the loading or error variance for this indicator may be empirically underidentified—see Appendix 14.A.

Other possible causes of empirical underidentification include (1) violation of normality or linearity when using normal theory methods such as default ML; (2) extreme multicollinearity among observed variables; and (3) specification errors (Rindskopf, 1984).

Convergence Failure or Inadmissible Solutions

Failure of iterative estimation in CFA can be caused by poor starting values; see Topic Box 14.2 for suggestions. Nonconvergence or inadmissible solutions are more likely when there are only two indicators per factor or the sample size is less than 100–150 cases (Marsh & Hau, 1999). Tips for analyzing CFA models and other kinds of structural equation models in small samples are offered in Chapter 17.

Inadmissible solutions include Heywood cases, such as negative variance estimates for indicator error terms or estimated absolute correlations or squared correlations (e.g., R²) that exceed 1.0. Solution inadmissibility can also occur at the parameter matrix level. The computer estimates in CFA a factor covariance matrix and an error covariance matrix. If any element of either matrix is out of bounds, the whole matrix is nonpositive definite, and this can happen even though no individual element is a Heywood case.
Appendix 14.A.

TOPIC BOX 14.2

Starting Values for Measurement Models


These recommendations assume that all variables, including the factors, are unstandardized. In the refer‑
ence variable method of scaling factors, initial estimates of factor variances should probably not exceed
90% of the observed (sample) variance for the corresponding reference variable. In the effects coding
method, all indicators of the same factor share the same raw score metric, so the starting value for the
factor variance should be less than 90% of the average observed variance over all indicators. The starting
values for covariances follow the initial estimates of their variances; that is, they are the product of each
factor’s standard deviation (the square root of the initial estimates of their variances) and the expected
correlation between them.
If indicators of the same factor have variances similar to that of the reference variable, then start‑
ing values of their factor loadings could be 1.0. If the reference variable is, say, one-tenth as variable as
another indicator, the initial estimate of the other indicator’s loading could be 10.0. Conservative starting
values for indicator error variances could be 90% of the observed variance of the associated
indicator, which assumes that only 10% of the variance will be explained. Bentler (2006) suggested it is
probably better to overestimate the variances of factors and error variances than to underestimate them in
terms of starting values.
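In lavaan, user-supplied starting values are given with the start() modifier. The following sketch is hypothetical and, for simplicity only, assumes indicators with observed variances near 1.0:

model.sv <- '
  A =~ 1*X1 + start(1.0)*X2 + start(1.0)*X3
  A  ~~ start(0.9)*A    # about 90% of the reference variable variance
  X2 ~~ start(0.9)*X2   # conservative: most variance left as error
'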


The causes of nonpositive definite parameter matrices include the following (Wothke, 1993):

1. The data provide too little information, that is, empirical underidentification that involves factor variances, covariances, or loadings.
2. The model is overparameterized (too many free parameters).
3. The sample has outliers or severely nonnormal distributions.
4. The measurement model is misspecified.

Would you like to see an example of a Heywood case in a real analysis? Sabatelli and Bartle-Haring (2003) administered to each spouse in a total of 103 married couples three indicators of family-of-origin experiences (FOE) and two indicators of marital adjustment. The FOE indicators are retrospective measures of the perceived quality of each spouse's relationship with his or her own mother or father and of the relationship between their parents while growing up. The adjustment ratings concern problems and intimacy in the marital relationship. Higher scores on all measures suggest more respect, empathy, and tolerance for individuality (FOE) and higher levels of intimacy and fewer complaints about problems in the marital relationship.

Reported in Table 14.6 are the summary statistics for the sample of wives only. For analysis 3 in Table 14.1, a two-factor model where the FOE factor has three indicators and the marital adjustment factor has two indicators is fitted to the data in Table 14.6. Although the basic CFA model just described is identified (Rule 14.1), the estimate in lavaan for the error variance of the intimacy indicator is –39.892. The standardized loading for the same indicator, 1.006, is also invalid: This statistic for a continuous indicator that depends on a single factor is interpreted as a Pearson correlation, which cannot exceed 1.0 in absolute value. The lavaan output contains the warning that some estimated variances are negative; otherwise, there is a complete solution with standard errors and residuals, and the model passes the chi-square test—chiML(4) = 4.688, p = .321—but the Heywood cases render the whole solution inadmissible.
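
Readers who want to reproduce this solution can fit the model to the summary data in Table 14.6. A minimal lavaan sketch follows; the variable names are my own labels for the table rows, and the helpers getCov() and cor2cov() build the covariance matrix from the reported correlations and standard deviations. Fitting the model should trigger the negative-variance warning just described:

library(lavaan)

# Correlations from Table 14.6 (lower triangle), wives only, N = 103
lower <- '
  1
  .740  1
  .265  .422  1
  .305  .401  .791  1
  .315  .351  .662  .587  1 '
vnames <- c("problems", "intimacy", "father", "mother", "fathermother")
R   <- getCov(lower, names = vnames)
sds <- c(32.936, 22.749, 13.390, 13.679, 14.382)
S   <- cor2cov(R, sds)                 # covariance matrix

model <- '
  adjust =~ problems + intimacy              # marital adjustment
  foe    =~ father + mother + fathermother   # family-of-origin
'
fit <- cfa(model, sample.cov = S, sample.nobs = 103)
summary(fit, standardized = TRUE)
# The error variance for intimacy should be about -39.9 (a Heywood
# case) with a standardized loading of about 1.006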
Kolenikov and Bollen (2012) noted that although negative variances for observed variables with actual scores are impossible, error terms in CFA models are not observed; instead, they are estimated. In global estimation, it can happen that negative variance estimates for unmeasured variables occur within a larger solution that satisfies the fit function for a particular estimation method. Assuming that the model is identified, the possibilities that the Heywood case is due to misspecification or empirical underidentification were mentioned earlier in this section. A third possibility is that the negative variance estimate is within the bounds of sampling error for a population error variance that is positive but close to zero.

TABLE 14.6. Input Data (Correlations, Standard Deviations) for Analysis of a Two-Factor Model with Heywood Cases

Indicator              1        2        3        4        5
Marital adjustment
1. Problems            —
2. Intimacy           .740      —
Family-of-origin experiences
3. Father             .265     .422      —
4. Mother             .305     .401     .791      —
5. Father–Mother      .315     .351     .662     .587      —
M                  161.779  138.382   86.229   86.392   85.046
SD                  32.936   22.749   13.390   13.679   14.382

Note. These data are from S. Bartle-Haring (personal communication, June 3, 2003), N = 103. Means are reported but not analyzed.


This conclusion would be consistent with the observation that a confidence interval around the offending variance estimate includes zero, which is true in the example analysis of the data in Table 14.6 for a two-factor CFA model: In the output for analysis 3 in Table 14.1, the estimated error variance for the intimacy indicator is –39.892, its standard error is 109.200, and the 95% confidence interval is

[–253.921, 174.137]

which includes zero as an estimate for the population error variance. A caveat here is that the standard error or confidence interval for the Heywood case could themselves be wrong, but Kolenikov and Bollen (2012) described alternative methods for estimating standard errors in this context. Larger negative error variances that exceed the limits of sampling error could indicate misspecification, but there is no quick fix, such as constraining variance estimates to be at least zero, that fundamentally solves the problem for a given estimator. Use of alternative methods to analyze these data with no Heywood case is described in Chapter 17.

Empirical Checks for Identification

It is theoretically possible for the computer to generate a converged, admissible solution for a model that is not really identified, yet print no warning or error message. This is most likely to happen in CFA when analyzing very complex models with multiple error covariances or indicators that load on more than one factor for which the application of heuristics cannot prove identification. Described next are empirical tests for solution uniqueness that can be applied when analyzing any type of structural equation model. These tests concern necessary but insufficient requirements; that is, if any test is failed, the solution is not unique, but passing does not prove identification (the first and third checks are sketched in code after this list):

1. A second analysis of the same model is done, but different starting values than in the first analysis should be used (e.g., computer defaults vs. user-specified starting values). If estimation in the second analysis converges to a different solution working from different initial estimates, the model is not identified.

2. This check applies to overidentified models only (i.e., dfM > 0): Use the predicted covariance matrix from the first analysis as the input data for a second analysis of the same model. If the second analysis does not generate the same parameter estimates as the first, the model is not identified.

3. If the model is identified, the Fisher information matrix of covariances among the parameter estimates has an inverse. Correlations among the parameter estimates are derived from this matrix. A problem is indicated if any of these absolute correlations is close to 1.0, which suggests extreme linear dependency. Bollen and Bauldry (2011) described additional empirical checks for identification.
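
A minimal lavaan sketch of the first and third checks follows; the two-factor model, indicator names, data frame (mydata), and starting values are all hypothetical:

library(lavaan)

model <- 'A =~ x1 + x2 + x3
          B =~ x4 + x5 + x6'

# Check 1: refit with user-specified starting values and compare
fit1 <- cfa(model, data = mydata)          # default starting values
model2 <- 'A =~ x1 + start(0.5)*x2 + start(2)*x3
           B =~ x4 + start(0.5)*x5 + start(2)*x6'
fit2 <- cfa(model2, data = mydata)
round(coef(fit1) - coef(fit2), 4)  # nonzero differences between two
                                   # converged solutions suggest
                                   # nonidentification

# Check 3: correlations among the parameter estimates, derived from
# the inverted information matrix; absolute values near 1.0 suggest
# extreme linear dependency
round(cov2cor(vcov(fit1)), 2)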
Local Estimation with Instruments

Most CFA models for continuous indicators described in the literature are analyzed with simultaneous methods such as default ML for summary data or raw data while assuming multivariate normality or robust ML for raw data only but with no requirement for multinormality. Recall that although global estimation is more efficient in large samples for correctly specified models, local estimation may better isolate error due to misspecified parts of the model than global estimation (Chapter 8). Global methods also require identified models, and iterative estimation in such methods does not always converge to admissible solutions.

Although local estimation (i.e., single-equation, limited information) is usually reserved for manifest-variable path models, Bollen (2019) and Bollen, Fisher, Giordano, et al. (2022) described an option for models with common factors called model-implied instrumental variables using two-stage least squares (MIIV-2SLS). As the name suggests, the method locates instruments for every model parameter and then separately estimates each parameter in a 2SLS regression analysis. These instruments are already part of the CFA model; that is, they are not external variables that are added to the model to function as instruments. Unlike default ML, the 2SLS estimator is neither iterative nor does it assume multivariate normality. If the whole model is underidentified, the 2SLS method can still be applied to individual parameters that are identified.

The MIIV-2SLS method is available in the R package MIIVsem (Fisher et al., 2021). It works by replacing all common factors in the model with their respective reference variables but with no error terms. After all such latent-to-observed (L2O) transformations are complete, the resulting model is based on observed variables only but with a more complex error covariance structure for each equation. Next, instruments are located in the respecified model, and these instruments are applied to generate 2SLS estimates for parameters in the original model, including those for common factors or error terms (Bollen, 1996).
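
A minimal MIIVsem sketch follows; the model and data frame are hypothetical. The miivs() function lists the model-implied instruments for each equation, and miive() produces the equation-by-equation 2SLS estimates:

library(MIIVsem)

model <- 'A =~ x1 + x2 + x3
          B =~ x4 + x5 + x6'

miivs(model)                       # model-implied instruments, by equation
fit <- miive(model, data = mydata) # equation-by-equation 2SLS estimation
fit  # printing the fitted object shows the 2SLS estimates and, for
     # overidentified equations, Sargan test results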


If more than the minimum number of instruments is available for a particular parameter not associated with a reference variable, the Sargan overidentification test, which determines if all instruments are uncorrelated with the equation error, is conducted (Sargan, 1958). If the Sargan test for an equation is significant, then a specification error is suggested because the model-implied independence of instruments and error terms is not supported in the data. But the Sargan test is not foolproof: Jin (2022) described situations where the test can fail to detect nonzero omitted parameters, and preventing such misdirections is statistically challenging. Such limitations emphasize the importance of model specification over analysis. In Chapter 17, the MIIV-2SLS method is applied to estimate an admissible solution to the two-factor CFA model fitted to the data in Table 14.6 that generated a Heywood case in default ML estimation (analysis 3, Table 14.1). Chen et al. (2023) described additional estimation options for structural equation models with common factors.

EQUIVALENT CFA MODELS

There are two sets of principles for generating equivalent CFA models—one for models with multiple factors and another for single-factor models. Consider the two-factor model of self-perceived ability and academic plans by Kenny (1979) presented in Figure 14.4(a). I used the method of constrained estimation in the SEPATH procedure of STATISTICA (TIBCO Statistica, 2022) to fit this model to the correlation matrix reported in a sample of 556 grade 8 students presented in Table 14.7. Values of selected global fit statistics are listed next:

chiML(8) = 9.256, p = .321
RMSEA = .012, 90% CI [.017, .054]
CFI = .999, SRMR = .012

The other three models in Figure 14.4 are equivalent versions of the original model that yield the same values of global fit statistics, predicted correlations, and residuals. Figure 14.4(b) features replacement of the factor correlation with the specification that two indicators, educational aspiration and college plans, depend on both factors. In Figure 14.4(c), the plans factor in the original model was swapped for a correlation between the two indicators of this factor and the specification that all six indicators are affected by a single common factor. Figure 14.4(d) is a hierarchical CFA model in which the covariance between the ability factor and the plans factor in the original model is replaced by a second-order general factor (A), which has no observed variables as indicators and is presumed to directly affect the first-order factors (ability, plans). This specification provides an account of why the two first-order factors (which are endogenous in this model and have disturbance terms) covary. Because the general factor has only two indicators (i.e., the ability and plans factors), it is necessary to constrain both its unstandardized direct effects on the first-order factors to 1.0, if the variance of the general factor A is freely estimated. Hierarchical CFA models are described later in this chapter.

The situation involving equivalent versions of CFA models with multiple factors is even more complex than suggested by the last example. It is possible to apply the replacing rules (Rules 11.1–11.2) to substitute factor correlations with direct effects, which makes some factors endogenous. The resulting model is an SR model, but it will fit the data equally well. For example, substitution of the factor correlation in Figure 14.4(a) with a direct effect between the two factors generates an equivalent SR model. Raykov and Marcoulides (2001) showed that there are an infinite number of versions of basic CFA models. For each equivalent model, the factor correlations are eliminated (i.e., the factors are assumed to be orthogonal) and replaced by one or more factors with fixed loadings (1.0) for all indicators. These added factors replace the factors in the original model as common sources of variation among the indicators in the equivalent versions that all explain the same data just as well.

Equivalent versions of basic single-factor CFA models can be derived using Hershberger and Marcoulides's (2013) reversed indicator rule, where one indicator is specified as causal and the rest remain as effect indicators. Consider the CFA model of reading in Figure 14.5(a). The effect indicators represent phonological processing, word and letter recognition, and word attack skills (the ability to recognize and analyze a printed word to connect it to the spoken word). The equivalent version in Figure 14.5(b) specifies phonological ability as a cause of the reading factor, which is now endogenous. This respecified version is a MIMIC model with both cause and effect indicators, and its fit to the data would be identical to that of the original single-factor CFA model. A total of three additional equivalent models could be generated, but the one with phonological ability as a causal indicator is plausible (Hulme & Snowling, 2016).


[Figure 14.4 panels: (a) Original; (b) Equivalent 1; (c) Equivalent 2; (d) Equivalent 3.]

FIGURE 14.4. Four equivalent CFA models of self-perceived ability and educational plans shown in compact graphical symbolism for indicator error terms (a–d) and disturbances of endogenous factors (d). For all models, chiML(8) = 9.256, p = .321 when fitted to the data in Table 14.7 using a method for constrained estimation. AS, ability self-concept; PTE, perceived teacher evaluation; PPE, perceived parental evaluation; PFE, perceived friends' evaluation; EA, educational aspiration; CP, college plans. Scaling constants (1) for error or disturbance terms are assumed.

TABLE 14.7. Input Data (Correlations) for Analysis of a Two-Factor Model of Perceived Ability and Educational Plans

Variable                           1     2     3     4     5     6
1. Ability Self-Concept            —
2. Perceived Parental Evaluation  .73    —
3. Perceived Teacher Evaluation   .70   .68    —
4. Perceived Friends' Evaluation  .58   .61   .57    —
5. Education Aspiration           .46   .43   .40   .37    —
6. College Plans                  .56   .52   .48   .41   .71    —

Note. Input data are from Kenny (1979); N = 556.


[Figure 14.5 panels: (a) Original (CFA), with Phonological Ability, Word Recognition, Letter Recognition, and Word Attack as effect indicators of a Reading factor; (b) Equivalent (MIMIC), with Phonological Ability as a cause of the now endogenous Reading factor.]

FIGURE 14.5. Application of the reversed indicator rule to generate an equivalent single-factor model of reading. CFA, confirmatory factor analysis; MIMIC, multiple indicators–multiple causes. All models shown in compact graphical symbolism where scaling constants (1) for error or disturbance terms are assumed.
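
The equivalence of the two versions in Figure 14.5 can be verified directly in lavaan. A minimal sketch follows, with hypothetical variable names; both fits should yield identical values of global fit statistics:

library(lavaan)

# (a) Original: single factor with four effect indicators
cfa_model <- 'reading =~ phon + wordrec + letterrec + wordattack'

# (b) Equivalent MIMIC version: phonological ability is a cause
# indicator of the reading factor, which is now endogenous
mimic_model <- 'reading =~ wordrec + letterrec + wordattack
                reading ~ phon'

fit_cfa   <- cfa(cfa_model, data = mydata)
fit_mimic <- sem(mimic_model, data = mydata)
fitMeasures(fit_cfa,   c("chisq", "df"))
fitMeasures(fit_mimic, c("chisq", "df"))  # same chi-square and df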

SPECIAL TESTS WITH EQUALITY CONSTRAINTS

Congeneric indicators in basic CFA models depend on a common factor and their error variances are independent. Also, their factor loadings (except for scaling constants) and error variances are all free parameters. For example, the two sets of indicators for the sequential and simultaneous factors in Figure 14.3 are congeneric if no constraints are imposed on their loadings and error terms. Although congeneric indicators are assumed to measure the same dimension, both the magnitudes of their association with their common factor, or factor loadings, and their error variances are allowed to differ.

A CFA model with tau-equivalent indicators is nested under the version with congeneric indicators. It features equality-constrained loadings for each set of indicators that depend on the same factor, but their error variances are free parameters. This means that (1) a one-unit change in their common factor predicts the same amount of change in each indicator, but (2) their error variances may be unequal (Brown, 2015). The alpha coefficient for internal consistency assumes tau-equivalent indicators (Topic Box 14.1). Recall that equality constraints are generally imposed in the unstandardized solution. This means that although indicators specified as tau-equivalent have equal unstandardized factor loadings, their standardized loadings may differ.

An even more restricted model assumes parallel indicators, which have both equal loadings on their common factor and equal error variances. Parallel tests measure a common factor in the same way (equal loadings) and with the same degree of precision (equal unique variances). Such measures are psychometrically interchangeable in use. An example is when two or more parallel forms are developed to avoid participants retaking the same test in follow-up assessments: The same thing is measured with equal precision over parallel versions, and practice effects due to administering the same items or stimuli twice are avoided. In a model for parallel indicators, the loadings for indicators of the same factor will be equal in the unstandardized solution, as will be the error variances for the same indicators. This is because there are now constraints related to both causes of all indicators in basic CFA models: their common factors and their error terms. A model with parallel indicators is nested under the tau-equivalent version for the same indicators.

Testing whether sets of indicators are congeneric, tau-equivalent, or parallel begins by fitting a basic CFA model to the data with no equality constraints imposed on loadings or error variances. If the model for congeneric indicators just described is retained, the more restrictive tau-equivalence model with equality-constrained loadings for indicators of the same factor is fitted to the same data. Because models for congenerity and tau-equivalence are hierarchically related, their relative fit can be compared with the chi-square difference test. If the model for tau-equivalence (equal factor loadings) is retained, then the analysis proceeds in the same way except that the next model tested assumes parallel indicators with equal factor loadings and equal error variances. The model for parallel indicators is nested under the model for tau-equivalent indicators. Note that all of the tests just described assume independent errors, and the data must be fitted to a covariance matrix, not a correlation matrix—see Brown (2015, chap. 7) for examples.


Vanishing tetrads are a kind of overidentifying restriction for factors with at least four continuous indicators and no error correlations. In his pioneering work on factor analysis, Spearman (1904) showed that differences in products of covariances or correlations between certain pairs of indicators must be zero, if all indicators depend on the same factor (i.e., reflective measurement). For continuous indicators X1–X4 of the same factor, there are three vanishing tetrads in a correlation metric:

r12 r34 – r13 r24 = 0    (14.5)
r13 r24 – r14 r23 = 0
r14 r23 – r12 r34 = 0

where r12 represents the population correlation between indicators X1 and X2, and so on. If any two restrictions hold in Equation 14.5, the third must also be true, so there are actually just two independent overidentifying restrictions. For each factor with k ≥ 4 indicators, there are a total of k(k – 3)/2 orthogonal overidentifying restrictions. Kenny and Milan (2012) described additional types of between- or within-factor vanishing tetrads for CFA models.
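
As a quick numerical illustration of Equation 14.5, the sketch below computes the three sample tetrads for the four indicators of the ability factor in Table 14.7 using plain R arithmetic:

# Sample correlations among indicators 1-4 from Table 14.7
r12 <- .73; r13 <- .70; r14 <- .58
r23 <- .68; r24 <- .61; r34 <- .57

c(t1 = r12*r34 - r13*r24,  # .4161 - .4270 = -.0109
  t2 = r13*r24 - r14*r23,  # .4270 - .3944 =  .0326
  t3 = r14*r23 - r12*r34)  # .3944 - .4161 = -.0217
# Values near zero are consistent with a single common factor; the
# three tetrads always sum to zero, and formal tests (e.g., the VTT,
# described next) gauge whether departures exceed sampling error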
Glymour et al. (1987) described exploratory tetrad analysis (ETA), a computer-based search algorithm that attempts to locate unidimensional reflective measurement models based on observed vanishing tetrads among at least four variables that together comprise a congeneric indicator set. This search algorithm is implemented in the freely available program TETRAD (Spirtes et al., 2022), which is compiled as a Java application. Bollen and Ting (1993) described confirmatory tetrad analysis (CTA), which, unlike ETA, requires a priori measurement models. Both local and global fit can be evaluated by examining vanishing tetrads for continuous variables with either normal or nonnormal distributions. Because vanishing tetrads are estimated from observed variables, the technique of CTA may be useful for analyzing measurement models that are not identified in CFA.

Bollen and Ting (2000) described the vanishing tetrad test (VTT) as an empirical check of effect indicators versus causal indicators based on vanishing tetrads: If absolute sample estimates for Equation 14.5 exceed zero beyond the limits of sampling error, then the null hypothesis of reflective measurement with effect indicators is rejected. In contrast, a formative measurement model with causal indicators implies no vanishing tetrads, and models with both effect and causal indicators or correlated errors imply still different patterns of vanishing tetrads. The same authors also cautioned against exploratory application of the VTT, especially when the number of observed variables is large: Not only could the sheer number of such tests be large, but some of the significant results could very well be Type I errors (true reflective models are rejected). I agree, and you should not be surprised to hear me say (again) that no researcher should blindly rely on significance testing to specify a CFA model (or any other kind of statistical model). Wang and Finn (2016) demonstrated application of the VTT to evaluate indicators of consumer-based brand loyalty as either reflective or formative. Although reflective measurement characterized this concept at the brand level, formative measurement was more consistent with the data at the consumer level.

MODELS FOR MULTITRAIT–MULTIMETHOD DATA

The technique of CFA can be used to analyze data from a multitrait–multimethod (MTMM) study, the logic of which was first articulated by Campbell and Fiske (1959). In such a study, at least two different traits are each measured using at least two different methods. Traits are hypothetical concepts about stable characteristics, such as cognitive abilities, and methods refer to multiple-test forms (e.g., computer-administered vs. paper and pencil), ways of collecting data (e.g., self-report vs. observational), or informants (e.g., parents vs. teachers), among other possibilities. The goals in an MTMM study are to (1) evaluate the convergent validity and discriminant validity of tests that vary in their measurement method and (2) derive separate estimates


for the effects of traits versus methods on observed scores. See the Psychometrics Primer on this book's website for a summary of the MTMM approach to estimating convergent validity and discriminant validity while controlling for common method bias, which provides a rival explanation of observed correlations between measures (Podsakoff et al., 2003).

The earliest procedure for analyzing data from an MTMM study involved inspection of correlation matrices for all variables. For example, convergent validity would be indicated by the observation of appreciably high correlations among variables that supposedly measure the same trait but with different methods. If correlations among variables that should measure different traits but use the same method are relatively high, then common method effects are suggested. This implies that correlations between different variables based on the same method may be relatively high even if they measure unrelated traits.

The failure to control for common method effects in CFA (and EFA, too) can inflate the values of estimated factor correlations. This is because any extra systematic variation due to methods instead of traits is allotted to the factors. Consequently, values of factor loadings for indicators based on the same method can be inflated. Correlations between indicators of different factors can also be inflated, if those indicators share a common method. Thus, common method bias can lead to the false conclusion that (1) evidence for discriminant validity is poor for indicators that depend on different factors or (2) evidence for convergent validity is strong for indicators that depend on the same factor (Brown, 2015). There are a variety of special CFA models for MTMM data that control for common method bias, some of which are described next—see Widaman (1985) or Podsakoff et al. (2003) for taxonomies of models for MTMM data.

Early applications of CFA to MTMM data in the 1970s generally corresponded to the correlated traits–correlated methods (CTCM) model like the one in Figure 14.6(a). Such models have separate trait and method factors that are assumed to covary, but method factors are assumed to be independent of trait factors. In the figure, variables X1–X3 are based on one method, X4–X6 rely on a different method, and X7–X9 correspond to a third method. The model also specifies that the set of indicators (X1, X4, X7) measures one trait but each of the two other sets, (X2, X5, X8) and (X3, X6, X9), measures a different trait. Given these specifications, high loadings for trait factors would suggest convergent validity, high loadings for method factors would indicate common method effects, and moderate correlations (not too high) between the trait factors would indicate discriminant validity.

A drawback of the CTCM model is that it assumes effects of the same method are uniform over all measures based on that method; that is, it does not allow for trait-specific method effects, such as when traits differ in their proneness to response sets such as acquiescence, or the tendency to agree with test items regardless of their content (Eid et al., 2003). For example, self-reports about positive traits, such as resilience, might be more susceptible to acquiescence response sets than about negative traits, such as narcissism. Another problem is that analyses of CTCM models can be plagued by convergence problems or inadmissible solutions. For example, Marsh and Bailey (1991) found in computer simulations that aberrant estimates were derived about 75% of the time for CTCM models. Kenny and Kashy (1992) noted part of the problem: CTCM models are not identified if the loadings on trait or method factors are equal. If the loadings are different but similar in value, then CTCM models may be empirically underidentified.

Some simpler alternatives to CTCM models that may avoid estimation problems have been proposed, including those with multiple but uncorrelated method factors, models with fewer method factors than the total of methods used to collect the MTMM data, and the model presented in Figure 14.6(b), which is a correlated-uniqueness (CU) model (Marsh & Grayson, 1995). For the same indicators, traits are represented in CU models exactly as they are in CTCM models—compare Figures 14.6(a) and 14.6(b). But method effects in CU models are assumed to be a property of each indicator, and relatively high error covariances between indicators based on the same method are taken as evidence for common method effects. An advantage is that converged, admissible solutions are more likely in analyses of CU models than for CTCM models. A drawback is that because method effects are shunted off to the error terms in CU models, their estimates are a mishmash of systematic indicator variance, random influence, and common method effects that can be greatly biased (Lance et al., 2002). Another issue is that method effects modeled as error variance are assumed to be zero on average, but method effects could lower or raise scores for most cases (i.e., the mean is not zero) (Pohl & Steyer, 2010). There are still more types of CFA models for MTMM data—see Topic Box 14.3 for examples—


[Figure 14.6 panels: (a) Correlated traits–correlated methods, with method factors Method 1–Method 3 and trait factors Trait 1–Trait 3 for indicators X1–X9; (b) Correlated uniqueness, with trait factors Trait 1–Trait 3 and error covariances among same-method indicators.]

FIGURE 14.6. A correlated traits–correlated methods model (a) and a correlated uniqueness model (b) for multitrait–multimethod data shown with compact symbolism for indicator error terms.
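
A minimal lavaan sketch of both specifications in Figure 14.6 follows, assuming the trait and method assignments described in the text (methods: X1–X3, X4–X6, X7–X9; traits: {X1, X4, X7}, {X2, X5, X8}, {X3, X6, X9}); indicator names and the data frame are hypothetical:

library(lavaan)

# (a) CTCM: trait factors covary, method factors covary, but traits
# and methods are specified as independent
ctcm <- '
  T1 =~ x1 + x4 + x7
  T2 =~ x2 + x5 + x8
  T3 =~ x3 + x6 + x9
  M1 =~ x1 + x2 + x3
  M2 =~ x4 + x5 + x6
  M3 =~ x7 + x8 + x9
  T1 ~~ 0*M1 + 0*M2 + 0*M3
  T2 ~~ 0*M1 + 0*M2 + 0*M3
  T3 ~~ 0*M1 + 0*M2 + 0*M3
'
fit_ctcm <- cfa(ctcm, data = mydata)  # per the text, often fails to
                                      # converge or is inadmissible

# (b) CU: same trait factors, but method effects appear as error
# covariances among indicators that share a method
cu <- '
  T1 =~ x1 + x4 + x7
  T2 =~ x2 + x5 + x8
  T3 =~ x3 + x6 + x9
  x1 ~~ x2 + x3
  x2 ~~ x3
  x4 ~~ x5 + x6
  x5 ~~ x6
  x7 ~~ x8 + x9
  x8 ~~ x9
'
fit_cu <- cfa(cu, data = mydata)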


and that is one of the challenges for researchers in this area: matching theory with the appropriate statistical model. Rönkkö and Cho (2022) offered suggestions for how to operationalize and assess discriminant validity in situations where variables are measured with a single method instead of multiple methods as in MTMM studies.

TOPIC BOX 14.3

Other Types of CFA Models for MTMM Data

Eid et al. (2008) described MTMM models for interchangeable versus structurally different methods. Interchangeable methods (raters) are structurally equivalent. Examples include customers randomly selected to evaluate a product or students selected in the same way to rate an instructor. The raters here are interchangeable because they all have access to the same product or teacher, and which customers or students are randomly selected should not matter. Structurally different methods (raters) are not randomly selected from a larger set of equivalent raters. Examples include coworkers versus managers who rate the job performance of the same employees: These different raters are not randomly chosen and do not share the same perspective, so they are structurally different.

Models for interchangeable methods described by Eid et al. (2008) are based on principles of multilevel analysis, where raters such as students are nested within higher-level units such as teachers. In contrast, the same authors recommended correlated traits–correlated methods minus one (CTC(M–1)) models that directly compare structurally different methods. In CTC(M–1) models, one method is selected as the reference method for which no explicit method factor is specified, but trait-specific factors are specified for all other methods. Thus, a score on a method factor estimates the deviation of a true value for a trait score by a specific method versus the prediction based on the reference method. For example, if parents, teachers, and adolescents all report on the adjustment of the adolescents, then the amount of over- or underestimation by parents and teachers can be compared with self-reports from the adolescents. A limitation is that the fit and estimates of factor variances in analyses of CTC(M–1) models may not be invariant over choice of the reference method. Pohl and Steyer (2010) described alternative models that directly compare different methods without the limitation just mentioned; their models also allow average method effects to depart from zero.

Eid et al. (2008) described additional MTMM models for situations where both interchangeable and structurally different measurement methods are employed. Bauer et al. (2013) described trifactor models for integrating data obtained from multiple informants or raters, perhaps for a single domain (e.g., monotrait–multimethod data). The data are analyzed at the item level, and a trifactor model includes a common factor that represents the consensus view of the raters, separate factors that represent unique views or biases of raters, and specific factors for each item. There are even more variations of CFA models for MTMM data, but researchers should select among them based on substantive, not data-driven, reasons. One reason is that MTMM models with no constraints can be so highly parameterized that they would fit a wide variety of sample data, so analysis by itself may be insufficient to detect misspecification. Williams and McGonagle (2016) described research designs and analysis strategies for estimating common method variance for self-report measures, and Geiser et al. (2015) described possible consequences of specifying the wrong model for MTMM data. See Eid et al. (2023) for more information and examples.


SECOND-ORDER AND BIFACTOR MODELS WITH GENERAL FACTORS

All CFA models described to this point featured first-order factors with indicators that are observed variables. These first-order factors are typically specified to covary (e.g., Traits 1 and 2 in Figure 14.6), but that association is unanalyzed in the sense that no causal hypothesis that explains these associations is represented in the model. In a second-order CFA model—also called a hierarchical CFA model—the indicators of a second-order factor are first-order factors, and the second-order factor is a general factor that explains covariances among its indicators, the first-order factors. That is, a second-order CFA model is a causal model in which the researcher makes specific claims about directions or patterns of covariation among first-order factors (Brown, 2015).

Presented in Figure 14.7(a) is a second-order model that represents the hypothesis that a general factor, g, causes each of the three first-order factors, A–C. The first-order factors have observed variables as indicators, but no observed variables are directly caused by g. Instead, the indicators of the general factor are the three first-order factors. This specification (1) reflects the hypothesis that g explains any covariances among the first-order factors. It also (2) defines the first-order factors as endogenous variables with disturbances, which reflect variation not explained by g. Loadings on g and disturbances for the first-order factors are estimated controlling for measurement error, which is accounted for by the error terms for observed variables, or the indicators of first-order factors.

[Figure 14.7 panels: (a) Second-order (hierarchical), with general factor g causing first-order factors A–C, whose indicators are X1–X9; (b) Bifactor (canonical), with general factor g loading directly on X1–X9 along with domain-specific factors A–C.]

FIGURE 14.7. A second-order hierarchical model (a) and a bifactor model (b) shown with compact symbolism for indicator error terms and factor disturbances.
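
A minimal lavaan sketch of the two specifications in Figure 14.7 follows, with hypothetical indicator names and data frame; in the bifactor fit, the arguments orthogonal = TRUE and std.lv = TRUE impose the canonical constraints (all factors uncorrelated, latent variances fixed to 1.0):

library(lavaan)

# (a) Second-order model: g has the first-order factors A-C as
# its indicators
second_order <- '
  A =~ x1 + x2 + x3
  B =~ x4 + x5 + x6
  C =~ x7 + x8 + x9
  g =~ A + B + C
'
fit_so <- cfa(second_order, data = mydata)

# (b) Canonical bifactor model: g is a direct common cause of all
# indicators; A-C are orthogonal domain-specific factors
bifactor <- '
  g =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
  A =~ x1 + x2 + x3
  B =~ x4 + x5 + x6
  C =~ x7 + x8 + x9
'
fit_bf <- cfa(bifactor, data = mydata,
              orthogonal = TRUE, std.lv = TRUE)

# As described later in this section, the second-order model is
# nested under the bifactor model, so the two can be compared with
# the chi-square difference test:
anova(fit_so, fit_bf)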


To identify a second-order CFA model with a single general factor, there must be at least three first-order factors or their disturbances may not be identified without imposing additional constraints. That is, Rule 14.1 for first-order factors applies to the general factor in Figure 14.7(a), which has three indicators (factors A–C). The general factor in the figure is scaled by fixing the loading of first-order factor A to 1.0 (i.e., A is the reference variable for g). Another option is to fix the variance of g to 1.0 (standardize it), which leaves all direct effects of g on first-order factors as free parameters. Either way of scaling g in a single-sample analysis is probably fine, but it is usually inappropriate to standardize factors in multiple-group analyses. There are examples of second-order CFA models in assessment research where a single general factor is conceptualized as a superordinate cognitive ability factor that affects more specific skills such as verbal reasoning or visual–spatial analysis (Williams et al., 2010). It is also possible to analyze second-order models with two or more general factors that covary (e.g., Brown, 2015, pp. 288–297), but each second-order factor must have at least two first-order factors as indicators (i.e., Rule 14.1 still applies). There are also examples of CFA models with third-order factors (Lee & Stankov, 2013), but such analyses are relatively rare in the literature.

Bifactor models—also called nested-factor models or general-specific models—may be especially well suited for situations where a set of indicators is (1) believed to be mainly unidimensional, but (2) there are also secondary, domain-specific factors of substantive interest that should also be represented in the analysis (Brown, 2015; Chen et al., 2006). A bifactor model has a general factor that represents the primary domain of interest. But unlike a general factor in a second-order model, the general factor in a bifactor model is specified as (1) a direct common cause for all indicators, but (2) it has no causal effects on domain-specific factors. Thus, the general factor in a bifactor model does not explain the association of the domain-specific factors.

An example of a bifactor model is presented in Figure 14.7(b), where all observed variables load on a general factor (g) and nonoverlapping subsets of indicators load on three domain-specific factors, A–C. The form of the model in the figure is canonical in that it assumes orthogonal relations among the domain-specific factors and the general factor (Chen et al., 2006). This specification partitions indicator variance into three nonoverlapping sources, the domain-specific factors, the general factor, and unique indicator variance (i.e., their error terms). An extension of Figure 14.7(b) allows the domain-specific factors to covary with each other but not with the general factor. This specification represents the contribution of domain-specific factors as above and beyond that of the general factor, just as in the canonical model. It may also be possible to specify that the domain-specific and general factors covary, but now indicator variation is not uniquely attributable to domain-specific factors versus the general factor.

Unlike second-order models, all factors in bifactor models are exogenous, and variation among domain-specific factors is not explained by the general factor. A related difference is that domain-specific factors in bifactor models do not mediate effects of the general factor on observed variables. First-order factors in second-order models are intervening variables that lie along indirect causal pathways from the general factor to the indicators, and their status as mediators implies that first-order factors are always endogenous (see Figure 14.7(a)). Because the general factor in a second-order model has no indicators, it may be more difficult to interpret than the general factor in a bifactor model, where all observed variables are its indicators (Gignac, 2008). Another advantage of canonical bifactor models is that the predictive validity of domain-specific factors, which are independent of the general factor, can be directly estimated by "embedding" a bifactor model in a larger SR model where outcomes are predicted by domain-specific versus general factors (Chen et al., 2006). It is trickier to do so for second-order models where the general and first-order factors overlap. This is because the general factor must be held constant when estimating external validity of the first-order factors in a hierarchical model (Chen et al., 2006; Gignac, 2008). For more information about the specification and analysis of bifactor models, see Reise et al. (2023).

It is also true that second-order models for the same data can be expressed as more constrained versions of bifactor models; that is, the models are nested (hierarchically related), so they can be directly compared with the standard chi-square difference test. The method relies on the Schmid–Leiman (SL) transformation (Schmid & Leiman, 1957). In EFA, it transforms an oblique factor solution with correlated factors to a hierarchical model where the first-order factors are residualized, or rendered orthogonal to second-order (general) factors through proportionality constraints. Thus, SL transformation makes explicit a hierarchical structure implied by correlated factors. It has the advantage


that explained variation among the indicators is more readily partitioned between first-order and general factors. Yung et al. (1999, p. 116) demonstrated how bifactor models for the same data can be constrained through a generalized form of SL transformation with no proportionality constraints to generate second-order models; that is, second-order models are nested within bifactor models. Thus, the two models can be compared with the standard chi-square difference test. Chen et al. (2006) offered detailed advice along these lines and demonstrated the implications of fitting bifactor models versus second-order models to data from rating scales about general quality of life (QOL) versus degree of satisfaction in more specific areas, such as mental health and physical health, among other areas. Bifactor models have become "popular" in that there are now many examples in the literature. The choice between a second-order model and a bifactor model should be guided by a priori considerations, or theories of measurement in a particular research area. This means that researchers should avoid "shopping" for a measurement model that fits the data in a particular sample. Instead, the researcher should consult relevant theory and empirical studies about the dimensionality of target concepts and a particular set of indicators—see Dunn and McCray (2020) for more information and examples. Flora (2020) described a special form of the omega reliability coefficient for bifactor models, and Reise et al. (2010) outlined options for analyzing exploratory bifactor models where indicators have cross-loadings on multiple domain-specific factors or where those factors are assumed to covary.

A more controversial type of bifactor model for indicators that all rely on the same method features the specification of a general factor that purportedly represents common method variance as a latent variable (Podsakoff et al., 2003, p. 891; Widaman, 1985, Model 3B). A problem with this type of model is that the researcher is unable to identify or describe the specific cause or type of method variance. Antonakis et al. (2010) put it like this: Without also analyzing indicators or markers of specific kinds of common method bias, a single general method factor may have no meaningful interpretation. An example of a marker for a social desirability method factor is the Marlowe–Crowne Social Desirability Scale (Crowne & Marlowe, 1960). In this case, the method factor represents a social desirability response set, or the tendency for respondents to present themselves in a favorable light based on perceived social norms, regardless of the truthfulness of those responses. Multiple markers should also reflect various facets of common method bias—see Podsakoff et al. (2003, pp. 881–883) for descriptions of possible sources of method-related bias—and in CFA they would be specified as indicators of a method factor. See Richardson et al. (2009) for more information and examples.

SUMMARY

The CFA technique analyzes reflective measurement models where common factors are proxies for theoretical variables. These models are restricted in that the researcher must specify in advance the number of factors, the correspondence between factors and indicators, and error covariance patterns, if any. Basic CFA models feature continuous indicators each of which depends on just a single factor with independent errors. This combination specifies unidimensional measurement, and the evaluation of basic CFA models with multiple factors tests the hypotheses of convergent validity and discriminant validity. It is also possible to test CFA models with error covariances or indicators that load on more than one factor, but it is more challenging to determine whether such models are identified compared with basic models. Technical problems in the analysis such as nonconvergence of iterative estimation or inadmissible solutions are more likely in smaller samples, especially when some factors have just two indicators. Respecification can be challenging because there might be many possible changes that could be made to a given CFA model. Another problem is that of equivalent CFA models. The best way to deal with both of these challenges is to rely more on substantive knowledge than on statistical considerations in the analysis. The next chapter concerns SR models where directional causal effects are specified between certain pairs of common factors.

LEARN MORE

Brown (2015) is a comprehensive resource for CFA with many research examples. The shorter introduction by Roos and Bauldry (2022) is not as in-depth but covers a wide range of topics in clear, easy-to-understand language. Results of computer simulations by Ondé and Alvarado (2020) provide a cautionary tale against relying on arbitrary rules or thresholds about sample size, the number of indicators per factor,


values of standardized factor loadings, or fixed thresholds for global fit statistics that supposedly indicate "good" model fit in CFA. Rönkkö and Cho (2022) outline alternative definitions and methods for evaluating discriminant validity when multiple methods are not available or when variables are all measured on a single occasion.

Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). Guilford Press.

Ondé, D., & Alvarado, J. M. (2020). Reconsidering the conditions for conducting confirmatory factor analysis. Spanish Journal of Psychology, 23, Article E55.

Rönkkö, M., & Cho, E. (2022). An updated guideline for assessing discriminant validity. Organizational Research Methods, 25(1), 6–47.

Roos, J. M., & Bauldry, S. (2022). Confirmatory factor analysis. Sage.

EXERCISES

1. Show that dfM = 0 for a basic single-factor CFA model with 3 indicators using the reference variable method to scale the factor.

2. For a basic single-factor model with 2 indicators, prove that dfM = –1. Assume the reference variable method.

3. Prove that dfM = 8 for Figure 14.2(b) in the variance standardization method.

4. Derive the ECI constraint that scales factor B in Figure 14.2(c).

5. Show that dfM = 19 for Figure 14.3.

6. Describe the residuals in the lavaan analysis for fitting a single-factor model to the data for the KABC-I (see analysis 1, Table 14.1).

7. Calculate structure coefficients (i.e., predicted correlations) between each factor and the indicators of the other factor, given the results in Table 14.3.

8. Show that the estimated variance for the sequential factor of 2.838 in Table 14.3 corresponds to the unique (unexplained) variance for its reference variable, Hand Movements (see analysis 2, Table 14.1).


Appendix 14.A

Identification Rules for Correlated Errors or Multiple Loadings

The identification rules described next correspond to Conditions B, D, and E in Kenny et al. (1998, pp. 253–254). The same authors also described additional identification heuristics for exceptions to these rules not covered here. Rule 14.2 spells out conditions that must be satisfied by each factor (Rule 14.2a), pair of factors (Rule 14.2b), and indicator (Rule 14.2c) to identify error correlations in CFA models. Rule 14.2a is a requirement for a minimum number of indicators with no error correlations per factor—either two or three depending on the pattern of error correlations or constraints imposed on factor loadings. Rule 14.2b refers to the specification that for every pair of factors, there must be at least two indicators, one from each factor, whose error terms are not correlated. Rule 14.2c concerns the requirement for every indicator that there is at least one other indicator in the model with which it does not share an error correlation. Rule 14.2, as a whole, assumes that all factor variances are free parameters and that there are multiple indicators of every factor.

RULE 14.2 For a CFA model with error covariances to be identified, all three conditions listed next must be satisfied:

(Rule 14.2a) For each factor, at least one of the following is true:
1. There are at least three indicators whose errors are uncorrelated with each other.
2. There are at least two indicators whose errors are uncorrelated and either
   a. the errors of both indicators are not correlated with the error term of a third indicator for a different factor, or
   b. an equality constraint is imposed on the loadings of the two indicators

(Rule 14.2b) For every pair of factors, there are at least two indicators, one from each factor, whose error terms are uncorrelated

(Rule 14.2c) For every indicator, there is at least one other indicator in the model with which its error term is not correlated

Rule 14.3 covers sufficient requirements for identifying multiple loadings for the same indicator. Basically, this rule requires that each factor on which an indicator loads has a minimum number of indicators without error correlations (i.e., each factor meets Rule 14.2a). The same rule also requires that each of every pair of such factors has an indicator that does not share an error correlation with the target indicator (i.e., the one with multiple factor loadings). If an indicator with loadings on multiple factors shares error correlations with other indicators, then the additional requirement stated as Rule 14.4 must also be satisfied. This rule requires that for each of multiple factors on which an indicator depends, at least one indicator with a single factor loading does not share an error correlation with the target indicator. The requirements of Rules 14.3 and 14.4 are usually addressed by specifying a sufficient number of indicators that depend on just one factor.

RULE 14.3 For multiple loadings to be identified, both of the following must hold:
1. Each factor on which the indicator loads must satisfy Rule 14.2a for a minimum number of indicators with no error correlations
2. Every pair of those factors must satisfy Rule 14.2b that each factor has an indicator that shares no error correlation with a corresponding indicator on the other factor of that pair

RULE 14.4 For error correlations that involve indicators with multiple loadings to be identified, both of the following must be true:
1. Rule 14.3 is satisfied
2. For each of the multiple factors on which the indicator depends, there must be at least one indicator with a single loading that does not have an error correlation with the target indicator

Let's apply the identification heuristics just considered to the CFA models in Figure 14.8. To save space, I use very compact symbolism in the figure, with indicators designated as X and factors as A, B, or C, and variance parameters assumed for every exogenous variable in the figure.


Scaling constants are also not shown in the figure, but they are assumed. The single-factor, four-indicator model in Figure 14.8(a) has two error covariances, or

E2 ↔ E4 and E3 ↔ E4

The model is just-identified because dfM = 0, its factor (A) has at least three indicators whose error terms are uncorrelated (X1–X3) (Rule 14.2a), and all other requirements of Rule 14.2 (Table 14.8) are met.

The single-factor, four-indicator model in Figure 14.8(b) also has two error covariances, but in a different pattern, or

E1 ↔ E2 and E3 ↔ E4

Although this model has at least two indicators whose error terms are independent, such as X2 and X3, it nevertheless fails Rule 14.2a because factor A does not have three indicators whose error terms are independent of each other. There are also no other factors in the model, so the alternative requirement in Rule 14.2 that factor A have at least two indicators with unrelated error terms and the errors of both of those indicators do not covary with the error term of a different factor does not apply; therefore, Figure 14.8(b) is not identified. But this model would be identified if an equality constraint were imposed on the pattern coefficients of X2 and X3. That is, the specification that

A → X2 = A → X3

would be sufficient to identify the model in Figure 14.8(b) because then Rule 14.2 would be met (Kenny et al., 1998).

The two-factor, four-indicator model in Figure 14.8(c) with a single error correlation is just-identified because dfM = 0 and all three requirements for Rule 14.2 are satisfied (e.g., each factor has two indicators that share no error correlation with the indicator of another factor). But the two-factor, four-indicator model in Figure 14.8(d) with a different error correlation is not identified because factor B in this model does not have two indicators with unrelated error terms. It is generally easier to identify cross-factor error correlations (e.g., Figure 14.8(c)) than within-factor error correlations (e.g., Figure 14.8(d)) when there are only two indicators per factor without imposing additional constraints.

The three-factor, six-indicator model in Figure 14.8(e) with two cross-factor error correlations is overidentified because the degrees of freedom are positive (dfM = 4) and Rule 14.2 is satisfied. This model also demonstrates that adding indicators—along with a third factor in this case—identifies additional error correlations compared with the two-factor model in Figure 14.8(c). The model in Figure 14.8(f) has an indicator that loads on two factors. Because this model meets the requirements of Rule 14.3 and has positive degrees of freedom (dfM = 3), it is identified.


[Figure 14.8 panels: (a) Identified; (b) Not identified; (c) Identified; (d) Not identified; (e) Identified; (f) Identified.]

FIGURE 14.8. Identification status of confirmatory factor analysis models with correlated errors or indicators that load on multiple factors. All models shown in very compact graphical symbolism where scaling constants (1) and variance parameters for factors and error terms are assumed.



15

Structural Regression Models

The most common kind of model in SEM that approximates theoretical concepts with common factors is a structural regression (SR) model, also called a latent variable path model, or a full LISREL model from the time when LISREL was one of the first computer programs to analyze such models; today, any modern SEM computer tool can do so. The structural part of an SR model represents hypotheses about direct or indirect effects among observed variables or common factors, and the measurement part represents the correspondence between common factors and their indicators. The capability to test hypotheses about both structural and measurement relations or about error covariances within a single model affords much flexibility. The specification of SR models with continuous indicators and requirements for their identification is considered first. Next, two different strategies are outlined for analyzing full SR models where every variable in the structural model is a common factor and each has multiple indicators. The strategies address the problem of how to locate the source(s) of specification error by separating the evaluation of the measurement part of the model from analysis of its structural part. The detailed example for this chapter follows the two-step approach just mentioned. Also discussed in this chapter are partial SR models with single indicators for some, but not all, variables in the structural part of the model. A method for specifying the model that explicitly controls for measurement error in single indicators yet does not affect model fit is also explained. Reflective measurement is assumed in this chapter, but the next chapter deals with the analysis of formative measurement models in techniques for composite SEM.

FULL SR MODELS

Presented in Figure 15.1(a) is a manifest-variable path model depicted in full McArdle–MacDonald RAM graphical symbolism. Exogenous variable X1 is assumed to be measured without error, an assumption often violated in practice. This assumption is not required for the endogenous variables in this model, but random error in Y1 or Y4 is manifested in their disturbances. Figure 15.1(b) is a full SR model with both structural and measurement components. Its measurement model has the same three indicators represented in the path model, X1, Y1, and Y4. Unlike the path model, though, each of these three indicators in the SR model is specified as one among multiple indicators for a common factor. Consequently, all observed variables in Figure 15.1(b) have error terms.

The structural model of Figure 15.1(b) represents the same basic pattern of direct and indirect causal effects as the path model of Figure 15.1(a) but among common factors, or

A → B → C

The structural model just listed is recursive, but it is also generally possible to specify an SR model with a nonrecursive structural component. Each endogenous factor in Figure 15.1(b) has a disturbance (DB, DC).


FIGURE 15.1. Examples of a manifest-variable path model (a) and a corresponding full structural regression model with multiple indicators for each common factor in the structural part (b), shown in full McArdle–MacDonald RAM graphical symbolism.

Unlike in manifest-variable path models, disturbances for endogenous common factors in Figure 15.1(b) reflect only omitted causes and not also measurement error in the factor's indicators, which are represented in the measurement part of the model. For the same reason, estimates of the coefficients for the paths

A → B and B → C

in Figure 15.1(b) are adjusted for measurement error, but those for the paths

X1 → Y1 and Y1 → Y4

in Figure 15.1(a) are not so adjusted. When means are not analyzed, observations and parameters are counted for SR models in the same ways as they are for path models and CFA models (i.e., Rules 7.1, 7.2, and 14.1). Exercise 1 asks you to verify that dfM = 25 for Figure 15.1(b).

Because dfM ≥ 0 and all unmeasured variables—factors, indicator error terms in the measurement part, and disturbances in the structural part—are scaled (through ULI constraints), Figure 15.1(b) meets the necessary, but insufficient, requirements for identification. A sufficient condition that can prove identification for full SR models is the two-step identification rule (Anderson & Gerbing, 1988; Bollen, 1989). The rule's name hints at its use: Evaluation of whether full SR models are identified is conducted separately for each part of the model, measurement and structural. A theme of this evaluation is that a valid (i.e., identified) measurement model is needed before it would make sense to assess the structural part of a full SR model.


It reflects the view that the analysis of an SR model is essentially a path analysis conducted with the estimated variances and covariances among the common factors. Thus, it must be possible for the computer to derive a unique, positive definite factor covariance matrix before coefficients for direct effects between factors can be estimated. The two-step identification rule is stated next:

RULE 15.1 A full SR model is identified when means are not analyzed if

1. the measurement part respecified as a CFA model is identified (evaluate the CFA model against Rules 14.1–14.4); and

2. the structural part is identified (evaluate the structural model against Rule 7.4)¹

¹ Rule 7.4 concerns recursive structural models (i.e., they are identified). If the structural part of an SR model is nonrecursive, then evaluate its identification status against the rules or graphical criteria outlined in Chapter 19.

The two-step identification rule is a sufficient condition: Full SR models that satisfy both parts of Rule 15.1 are identified. Evaluation of Rule 15.1 is demonstrated next for Figure 15.2(a), which is the same model as in Figure 15.1(b) but now depicted using more compact graphical symbolism. As mentioned, dfM = 25 (i.e., ≥ 0) and all unmeasured variables are scaled in Figure 15.2(a), but these facts are insufficient to prove identification. To find out, we apply the two-step rule: The respecification of the original full SR model as a CFA measurement model is presented in Figure 15.2(b). Because this basic CFA model has at least two indicators per factor, it is identified. The first part of the two-step rule is met. The structural part of the original SR model is shown in Figure 15.2(c). Because the structural part viewed as a path model is recursive, it, too, is identified, and because the original SR model in Figure 15.2(a) meets both parts of the two-step rule, it is identified—specifically, overidentified.

TWO-STEP MODELING

Suppose that a researcher specified the full SR model in Figure 15.2(a). The data are collected and the researcher uses one-step modeling to analyze this model, which means that its measurement and structural components are estimated simultaneously in a single step (i.e., the model as shown in the figure is analyzed). The results indicate poor fit. Now, where is the model misspecified? The measurement part? The structural part? Or both? With one-step modeling, it can be hard to precisely locate the source of the problem. Two-step modeling by Anderson and Gerbing (1988) parallels the two-step heuristic (Rule 15.1) for the identification of full SR models:

1. In the first step of the analysis, the initial SR model is respecified as a CFA measurement model, which is then analyzed to determine whether it fits the data. If the fit of this CFA model is poor, then not only may the researcher's hypotheses about measurement be wrong, but also the fit of the original SR model may be even worse if its structural model is overidentified. Suppose that the fit of the three-factor CFA model in Figure 15.2(b) is poor. This model has three paths among the factors that represent all possible factor covariances. In contrast, the structural part of the initial SR model in Figure 15.2(a) has only two paths that represent presumed direct effects. If the fit of the CFA model with three paths among the factors is poor, then the fit of the SR model with only two paths may be even worse. The first step thus involves finding an adequate measurement model.

2. Given a retained CFA measurement model from the first step, the second step compares the fits of the original SR model (with modifications to its measurement part, if any, from the first step) and those with different structural models to one another and the fit of the CFA model with the chi-square difference test. (This assumes that hierarchically related models are compared.) Here is the procedure: If the structural part of an SR model is just-identified, the fits of the SR model and the CFA respecification of it are identical because these models are equivalent. For example, if the path A → C were added to the SR model of Figure 15.2(a), then it would have just as many degrees of freedom, or dfM = 24, as does the CFA model of Figure 15.2(b). (You should verify this statement.) The original SR model of Figure 15.2(a) with its overidentified structural part is thus nested under the CFA model of Figure 15.2(b). But it may be possible to trim a just-identified structural part of an SR model without appreciable deterioration in fit. Structural portions of SR models are respecified according to the same general principles as in manifest-variable path analysis.
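To make the two steps concrete, the following is a minimal lavaan sketch for the model in Figure 15.2; the data frame dat and all variable names here are hypothetical, not the book's script files, and estimation options are left at their defaults:

library(lavaan)

# Step 1: measurement model, as in Figure 15.2(b); all factors covary
cfa.model <- '
  A =~ X1 + X2 + X3
  B =~ Y1 + Y2 + Y3
  C =~ Y4 + Y5 + Y6
'
# Step 2: original SR model, as in Figure 15.2(a)
sr.model <- '
  A =~ X1 + X2 + X3
  B =~ Y1 + Y2 + Y3
  C =~ Y4 + Y5 + Y6
  B ~ A
  C ~ B
'
fit.cfa <- cfa(cfa.model, data = dat)  # retain an adequate measurement model first
fit.sr  <- sem(sr.model,  data = dat)  # then test the structural hypotheses
anova(fit.sr, fit.cfa)                 # chi-square difference test (nested models)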


FIGURE 15.2. Evaluation of the two-step rule for identification of a full structural regression model: (a) original full SR model; (b) respecified as a CFA model; (c) structural model. Presented in compact graphical symbolism for indicator error terms in the measurement part and disturbances in the structural part.

There are some challenges when applying two-step modeling. One is that possibilities for respecification include all the basic options in CFA for the measurement model plus all those in path analysis for the structural model. A second, related obstacle is the garden of forking paths (Gelman & Loken, 2014), where each respecification decision can lead to multiple options for further changes to the original model (Chapter 11). These possibilities can expand geometrically for very complex SR models. A third complication is that the method capitalizes on chance variation when hierarchically related models are tested and respecified using the same data. Without fitting any retained SR model to new data, it is unknown whether that model will replicate. The only way for researchers to avoid data-dependent respecification or getting lost in a maze of respecification choices is to rely more on theory and results from prior studies for guidance.²

² This is true for any kind of structural equation model, not just SR models, and also true for other SR model testing strategies, not just two-step modeling.

Equivalent models are a fourth challenge.


It was mentioned that a CFA model at Step 1 and a corresponding SR model with a just-identified structural part at Step 2 will be equivalent. Thus, the only basis for preferring the SR model with stronger hypotheses about causal effects among factors over the CFA model with no causal effects (only covariances) among the same factors is rational (i.e., argument, not analysis) or is supported by study design (e.g., temporal precedence in measurement). Equivalent models can also exist within each step; specifically, there could be (and probably are) equivalent versions of a CFA model retained at Step 1 (Chapter 14) just as there could be versions of an SR model retained at Step 2 with equivalent structural models while holding constant the measurement part of the model (Chapter 11). Thus, reports from two-step modeling should (1) acknowledge the existence of equivalent models at each step and (2) explain why any retained models were preferred over equivalent versions.

Suppose that the CFA model analyzed at Step 1 in two-step modeling does not fit the data. Fan and Hancock (2006) cautioned against viewing the measurement model as something that can be easily "patched up" through post hoc addition of cross-loadings or error covariances until fit is satisfactory while having no effect on the integrity of parameters at Step 2, or path coefficients for direct effects among factors. That is, the measurement part of an SR model is not just a vehicle to get to estimates for the model's structural part. In computer simulations, Fan and Hancock (2006) estimated the impact of adding cross-loadings or error covariances to sample measurement models when the population model had no such features, as summarized next.

Over conditions that included varying magnitudes of standardized factor loadings (.4, .6, .8), number of indicators per factor (3, 5, 7), and sample size (N = 200, 400, 800) for a five-factor measurement model, Fan and Hancock (2006) reported that the extent of parameter bias in the structural model was relatively small, but convergence failure was somewhat more likely in conditions with the smallest numbers of indicators per factor, factor loadings, and sample size. Superfluous cross-loadings tended to reduce structural parameter estimates slightly; intrafactor error covariances inflated estimates slightly; and interfactor error covariances had basically little, if any, distorting effects. Overall, the degree of bias in standard errors in the structural model due to overparameterization of the measurement model was about 5%. These results should not be taken as license to respecify measurement models in any way that improves fit. Instead, it is reassuring that including excessive cross-loadings or error covariances in measurement models that are otherwise correct does not substantially bias estimates of causal effects between factors.

If a measurement model is retained at Step 1 in two-step modeling, there are two basic options for testing the structural model at Step 2: (1) The initial structural model is more sparse (overidentified) in model building (forward search), and free parameters that correspond to direct effects or disturbance covariances are added to the model over a series of comparisons with the chi-square difference test. (2) In model trimming (backward search), the initial structural model is more complex, such as just-identified with all possible paths, and selected free parameters are constrained (usually to zero) over subsequent comparisons of respecified models.
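For instance, a trimming comparison in the hypothetical lavaan sketch for Figure 15.2 given earlier might look like the lines below; adding the path A → C makes the structural part just-identified, so the sparser original model is tested against it:

# Sketch: free one more structural parameter, then compare nested models
sr.model2 <- paste(sr.model, 'C ~ A', sep = '\n')
fit.sr2 <- sem(sr.model2, data = dat)
lavTestLRT(fit.sr, fit.sr2)  # does constraining A -> C to zero worsen fit?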
Assuming a correct measurement model, Chou and Bentler (2002) noted that the likelihood of detecting the SR model with the true structural component depends very much on the initial model specified in model building. The reason is that the more the initial model departs from the true model, the less likely it is that the proper respecification will occur. Model building can also "lock in" an incorrect specification early in the modification process. In contrast, starting with an adequate but overparameterized model may lead to more correct respecifications in model trimming. In computer simulations, Chou and Bentler (2002) compared the rates of successful specification searches in model building versus trimming for SR models with different numbers of exogenous versus endogenous factors, all with two indicators per factor. Causal directionality in all tested structural models was correct, which is typically unknown in analyses of real data. Given this limitation, Chou and Bentler (2002) found that the correct model was detected about 60% of the time in model trimming. The success rate in model building was lower, about 20%, but hit rates in both methods improved appreciably when respecification was guided by a priori information.

As alternative versions of the structural part of the model are tested, researchers should observe only slight changes in the loadings of indicators on their respective common factors. If so, then the assumptions about measurement may be invariant to changes in the structural part of an SR model. But if factor loadings change appreciably when different structural models are specified, then (1) the measurement model is clearly not invariant, and (2) interpretational confounding is a risk (Burt, 1976). This means that the empirical meaning of a common factor, or the values of loadings for its indicators, do not remain constant as either causes or outcomes of that factor are varied in the structural model. Appreciably unstable loadings in Step 2 of two-step modeling jeopardize consistent interpretation of what the indicators measure. It is generally easier to detect interpretational confounding in two-step modeling than in one-step modeling.


Some methodologists have expressed the concern that values of standard global fit statistics (Chapter 10) are influenced more by the measurement component of SR models than by the structural part of these models (Anderson & Gerbing, 1988). One reason is that the degrees of freedom for the measurement model can far exceed those for the structural model. For example, the structural model in Figure 15.2(c) has only a single degree of freedom, much lower compared with those for the measurement model in Figure 15.2(b), or dfM = 24 as a CFA model. In a review of 14 published studies, McDonald and Ho (2002) calculated separate values of the RMSEA for the structural and measurement parts of SR models. They found that evidence for misspecification in the structural part was generally negated by better fit of the measurement part of retained SR models.

O'Boyle and Williams (2011) described the root mean square error of approximation of the path component (RMSEA-P) with its 90% confidence interval, which estimates the fit of just the structural component of an SR model. It is calculated based on the difference between the chi-square and degrees of freedom for the whole SR model, with both its measurement and structural parts, and the corresponding values for the measurement part expressed as a CFA model, such as Figure 15.2(a) versus Figure 15.2(b). So defined, its formula has the same general form as that for the standard RMSEA (Equation 10.11). An online calculator for the RMSEA-P is freely accessible.³

³ https://fgoeddeke.shinyapps.io/rmseap/
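The published formula is not reproduced here, but a minimal computational sketch in R that is consistent with the definition just given follows; the function name and arguments are hypothetical, and the check at the end uses the chi-squares and degrees of freedom from the detailed example reported later in this chapter (Table 15.3):

# RMSEA-P from the whole SR model versus its CFA measurement model
rmsea_p <- function(chisq_sr, df_sr, chisq_cfa, df_cfa, n) {
  chisq_d <- chisq_sr - chisq_cfa  # chi-square difference
  df_d    <- df_sr - df_cfa        # df difference
  sqrt(max(0, (chisq_d - df_d) / (df_d * (n - 1))))
}
rmsea_p(49.747, 39, 16.212, 38, 158)  # about .455, as reported in Table 15.3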
In their review of 43 published analyses, O’Boyle and tory analyses over possibly a longer series of studies
Williams (2011) found that RMSEA-P values for the (Mulaik & Millsap, 2000). It assumes that each com-
structural model only were just as favorable as RMSEA mon factor has at least four indicators, or a sufficient
values for the whole SR model and for the correspond- number to test for unidimensionality with the vanish-
ing CFA measurement model in just a handful of stud- ing tetrad test (Equation 14.5). Four indicators are also
ies, about 5 or so. In about 20 studies, the differences the minimum number so that a basic single-factor CFA
were more striking in that evidence for fit of the struc- model is overidentified. The researcher tests a sequence
of at least four hierarchically related models. As in two-
3 https://fgoeddeke.shinyapps.io/rmseap/ step modeling, if the fit of a model in four-step model-


As in two-step modeling, if the fit of a model in four-step modeling with fewer constraints is poor, then a model with even more constraints should not even be considered. The basic steps are outlined next:

1. The least restrictive model specified at the first step is an EFA model where each indicator loads on all factors and the number of factors is the same as for models analyzed in subsequent steps. This model should be analyzed with the same estimation method, such as default ML when the indicators are continuous and normally distributed, as used in subsequent steps. The techniques of ESEM or E/CFA could be used instead of EFA at this step. This first step is intended to test the provisional correctness of hypotheses regarding the number of factors, but it cannot confirm that hypothesis if the model fit is adequate (Hayduk & Glaser, 2000).

2. Step 2 in four-step modeling corresponds to Step 1 of two-step modeling: A CFA model is specified where some cross-loadings are fixed to zero, which designates those indicators that do not depend on particular common factors. If the fit of the CFA model at this step is reasonable, then it is possible to go on to test an SR model; otherwise, the measurement model should be revised.

3. In Step 3, the target SR model is specified with the same pattern of fixed-to-zero cross-loadings as represented in the CFA model from Step 2. In a typical sequence, the structural part of the SR model will include fewer direct effects or unanalyzed associations than the total number of pairwise factor covariances in the CFA model from Step 2. But if the structural part of the SR model has as many paths as the CFA model, then the fit of the two models, at Steps 2 and 3, will be identical (i.e., they are equivalent). In this case, Step 3 would be skipped (go to the next step).

4. The last step (Step 4) involves tests of prespecified hypotheses about parameters freed from the outset of the method (i.e., since Step 1). These tests may involve imposing zero or other constraints that each increase dfM by one. Steps 3 and 4 of four-step modeling are basically a more specific statement of activities that would fall under Step 2 of two-step modeling.

Criticisms of four-step modeling by Hayduk and Glaser (2000) include its reliance on having at least four indicators per factor, which is not always practical or desirable when fewer indicators, including a single best indicator, have better psychometric properties than four indicators. Mulaik and Millsap (2000) noted that having at least four indicators increases dfM, which can offset, to some extent, the limitations of a smaller sample (e.g., estimated power in the MacCallum–RMSEA method increases with higher dfM values).

Both two-step and four-step modeling capitalize on chance variation when models are tested and respecified using the same data. Both methods are better than one-step modeling, where there is no separation of measurement issues from structural issues. Also, neither method is a gold standard for testing SR models, but there really is no such thing (Bentler, 2000). There are also other methods, such as Green et al.'s (2001) adjusted Bonferroni method for eliminating model parameters in forward specification searches; see also Bollen (2000) for more information about testing strategies for SR models.

DETAILED EXAMPLE OF TWO-STEP MODELING IN A HIGH-RISK SAMPLE

Presented in Figure 15.3 is an initial full SR model of scholastic achievement and classroom adjustment among students (mean grade levels of about 7–8) as a function of general cognitive ability and degree of risk for psychopathology. One risk indicator is based on parental diagnosis of a major psychiatric disorder, such as schizophrenia or bipolar disorder, and the second indicator is the degree of low family SES (i.e., higher scores indicate lower SES). Indicators of cognitive ability are scores for verbal reasoning, visual–spatial analysis, and memory from an individually administered IQ test for children. There are two endogenous factors, including scholastic achievement, which is measured by reading, arithmetic, and spelling tasks from a standardized test, and classroom adjustment with three teacher-informant indicators about student motivation, emotional stability, and harmony of social relationships. In the structural model, achievement and classroom adjustment are both specified as caused by cognitive ability and risk, but there is no direct effect or disturbance covariance between the two endogenous factors. That is, the model assumes that any association between the two endogenous factors is explained by their common causes, the two exogenous factors. Exercise 2 asks you to verify that dfM = 39 for Figure 15.3.
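A hypothetical lavaan sketch of this initial model follows; the observed-variable names are merely illustrative labels for the measures in Table 15.2, not the names used in the book's script files:

worland.sr <- '
  # measurement part
  Cognitive   =~ verbal + visual + memory
  Risk        =~ parent_dx + low_ses
  Achievement =~ reading + arithmetic + spelling
  Adjustment  =~ motivation + harmony + stability
  # structural part: four direct effects; sem() also frees the covariance
  # between the exogenous factors (the fifth path among factors) by
  # default, but no disturbance covariance between the endogenous factors
  Achievement ~ Cognitive + Risk
  Adjustment  ~ Cognitive + Risk
'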


Listed in Table 15.1 for analyses 1–2 are annotated script files for analyzing Figure 15.3 in the two-step method. The lavaan (Rosseel et al., 2023) and semTools (Jorgensen et al., 2022) packages are used in both analyses. All input and output files are freely accessible on the website for this book. For pedagogical reasons, I generated the hypothetical correlations in Table 15.2 to match the patterns of association reported by Worland et al. (1984) for the same variables in a sample of N = 158 at-risk students. Worland et al. (1984) did not report standard deviations, so I generated realistic values based on studies conducted with similar measures published around the same time (e.g., Spruill & Beck, 1986).

Step 1 (CFA Model)

For analysis 1 in Table 15.1, the model in Figure 15.3 was respecified as a four-factor basic CFA model with the same pattern of factor–indicator correspondence. In contrast to the original SR model, which has five paths among the four factors, the CFA model has all possible paths (6) among the same four factors (i.e., dfM = 38 for the CFA model). In the MacCallum–RMSEA method, the power of the chi-square test for the CFA model in analysis 1 is low, only .46.⁴ Thus, the likelihood of detecting a measurement model that does not perfectly fit the population data matrix over random samples is just under 50%. The target sample size for power ≥ .90 is N = 352, or over twice as large as the actual sample size (N = 158).

⁴ Parameters: ε0 = 0, ε1 = .05, dfM = 38, N = 158, α = .05.

FIGURE 15.3. Initial full structural regression model of scholastic achievement and classroom adjustment as a function of cognitive ability and risk for psychopathology shown in full McArdle–MacDonald RAM graphical symbolism.


TABLE 15.1. Analyses, Script Files, and Packages in R for Analyses of Full or Partial Structural Regression Models

Analysis                                                   Script file            R packages
1. Step 1 in two-step modeling for a full SR model
   of achievement and classroom adjustment                 worland-sr-step1.r     lavaan, semTools
2. Step 2 in two-step modeling for a full SR model
   of achievement and classroom adjustment                 worland-sr-step2.r     lavaan, semTools
3. Single-factor CFA models for tests of the CSB
   (models 1–5)                                            sauve-csb-1-factor.r   lavaan
4. Partial SR model of cognitive level and symptom
   unawareness with single-indicator specification         sauve-partial-sr.r     lavaan, semTools

Note. Output files have the same names except the extension is ".out." SR, structural regression; CFA, confirmatory factor analysis; CSB, CogState Schizophrenia Battery.

TABLE 15.2. Input Data (Hypothetical Correlations and Standard Deviations) for Analysis
of a Full Structural Regression Model of Achievement and Classroom Adjustment
among Students at Risk for Mental Disorder
Variable 1 2 3 4 5 6 7 8 9 10 11
Cognitive
1. Verbal —
2. Visual   .70 —
3. Memory   .65   .60 —

Achievement
4. Reading   .55   .50   .45 —
5. Arithmetic   .50   .45   .40   .70 —
6. Spelling   .35   .35   .30   .55   .50 —

Adjustment
7. Motivation   .30   .30   .30   .50   .45   .44 —
8. Harmony   .25   .20   .22   .41   .28   .34   .40 —
9. Stability   .35   .32   .32   .48   .45   .42   .60   .45 —

Risk
10. Parent Disorder –.25 –.24 –.22 –.21 –.18 –.15 –.15 –.12 –.17 —
11. Low SES –.22 –.26 –.30 –.25 –.22 –.18 –.17 –.14 –.20   .42 —

SD 13.75 14.80 12.60 14.90 15.25 13.85 9.50 11.10 8.70 12.00 8.50

Note. N = 158.
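One way to enter these summary data in R is sketched next (again, a hypothetical sketch rather than the book's script): lavaan's getCov() builds the covariance matrix from the lower-triangular correlations and the standard deviations in Table 15.2.

covmat <- getCov('
   1
   .70 1
   .65 .60 1
   .55 .50 .45 1
   .50 .45 .40 .70 1
   .35 .35 .30 .55 .50 1
   .30 .30 .30 .50 .45 .44 1
   .25 .20 .22 .41 .28 .34 .40 1
   .35 .32 .32 .48 .45 .42 .60 .45 1
  -.25 -.24 -.22 -.21 -.18 -.15 -.15 -.12 -.17 1
  -.22 -.26 -.30 -.25 -.22 -.18 -.17 -.14 -.20 .42 1',
  sds = c(13.75, 14.80, 12.60, 14.90, 15.25, 13.85,
          9.50, 11.10, 8.70, 12.00, 8.50),
  names = c("verbal", "visual", "memory", "reading", "arithmetic",
            "spelling", "motivation", "harmony", "stability",
            "parent_dx", "low_ses"))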


Analysis of the four-factor CFA model with default ML estimation in Step 1 of the two-step method converged to an admissible solution. Shown in the top part of Table 15.3 are values of selected global fit statistics for the CFA model. The model passes the chi-square test, but power here is low, and we need to consider more information about fit. Values for the RMSEA and CFI are the best-possible results for each index—respectively, 0 and 1.0—and SRMR = .023 is not problematic. The 90% confidence interval for the RMSEA, or (0, 0), is degenerate, which means it falsely suggests that RMSEA = 0 has no sampling error. Confidence interval degeneracy can result when sample estimators are equal or close to the boundaries of their values (Kline, 2013a), which is zero for the lower bound of the RMSEA. None of the absolute correlation residuals exceed .10, and the largest correlation residual is .075 for the spelling and motivation variables—see the output file for analysis 1 (Table 15.1). Likewise, none of the standardized residuals are significant at the .05 level, so local fit seems generally satisfactory. Values of omega reliability coefficients range from .577 for the risk factor to .851 for the cognitive ability factor. Thus, measurement of risk is less precise, but this factor has just two indicators (i.e., information is limited). Given all these results, the basic four-factor CFA measurement model analyzed at Step 1 is retained.
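A sketch of how this first step might be run with the hypothetical objects defined earlier (the options in the book's script files may differ):

worland.cfa <- '
  Cognitive   =~ verbal + visual + memory
  Risk        =~ parent_dx + low_ses
  Achievement =~ reading + arithmetic + spelling
  Adjustment  =~ motivation + harmony + stability
'
wcfa.fit <- cfa(worland.cfa, sample.cov = covmat, sample.nobs = 158)
fitMeasures(wcfa.fit, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))
residuals(wcfa.fit, type = "cor")  # correlation residuals (local fit)
semTools::compRelSEM(wcfa.fit)     # omega reliability (recent semTools versions)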
Step 2 (SR Model)

Results from Step 1 of the two-step method just summarized about the measurement model give the green light (permission to proceed) to directly analyze the original SR model with five paths in Figure 15.3 in Step 2 of the method—see analysis 2 in Table 15.1. This second analysis, also with default ML, converged to an admissible solution. Reported in the lower part of Table 15.3 are values of global fit statistics for the SR model. Although the model passes the chi-square test—χ²ML(39) = 49.747, p = .116—the fit of the SR model with 5 paths among the factors is significantly worse than that of the CFA measurement model with 6 paths, or χ²D(1) = 33.535, p < .001. Although values of the RMSEA, CFI, and SRMR for the whole SR model are not grossly problematic, the value of the RMSEA-P for just the structural part of the model is troubling, or .455, 90% CI (.314, .581).

Local fit of the model in Figure 15.3 is poor, too. For example, absolute correlation residuals for several pairs of indicators for the achievement and adjustment factors exceed .10, and the corresponding standardized residuals are usually significant, too. Selected examples are listed next as indicator pair (achievement factor, adjustment factor), correlation residual, and standardized residual—see the output file for analysis 2 in Table 15.1 for more detailed information about the residuals:

Reading, Motivation, .219, 3.466
Spelling, Motivation, .242, 3.348
Reading, Harmony, .199, 2.903
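In the same hypothetical sketch, the Step 2 analysis and the model comparison reported above reduce to a few lines:

wsr.fit <- sem(worland.sr, sample.cov = covmat, sample.nobs = 158)
lavTestLRT(wsr.fit, wcfa.fit)  # chi-square difference: 33.535, df = 1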

TABLE 15.3. Values of Selected Global Fit Statistics for Two-Step Modeling of a Full Structural Regression Model of Achievement and Classroom Adjustment among Students at Risk for Mental Disorder

Model                 χ²ML     dfM   p      χ²D      dfD   p       RMSEA (90% CI)   RMSEA-P (90% CI)    CFI     SRMR

Measurement
  Four-factor CFA     16.212   38    .999   —        —     —       0 (0, 0)ᵃ        —                   1.000   .023

Structural regression
  Five paths          49.747   39    .116   33.535   1     <.001   .042 (0, .073)   .455 (.314, .581)   .983    .074
  Six paths           16.212   38    .999   —        —     —       0 (0, 0)ᵃ        —                   1.000   .023

Note. The CFA measurement model and the SR model with six paths among factors are equivalent versions. CI, confidence interval.
ᵃ CI is degenerate (i.e., invalid).


Based on all these results about global fit and local fit, the initial SR model in Figure 15.3 with five paths among the factors is rejected.

The absence of a path between the achievement and adjustment factors in Figure 15.3 is clearly inconsistent with the data. The three options to add such a path and keep the structural model recursive are listed next:

Achievement → Adjustment
Adjustment → Achievement
DAch ⟷ DAdj

In other words, add a direct effect between the factors (two options) or allow their disturbances to covary. In my experience, it would be hard to justify either direct effect over the other: Poor scholastic skills could possibly worsen classroom adjustment just as behavioral problems at school could negatively affect achievement. Specification of reciprocal causation, or

Achievement ⇄ Adjustment

would make the structural model nonrecursive, but the model would not be identified without imposing constraints I believe would be unrealistic. We discuss nonrecursive structural models in Chapter 19, but such models have very particular identification requirements.

In this pedagogical example, next we respecify Figure 15.3 by allowing the disturbances of the achievement and adjustment factors to covary. This respecified SR model is an equivalent version of the CFA model analyzed at Step 1: Both models not only have the maximum possible number of paths among the factors (6), but also the values of their fit statistics, residuals, and predicted covariances or correlations are all identical within slight rounding error. They differ only in their assumptions about causal effects among factors, either none (CFA model) or in the patterns represented in the structural part of Figure 15.3 plus a disturbance correlation (SR model).⁵

⁵ I believe it is plausible in this example that general cognitive ability and family-related risk affect both school achievement and classroom adjustment: Values of IQ scores are reasonably stable among older children and adolescents, and family environment predates school enrollment.

The input and syntax files for analysis of the respecified SR model are listed for analysis 2 in Table 15.1. This second analysis in lavaan of an SR model converged to an admissible solution and, as expected, values of fit statistics for the respecified SR model are identical to those for the CFA model analyzed at Step 1—see Table 15.3. Reported in Table 15.4 are parameter estimates for the measurement part of the respecified SR model with standard errors for all results, unstandardized and standardized. All the results in the table—factor loadings, error variances, and variances and covariance for the exogenous factors (Cognitive, Risk)—equal their counterparts in the four-factor CFA model analyzed at Step 1 of two-step modeling (see the output file for analysis 1, Table 15.1). This is expected because the structural part of the revised SR model with 6 paths has zero degrees of freedom. The estimated correlation between the exogenous cognitive ability and risk factors, –.459, makes sense: It is negative (higher risk, lower ability), and its absolute value supports the hypothesis of discriminant validity (i.e., it is not virtually 1.0).

Listed in Table 15.5 are estimates for the structural part of the respecified SR model. Given a 1-point increase in the cognitive factor in its metric (i.e., the common variance of its verbal reasoning indicator) while controlling for the risk factor, the score on the achievement factor is expected to increase by .719 points in its metric (i.e., the common variance of its reading skill indicator). In the standardized solution, an increase of a full standard deviation in cognitive ability predicts an increase in achievement of .657 standard deviations, while controlling for risk. The absolute effect of risk on achievement is relatively smaller: A 1-point increase in the unstandardized metric of risk (i.e., the common variance of its parental disorder indicator) predicts a .175-point decrease in achievement, and an increase of a full standard deviation in risk predicts lower achievement by .100 standard deviations, all while controlling for cognitive ability. Exercise 3 asks you to interpret the standardized path coefficients for effects on classroom adjustment, and Exercise 4 asks you to compute R² for each endogenous factor in Table 15.5. The disturbance correlation of .643 is the estimated partial correlation between achievement and adjustment after controlling for cognitive ability and risk. Thus, at least one common unmeasured cause that is independent of both exogenous factors affects achievement and adjustment in the same direction; that is, an increase in omitted causes leads to increases in both factors.
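In the hypothetical lavaan sketch, the respecification amounts to freeing a single parameter:

# allow the disturbances of the endogenous factors to covary (DAch, DAdj)
worland.sr2 <- paste(worland.sr, 'Achievement ~~ Adjustment', sep = '\n')
wsr2.fit <- sem(worland.sr2, sample.cov = covmat, sample.nobs = 158)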


TABLE 15.4. Maximum Likelihood Estimates for Indicators and Exogenous Factors in a Full Structural Regression Model of Achievement and Classroom Adjustment among Students at Risk for Mental Disorder

                          Unstandardized           Standardized
Parameter                 Estimate     SE          Estimate   SE

Factor loadings
Cognitive
  Verbal                  1.000        —           .868       .032
  Visual                  1.000        .090        .806       .037
  Memory                  .788         .077        .747       .043
Risk
  Parent Disorder         1.000        —           .620       .100
  Low SES                 .773         .224        .677       .104
Achievement
  Reading                 1.000        —           .876       .031
  Arithmetic              .925         .083        .791       .038
  Spelling                .678         .080        .639       .054
Adjustment
  Motivation              1.000        —           .761       .049
  Harmony                 .861         .136        .561       .065
  Stability               .940         .114        .781       .048

Error variances
  Verbal                  46.254       9.782       .246       .056
  Visual                  76.171       12.153      .350       .060
  Memory                  69.732       9.717       .442       .064
  Parent Disorder         87.999       18.345      .615       .124
  Low SES                 38.899       10.207      .542       .141
  Reading                 51.273       11.238      .232       .054
  Arithmetic              86.341       13.049      .374       .060
  Spelling                112.797      14.078      .592       .068
  Motivation              37.737       6.463       .421       .074
  Harmony                 83.953       10.590      .686       .073
  Stability               29.306       5.386       .390       .075

Exogenous factor variances and covariance
  Cognitive               141.612      22.097      1.000      —
  Risk                    55.090       19.991      1.000      —
  Cognitive ⟷ Risk        –40.520      12.448      –.459      .098

Note. Standardized estimates for error variances are proportions of unexplained variance.

PARTIAL SR MODELS WITH SINGLE INDICATORS

There is an alternative to representing a single indicator in the structural part of a partial SR model as one would in a manifest-variable path model (Figure 15.1(a)). It requires an a priori estimate of the proportion of variance in an observed variable that is due to measurement error (.10, .20, etc.). This estimate may be based on the researcher's experience with a measure or on results of prior empirical studies, including estimation of score reliability in the researcher's sample. Recall that one minus a reliability coefficient, 1 – rXX, estimates the proportion of total variance due to the type(s) of measurement error estimated by the method used to generate the coefficient. Because a particular coefficient may estimate only one kind of error, the quantity 1 – rXX may underestimate the full extent of measurement error—see the Psychometrics Primer on this book's website. Another alternative is to conduct a sensitivity analysis where the results are estimated for a range of 1 – rXX values: If the results in model testing are robust over alternative-but-plausible estimates of score precision, then greater confidence in such results may be warranted.

Suppose that X1 is the only measure for an exogenous factor A and that Y1 is the single indicator for endogenous factor B. There are multiple indicators (3) for a second endogenous variable, factor C. Given rXX = .80 and rYY = .75 for, respectively, indicators X1 and Y1, we can say that (1) at least 1 – .80 = .20, or 20%, of the total variance in X1 and (2) at least 1 – .75 = .25, or 25%, of the total variance in Y1 are due to random measurement error. Now we can specify a partial SR model like the one in Figure 15.4. Note in the figure that X1 is specified as a single indicator of factor A and has an error term. The unstandardized error variance is specified as a fixed parameter that equals (1 – rXX) times the observed variance, or .20s²X1, given rXX = .80. For example, if the observed variance of X1 is 30.00, then 20% of this value, or .20(30.00) = 6.00, is specified as the error variance. Because factor A must be scaled, the unstandardized loading for X1 is fixed to equal 1.0. With the specification of an error term for X1, the variance of factor A, its direct effect on factor B, and the disturbance variance for factor B are all estimated controlling for measurement error in the single indicator.


TABLE 15.5. Maximum Likelihood Estimates for the Structural Part in a Full Structural Regression Model of Achievement and Classroom Adjustment among Students at Risk for Mental Disorder

                               Unstandardized          Standardized
Parameter                      Estimate    SE          Estimate   SE

Direct effects
  Cognitive → Achievement      .719        .109        .657       .079
  Risk → Achievement           –.175       .190        –.100      .107
  Cognitive → Adjustment       .261        .070        .431       .103
  Risk → Adjustment            –.146       .127        –.151      .127

Disturbance variances and covariance
  Achievement                  84.318      16.325      .498       .077
  Adjustment                   38.008      8.188       .732       .080
  DAch ⟷ DAdj                  36.374      7.967       .643       .085

Note. Standardized estimates for disturbance variances are proportions of unexplained variance.

The logic for specifying Y1 as a single indicator with an error term for endogenous factor B in Figure 15.4 is similar: Briefly, given rYY = .75 for this indicator, its error variance is fixed to equal .25 times the observed variance of this indicator, or .25s²Y1. Because Y1 has an error term, the direct effects of factors A and B and the disturbance variance for factor C (which has multiple indicators) are all estimated controlling for measurement error. Five points should be mentioned about this method for single-indicator respecification:

1. It does not affect the complexity of the model (i.e., dfM is not changed) or the global model fit.

2. When a manifest outcome variable is measured with error but it has only a disturbance, such as Y1 in the path model of Figure 15.1(a), absolute standardized path coefficients tend to be too small. This happens because (a) measurement error shows up in the disturbance, which (b) increases the overall proportion of unexplained variance. When this error is controlled, such as for Y1 in Figure 15.4, absolute standardized path coefficients tend to increase because the proportion of unexplained variance is lower in the respecified model. But unstandardized path coefficients are unchanged by the specification of separate measurement error and disturbance terms for a single indicator. An exercise in the next section demonstrates the pattern of expected results just described.

FIGURE 15.4. Example of a partial SR model with single indicators—respectively, X1 and Y1—of an exogenous concept and an endogenous concept approximated by, respectively, factors A and B. The fixed error variances are (1 − rXX)s²X1 and (1 − rYY)s²Y1. Shown in compact graphical symbolism for measurement errors and disturbances. rXX, rYY are, respectively, reliability coefficients for X1 and Y1.


3. A common question is, why not just specify the error variance for a single indicator as a free parameter and let the computer estimate it? Such a specification may result in an identification problem. That is, whether the measurement error variance for a single indicator is identified depends on other features of the model, such as whether the structural model is recursive or nonrecursive (Bollen, 1989, pp. 172–175). It is generally safer to fix the error variance to a constant based on an a priori estimate—see Hayduk and Littvay (2012) for examples.

4. A manifest variable path model such as Figure 15.1(a) can be respecified to control for measurement error in every single indicator. This tactic is akin to fitting a path model to a data matrix based on correlations disattenuated for unreliability (see the Psychometrics Primer on this book's website).

5. The method described to this point assumes independent measurement errors. Possible consequences of neglecting to also control for correlated measurement error and a suggested remedy are discussed in Topic Box 15.1.
error and a suggested remedy are discussed in Topic model included two factors (speed, memory) each with
Box 15.1. three indicators and other observed variables treated as
single indicators (Brydges et al., 2017, p. 370). Results
Single indicators in partial SR models are not always from analyses of the measurement model just described
measures of substantive concepts. Covariates included informed the specification of partial SR models tested
in the model to control bias can also be viewed as sin- by these authors at a second analysis step.
gle indicators: They are observed variables on which Brown (2015, p. 124) described the analysis of mea-
common factors are regressed and, thus, covariates are surement models where covariates are included as
part of the structural component of SR models. Hayduk single indicators, including a model with two common
and Littvay (2012) recommended the single-indicator factors each with three indicators and two covariates,
respecification that assigns error terms to demographic age and general health. An advantage of including
variables because such variables are sometimes mea- covariates in a measurement model at Step 1 of two-
sured with error, such as when a participant by acci- step modeling is that it can help to avoid specification
dent or on purpose reports the wrong age. Specifying a error when covariates are added to the model only at
small, nonzero error variance, such as .05, or 5% of the Step 2. For example, a covariate may have poor dis-
total variance, is safer than assuming that demographic criminant validity relative to certain factors (i.e., they
or other control variables are perfectly measured. The are not distinct). Another possibility occurs when val-
same authors also noted that among multiple indicators ues of absolute correlations between covariates and fac-
for the same factor, there may be a single best indicator tors are much lower than expected, which may presage
with the greatest relevance to theory. If so, then fixing low predictive power of the covariate in a structural
the error variance of the best indicator to a constant model (it is an ineffective control variable) (Brown,
using the method just described may improve fac- 2015). Other suggestions for controlling measurement
tor measurement compared with freely estimating the through single-indicator respecification in small sam-
error variances of all indicators. This is because freely ples are offered in Chapter 17.


TOPIC BOX 15.1

Controlling for Correlated Measurement Error

The respecification that controls for measurement error in single indicators depicted in Figure 15.4 assumes independent error terms. Next, we consider an example based on Williams et al. (2013) about consequences of failing to also consider correlated measurement error. Suppose that rXX = rYY = .64 for variables X and Y in Figure 15.5(a), which assumes independent error terms. Variable X is the single indicator for factor A, and variable Y is the single indicator for factor B. The factor correlation is .30. For standardized variables, the standardized loadings for both single indicators equal the square root of their reliability coefficients, or .64^(1/2) = .80. Assuming Figure 15.5(a) is the population model, we can use the tracing rule to generate the expected observed correlation between the two indicators, or

rXY = .30(.80²) = .192

Thus, measurement error attenuates rXY compared with the factor correlation of .30 when the errors are independent. The true correlation (i.e., r = .30) can be recovered by applying the classical disattenuation for measurement error to the observed correlation (see the Psychometrics Primer on this book's website), or

r̂XY = .192 / √(.64 × .64) = .30

FIGURE 15.5. Predicted correlations between X and Y for models with independent error terms (a), where rXY = .192 < .30, and with correlated error terms (b), where rXY = .372 > .30. All variables are standardized, and rXX = rYY = .64. Standardized factor loadings are the square root of the reliability coefficient, or .64^(1/2) = .80, and the standardized residual path coefficients for the error terms are the square root of the complement of the reliability coefficients, or (1 – .64)^(1/2) = .60.


The alternative scenario of dependent measurement errors is depicted in Figure 15.5(b), in which the error correlation is .50, not zero. The coefficient next to the symbols for the indicator error terms in the figure, or .60, is the standardized residual path coefficient, which is the correlation between the error terms and the indicators. For standardized variables, it equals the square root of the complement of the reliability coefficient, or (1 – .64)^(1/2) = .60. These residual path coefficients are irrelevant in Figure 15.5(a) with independent error terms, but in Figure 15.5(b) these path coefficients belong to an additional tracing between variables X and Y; specifically, their correlation implied by the model in Figure 15.5(b) is

rXY = .30(.80²) + .50(.60²) = .372

Now rXY is a positively biased estimator of the factor correlation because rXY > r = .30. This bias is made even worse if the researcher were to apply the classical disattenuation for measurement error that assumes independent errors, or

r̂XY = .372 / √(.64 × .64) = .58

which even further overestimates r = .30 compared with rXY.

As noted by Williams et al. (2013), measurement error can bias sample estimates, but not necessarily in a downward direction (underestimation). This is especially true when measurement error terms are not independent, such as in Figure 15.5(b). In this case, blind application of corrections for disattenuation that assume independent errors can actually increase bias instead of reducing it. For single indicators like variables X and Y in Figure 15.4, it is possible to add a parameter for an error covariance, but its value should generally be specified as a fixed nonzero parameter; otherwise, the model may not be identified. Also, parameters are typically fixed to nonzero constants in the unstandardized solution, not in the standardized solution as in Figure 15.5. Suppose that rXX = .75, s²X = 15.00, rYY = .80, and s²Y = 25.00, and the hypothesized correlation between the measurement error terms of X and Y is .30. Given these values, the error covariance for this pair of indicators could be specified as fixed to equal

√((1 − .75)15.00) × √((1 − .80)25.00) × .30 = 1.299

which is the product of the two error standard deviations and the expected error correlation (.30).
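In lavaan syntax, a sketch of this specification (hypothetical names, with the error variances also fixed from the reliability estimates just given) might be:

errcov.model <- '
  A =~ 1*X
  B =~ 1*Y
  X ~~ 3.75*X    # (1 - .75) * 15.00
  Y ~~ 5.00*Y    # (1 - .80) * 25.00
  X ~~ 1.299*Y   # fixed error covariance computed above
'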

EXAMPLE FOR A PARTIAL SR MODEL

The partial SR model analyzed next was introduced in Chapter 3. For convenience, the diagram for the final version of this model is presented here as Figure 15.6. Before fitting the model in the figure to the data summarized in Table 15.6, I'll briefly recap how we (i.e., Sauvé et al., 2019) arrived at the final version: The sample consisted of 193 patients with nonaffective psychotic disorders who attended a treatment program at a large health institute in Montréal. A total of 61 were classified as first episode of psychosis (FEP) and the rest, or 132 patients, were classified as exhibiting recurrent multiple episodes (MEP). Data from both groups were analyzed together for this example. The patients were administered a total of 12 tests from the computer-administered CogState Schizophrenia Battery (CSB) (Pietrzak et al., 2009). Patient unawareness and misattribution of four symptoms—hallucinations, delusions, flat affect, and asociality—were rated in interviews based on the Scale to Assess Unawareness of Mental Disorder (SUMD) (Amador et al., 1993). Scores in each area are summed, and higher total scores indicate greater unawareness (i.e., less awareness of symptoms).


FIGURE 15.6. Final latent variable model of cognitive capacity and symptom unawareness. Values of identifying constraints are shown, including scaling constants for factors or error terms (1) and the error variance for the single indicator of symptom unawareness (.360). ISL, International Shopping List; ISLR, International Shopping List Delayed Recall; GML, Groton Maze Learning task; GMR, Groton Maze Learning task Delayed Recall; OCL, One-Card Learning task; CPAL, Continuous Paired Associate Learning task; SUMD, Scale to Assess Unawareness of Mental Disorder. Shown in compact graphical symbolism for error terms and disturbances. From "Cognitive Capacity Similarly Predicts Insight into Symptoms in First- and Multiple-Episode Psychosis," by G. Sauvé et al., 2019, Schizophrenia Research, 206, p. 239. Copyright © 2019 Elsevier B.V. Adapted with permission.

TABLE 15.6. Input Data (Correlations, Standard Deviations) for Analysis of a Partial
Structural Regression Model of Cognitive Level and Symptom Unawareness
Indicator 1 2 3 4 5 6 7
Cognitive level
1. ISL —
2. ISLR   .753 —
3. GML   .329 .334 —
4. GMR   .316 .307 .672 —
5. OCL   .398 .347 .411 .451 —
6. CPAL   .430 .439 .526 .532 .470 —

Symptom awareness
7. SUMD –.144 –.155 –.215 –.108 –.146 –.143 —

M 22.166 7.083 6.073 4.370 .951 1.241 2.933

SD 4.713 2.670 1.664 1.009 .116 .121 1.225

Note. Input data are from Sauvé et al. (2019), N = 193. ISL, International Shopping List; ISLR, International Shopping List
Delayed Recall; GML, Groton Maze Learning task; GMR, Groton Maze Learning task Delayed Recall; OCL, One-Card
Learning task; CPAL, Continuous Paired Associate Learning task; SUMD, Scale to Assess Unawareness of Mental Disorder.
Means are reported but not analyzed.


The analysis was conducted over two steps, measurement and structural. Because we did not expect all 12 CSB tests to measure general ability in the same way, we analyzed at Step 1 a series of single-factor CFA models for these tasks. After retaining an adequate measurement model for CSB tests, in the Step 2 analysis we specified an SR model where the SUMD is represented as the single indicator with an error term for a symptom unawareness factor, which is regressed on the cognitive ability factor. The SR model just described (i.e., Figure 15.6) is equivalent to a CFA model where the two factors, cognitive ability and symptom unawareness, simply covary. Because general cognitive ability is relatively stable among adults and predates illness insight, we argued that the SR model analyzed at Step 2 is more specific and informative than its equivalent CFA version.

Before inspecting the data, we developed two guidelines for respecifying a single-factor CFA model for CSB tests at the first analysis step (Sauvé et al., 2019):

1. An indicator should share at least 30% of its total variance with the common factor, which corresponds to a standardized factor loading of at least .55 (i.e., .55² = .30). This minimum proportion of common variance seemed consistent with some prior factor analytic results with CSB tests (Chou et al., 2015); specifically, a more demanding criterion, such as .50 for a minimum proportion of common variance, would be unrealistic.

2. After eliminating CSB tests from the model that did not meet the first guideline just described (i.e., ≥ 30% of variance is common), error covariances could be added to the model, but only for pairs of scores from the same test (e.g., ISL and ISLR in Figure 15.6).

Listed in Table 15.1 as analysis 3 are the syntax and output files for analyses of single-factor CFA measurement models in lavaan. The initial model included all 12 tests of the CSB, but not all attained a standardized loading of at least .55. A total of three tests were removed from the initial model, and the respecified model was fitted to the same data. This process was iterative: If any indicator in a respecified model attained a standardized loading < .55, then that indicator was removed from the model, and so on. A single-factor CFA model with the six indicators represented in Figure 15.6 met the criterion for minimum factor loadings, but the fit to the data was relatively poor until the two error covariances depicted in the figure were added to make up the final measurement model for CSB tests. Values of selected fit statistics for the final single-factor CFA model for just the CSB tests are reported next. The estimation method is default ML, and the analysis converged to an admissible solution:

chiML(7) = 5.220, p = .633
RMSEA = 0, 90% CI [0, .073]
CFI = 1.000, SRMR = .023

The final single-factor measurement model passed the exact-fit test, values of approximate fit indexes are not problematic, all absolute correlation residuals are < .10, and no standardized residuals are statistically significant—see the output file for analysis 3 in Table 15.1.

Given an adequate measurement model of general cognitive ability for CSB tests, the SR model in Figure 15.6 was analyzed at Step 2. Total scores on the symptom unawareness measure (SUMD) were specified as the single indicator of an underlying symptom unawareness factor. The unstandardized error variance was fixed to equal the product of the complement for the median SUMD reliability coefficient reported by Raffard et al. (2010), or (1 – .76) = .24, and the sample variance, or 1.225² = 1.501 (Table 15.6). The value of the product just described, or .360, is represented in Figure 15.6 as the unstandardized error variance for the SUMD, which is specified as a fixed parameter. Fixing the unstandardized loading of the SUMD single indicator on its factor to 1.0 completes the single-indicator respecification that controls for unreliability.
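To make the specification concrete, here is a minimal lavaan sketch of the model in Figure 15.6 fitted to the summary data in Table 15.6. The book's actual script is listed as analysis 4 in Table 15.1; the variable names here simply follow the table abbreviations, and details of the real syntax file may differ:

library(lavaan)

# correlations (lower triangle) and SDs from Table 15.6, N = 193
lower <- '
  1
   .753 1
   .329  .334 1
   .316  .307  .672 1
   .398  .347  .411  .451 1
   .430  .439  .526  .532  .470 1
  -.144 -.155 -.215 -.108 -.146 -.143 1 '
covmat <- getCov(lower,
                 sds = c(4.713, 2.670, 1.664, 1.009, .116, .121, 1.225),
                 names = c("ISL", "ISLR", "GML", "GMR", "OCL", "CPAL", "SUMD"))

model <- '
  # single factor for the six CSB tests
  Cognitive =~ ISL + ISLR + GML + GMR + OCL + CPAL
  # error covariances for pairs of scores from the same test
  ISL ~~ ISLR
  GML ~~ GMR
  # single-indicator factor for symptom unawareness
  Unaware =~ 1*SUMD
  SUMD ~~ .360*SUMD    # fixed: (1 - .76) x 1.225^2 = .360
  # structural part
  Unaware ~ Cognitive
'
fit <- sem(model, sample.cov = covmat, sample.nobs = 193)
summary(fit, standardized = TRUE, fit.measures = TRUE)

The line SUMD ~~ .360*SUMD is what forces the fixed, rather than freely estimated, error variance for the single indicator.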
Exercise 5 asks you to verify that df M = 12 for Figure 15.6, and Exercise 6 asks you to estimate the power of the chi-square test for this analysis where N = 193 in the MacCallum–RMSEA method. (Hint: It is low.)

Listed for analysis 4 in Table 15.1 are the input and output files for fitting the partial SR model in Figure 15.6 to the summary data in Table 15.6. The analysis in lavaan with default ML estimation converged to an admissible solution, and values of selected global fit statistics listed next do not suggest an obvious problem with global fit:

chiML(12) = 9.885, p = .626
RMSEA = 0, 90% CI [0, .062]
CFI = 1.000, SRMR = .027

The value of the largest absolute correlation residual for the partial SR model is .069 and no standardized residuals are significant—see the output file for this analysis (Table 15.1). Given all the results just summarized, the model in Figure 15.6 is retained.

Parameter estimates for the partial SR model of cognitive level and symptom unawareness are reported in Table 15.7. The cognitive ability factor explains, on average, about 41% of the variance in its six indicators (AVE = .414). As expected, (1) the standardized loading for the SUMD single indicator, .871, equals within rounding error the square root of the reliability coefficient specified in the analysis for this measure, or .76. Also, (2) the complement of the squared standardized factor loading for the SUMD single indicator, or 1 – .871², is equal within rounding error to the complement of the reliability coefficient specified for this measure, or 1 – .76 = .24. The standardized disturbance variance for the symptom unawareness factor is .936. This means that the cognitive ability factor explains a total of 1 – .936 = .064, or 6.4%, of the total variance in the symptom unawareness factor. The unstandardized regression coefficient is –.105; thus, for every 1-point increase on the cognitive factor, the symptom unawareness factor is expected to decrease by .105 points (i.e., symptom understanding increases by this amount). In the standardized solution, an increase in cognitive level of a full standard deviation predicts a decrease in symptom unawareness of about a quarter of a standard deviation (.254).

These results just summarized suggest that although patients suffering from psychotic disorder who have greater cognitive capacity as measured with computer-administered tests have better comprehension of their symptoms, the magnitude of this relation is not extraordinarily large (R² = .064). Results of other analyses by Sauvé et al. (2019) based on the results in Table 15.7 indicate that the effect of cognitive level on symptom awareness did not vary appreciably by patient status (first episode vs. multiple episodes). Exercise 7 asks you to respecify the model in Figure 15.6 by representing the SUMD variable as one would in a manifest variable path model, such as Y4 in Figure 15.1(a), and then compare the results with those in Table 15.7 for the symptom unawareness outcome specified as an observed variable, not as a common factor with a single indicator.

SUMMARY

Structural regression (SR) models in SEM have explicit measurement and structural components. The measurement part of the model associates common factors and indicators, just as in CFA models. But SR models also have a structural part that represents hypotheses about direct or indirect causal effects. If every variable in the structural model is a common factor, then (1) the SR model is fully latent and (2) causal hypotheses concern common factors that approximate latent variables; otherwise, the structural model has at least one single indicator for a theoretical concept and the whole model is partially latent. It is possible to specify that the error variance of a single indicator is fixed to a nonzero constant provided by the researcher. The constant is usually the product of the observed unstandardized variance for the indicator and the proportion of this variance that is due to measurement error. This proportion can be specified as the complement of a score reliability coefficient for the indicator, preferably estimated in the researcher's own sample. This specification forces the computer to control for measurement error in a single indicator when estimating parameters for the structural model. In order for an SR model to be identified, both its measurement and structural components must be identified. Two-step modeling in the analysis helps to isolate potential sources of poor model fit into its two components, measurement versus structural. Specifically, an acceptable measurement model is required before going to the second step, which involves testing hypotheses about the structural model. The evaluation of SR models represents the apex in the SEM family for analyses that assume reflective measurement. The next chapter deals with composite models as analyzed in techniques for composite SEM that generally assume formative measurement. Depending on the research context, a composite model may be a viable—or even preferred—alternative to an SR model.


TABLE 15.7. Maximum Likelihood Estimates for a Partial Structural Regression Model of Cognitive Level and Symptom Unawareness

                                Unstandardized        Standardized
Parameter                      Estimate      SE      Estimate      SE

Factor loadings
Cognitive
  ISL                            1.000       —          .546     .061
  ISLR                            .555     .059         .535     .062
  GML                             .428     .070         .663     .055
  GMR                             .261     .042         .672     .054
  OCL                             .029     .005         .634     .069
  CPAL                            .037     .006         .778     .067
Unawareness
  SUMD                           1.000       —          .871     .014

Error variances and covariances
  ISL                           15.498    1.784         .701     .067
  ISLR                           5.059     .587         .713     .066
  ISL ↔ ISLR                     5.764     .873         .651     .045
  GML                            1.543     .206         .560     .072
  GMR                             .562     .076         .555     .072
  GML ↔ GMR                       .383     .102         .412     .075
  OCL                             .008     .001         .598     .087
  CPAL                            .006     .001         .395     .104
  SUMD                            .360       —          .241     .025

Factor variance
  Cognitive                      6.599    1.806        1.000       —

Direct effect and disturbance variance
  Cognitive → Unaware            –.105     .040        –.254     .088
  Unaware                        1.060     .146         .936     .045

Note. Standardized estimates for disturbance variances and error variances are proportions of unexplained variance. ISL, International Shopping List; ISLR, International Shopping List Delayed Recall; GML, Groton Maze Learning task; GMR, Groton Maze Learning task Delayed Recall; OCL, One-Card Learning task; CPAL, Continuous Paired Associate Learning task; SUMD, Scale to Assess Unawareness of Mental Disorder.
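The two identities noted in the text for the single-indicator specification are easy to verify directly; a quick check in R:

sqrt(.76)     # .872: expected standardized loading for the SUMD single indicator
1 - .871^2    # .241: expected standardized error variance for the SUMD
1 - .936      # .064: proportion of unawareness variance explained by cognitive ability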


LEARN MORE

The classic work by Anderson and Gerbing (1988) outlines the rationale of two-step modeling, Bentler (2000) describes types of models where neither two-step nor four-step modeling would be ideal, and Ropovik (2015) outlines the potential hazards of ignoring the chi-square test followed by model retention based solely on values of approximate fit indexes when testing models with common factors.

Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin, 103(3), 411–423.

Bentler, P. M. (2000). Rites, wrongs, and gold in model testing. Structural Equation Modeling, 7(1), 82–91.

Ropovik, I. (2015). A cautionary note on testing latent variable models. Frontiers in Psychology, 6, Article 1715.

EXERCISES

1. Prove that df M = 25 for Figure 15.1(b).

2. Show that df M = 39 for the full SR model in Figure 15.3.

3. Interpret the standardized path coefficients in Table 15.5 for effects on classroom adjustment.

4. Calculate R² for each of the two endogenous factors in Table 15.5.

5. Show that df M = 12 for Figure 15.6.

6. Use the MacCallum–RMSEA method to estimate the power of the chi-square test for Figure 15.6 for a = .05, e0 = 0, and e1 = .05.

7. Respecify Figure 15.6 by representing SUMD as a single indicator in the same way as variable Y4 in the manifest variable path model in Figure 15.1(a) (i.e., with just a disturbance but no separate measurement error term). Fit the respecified model as just described to the data in Table 15.6. Describe any differences in the results compared with those in Table 15.7 for Figure 15.6.

16

Composite Models

Composite-based methods for SEM, referred to here as composite SEM, approximate formative concepts
without disturbances, or emergent variables. The measurement models analyzed are composite–formative
with composite indicators (M → C block; e.g., Figure 13.2(c)). Composites constructed from error-prone
indicators in such models contain measurement error, too. Such models are practically always identified in
composite SEM because of how they are specified. Specifically, such models are parameterized in terms of
the correlations among indicators, weights for sets of indicators that form a composite, and path coefficients
for direct effects between composites. The researcher has relatively little freedom to specify models within
this framework, so analysis options are more or less limited to identified models (Henseler, 2021). The same
framework generally precludes approximation in composite SEM of formative concepts with error terms, or
composite latent constructs, based on causal–formative models with causal indicators (M → L block; e.g.,
Figure 13.2(b)), where measurement error is represented at the construct level. Such models are not generally
identified, given their parameterization as just described.
More traditional SEM techniques, such as CFA, approximate latent variables with common factors
analyzed in reflective models with effect indicators (L → M block; e.g., Figure 13.2(a)), where measurement
error is represented at the level of the indicators. Assumptions for effect indicators in reflective models are not
generally appropriate for composite or causal indicators in formative models, and vice versa (Chapter 13).
It is theoretically possible to analyze causal–formative models for composite latent constructs in traditional
SEM, but such models have quite strict requirements for identification that are described later in this chapter.
Both kinds of models just mentioned are parameterized in terms of the variances and covariances of the
exogenous variables and path coefficients and factor loadings for endogenous variables when means are
not analyzed (e.g., Rule 7.1). Software for traditional SEM, widely available since the mid-1970s (Chapter
5), has extensive capabilities for evaluating both local and global model fit in the framework just described
(e.g., Chapters 14–15).

Composite–formative models for emergent variables can also be analyzed in computer tools for traditional SEM, but until recently doing so has been relatively difficult. This is because "translating" the parameterization of composite models in terms of correlations, weights, and path coefficients into the parameterization based on variances, covariances, path coefficients, and loadings in standard SEM computer software can be tricky (e.g., Kline, 2013c). Special software for analyzing composite structural equation models—also described later in this chapter—has been available since the mid-1980s, but it generally had limited capabilities for global fit testing, which narrowed the role of composite SEM to more exploratory than confirmatory
I thank Jörg Henseler for his astute comments on a draft of this chapter. Any remaining shortcomings of this presentation are entirely mine.


studies. But recent technical developments and conceptual refinements summarized next are quickly changing the status quo:

1. Newly developed methods for global fit testing of composite models are available in both commercial and free software for composite SEM that analyze composite–formative models for emergent variables. Information about local model fit, or the residuals, can also be obtained when using these computer tools.

2. There is now a method to specify and analyze composite–formative models in standard SEM computer tools that is (a) relatively straightforward and (b) capitalizes on the advantages of global estimation methods, such as maximum likelihood (ML), in deriving parameter estimates and evaluating both local fit and global model fit for composite models.

I believe the developments just summarized render obsolete any broad characterization of techniques for composite SEM as a junior, kiddy, or wannabee version of traditional SEM (Chapter 1). Specifically, I believe these two members of the SEM family will eventually merge as simply options for proxies (i.e., common factors or composites) in measurement models that approximate reflective or formative constructs. In the meantime, researchers who are familiar with both composite and more traditional SEM can test an even wider range of hypotheses about causal effects between target concepts. Now, entire books are written about composite-based SEM, including Hair et al. (2022), Henseler (2021), Hwang and Takane (2015), and Latan and Noonan (2017), among others, so it is impossible to cover the whole range in a single chapter. Instead, the goal is to provide you with a sense of both the promise and the challenges of contemporary approaches to composite SEM. Software options are described, and there is a detailed example where reflective models and composite models are fitted to the same data.

MODERN COMPOSITE ANALYSIS IN SEM

Summarized next and elaborated afterward are two contemporary approaches to analyzing composite models in SEM that support formal testing of global model fit and the evaluation of local model fit through inspection of the residuals:

1. The combination of the partial least squares path modeling (PLS-PM) algorithm (Wold, 1982) for estimating parameters of composite models that analyze correlation matrices and recently developed techniques for inferential testing of global model fit based on nonparametric bootstrapping (Schuberth et al., 2018). Estimators in the PLS-PM algorithm include ordinary least squares (OLS) for recursive models and two-stage least squares (2SLS) with instruments for nonrecursive models, among other options. These estimators and bootstrap tests and confidence intervals do not assume normality. Special software is required for this option.

2. Henseler–Ogasawara (HO) specification (Schuberth, 2021) provides a method for researchers to include composites in their models that can be analyzed with standard SEM computer tools such as Mplus, LISREL, Amos, and lavaan, among others. The HO specification also applies to models with both common factors that approximate latent variables and composites that approximate emergent variables.

An advantage of the second method just described is that the whole range of capabilities in traditional SEM, including global fit testing, evaluation of local fit, the option to analyze covariance matrices instead of correlation matrices, and modern techniques for dealing with missing data, such as full information maximum likelihood (FIML), is available. Another is that no special software is needed, which is convenient for researchers already familiar with standard SEM computer programs. Schuberth (2021) described additional methods for analyzing composites, but they are not as flexible as the two modern approaches demonstrated in this chapter.

DISAMBIGUATION OF TERMS

Newcomers to composite SEM encounter what can seem like a bewildering variety of overlapping names and terms for estimation algorithms or techniques. Thus, the aim of the discussion that follows is to help readers navigate some of the vocabulary in this area. The PLS-PM algorithm is an iterative computational method for weighting composites and estimating asso-


ciations between them with regression-based methods that analyze a single equation at a time. It was developed by Wold (1982) and later expanded by Lohmöller (1989), among others. The PLS-PM algorithm is the main calculational engine for the technique referred to by the same name, or just PLS-PM, which is based on Wold's (1982) ideas about "soft modeling" as a more flexible alternative to "hard modeling" (i.e., traditional SEM), one that at the time emphasized predictive relations estimated with composites over causal relations estimated with common factors (Falk & Miller, 1992). Early computer programs for the PLS-PM technique, such as Lohmöller's (1984) LVPLS (Latent Variable Path Analysis with Partial Least Squares), added relatively few capabilities beyond the basic PLS-PM algorithm. For example, there was no inferential test of global model fit in the original algorithm or a method to take explicit account of measurement error in composites. Lack of control for measurement error in the original PLS-PM algorithm meant that (1) absolute estimates of correlations among constructs are generally too small, but (2) absolute estimates of path coefficients for causal effects between them can be too small or too large (Dijkstra & Henseler, 2015b). It might seem that the original PLS-PM algorithm is completely vulnerable to the biasing effects of measurement error when composites are analyzed, but this is not exactly true—see Topic Box 16.1.

But today there are so many extensions of the original PLS-PM algorithm that the phrase "partial least squares path modeling" has relatively little meaning without elaboration. For example, Dijkstra and Henseler's (2015a, 2015b) method of consistent PLS (PLSc) is for estimating population reflective models when latent variables are approximated with composites, not common factors as in CFA. The method disattenuates correlations between composites estimated in the PLS-PM algorithm for unreliability in observed variables. Some computer programs that implement PLSc, described in the next section, permit the researcher to manually specify the reliabilities of composite scores, if such coefficients are already known. Otherwise, composite reliability can be estimated in the data by the coefficient rA, or rho-A, which is a consistent estimator when the population model is reflective but the proxies are composites.1 Rademaker et al. (2019) described the estimation of measurement error correlations within blocks of indicators in the PLSc method. Next, consistent estimates for path coefficients are generated from construct correlations disattenuated for measurement error. Results of computer simulations by Dijkstra and Henseler (2015a, 2015b) indicated that bias in PLSc parameter estimates was generally similar to that of the ML estimator in common factor models with a slight advantage in PLSc results for nonnormal data.

1 The subscript ("A") in the symbol rA designates Mode A for generating composite scores in the PLS-PM algorithm. Options for weighting indicators in this algorithm are described later in this chapter.

Three additional extensions of the PLS-PM algorithm are briefly described next: Schamberger et al. (2020) described robust PLS, in which the minimum covariance determinant (MCD) method is used to estimate construct correlations while minimizing the distorting effects of outliers. The method iteratively selects representative subsamples unaffected by outliers to estimate population variances and covariances. The estimator is robust up to the point where just under 50% of the data consists of outliers. In contrast, Pearson correlations can be greatly distorted by a single outlier even in large samples. Robust construct correlations are then analyzed in the PLS-PM algorithm to generate estimates of path coefficients. Their robust PLSc method is both robust and consistent when approximating reflective constructs with composites instead of common factors. The method of confirmatory tetrad analysis (CTA) is incorporated in a modified PLS-PM algorithm by Gudergan et al. (2008) called CTA-PLS. It relies on a bootstrap vanishing tetrad test (VTT) to distinguish a set of effect indicators for a reflective construct from a set of composite indicators for a formative construct; specifically, failing the VTT does not support the hypothesis of reflective measurement. But any respecification of the measurement model should be consistent with theory, not just analysis results (Chapter 13). Henseler (2021, chap. 5) described other extensions of the PLS-PM algorithm.

We must also disambiguate the term confirmatory composite analysis (CCA), because it is used to describe two very different approaches for analyzing composite models. The original method of CCA was developed by Henseler et al. (2014) and Schuberth et al. (2018) as a composite-based analogue to CFA. Both CCA and CFA share the same basic steps—specification, identification, analysis, respecification (if necessary), and reporting—and both feature inferential tests of global model fit. The original global test in CCA is based on an approach by Beran and Srivastava (1985) for deriving confidence intervals in nonparametric bootstrapping from covariance matrices. The same


TOPIC BOX 16.1

Composites and Measurement Error

The impact of measurement error is reduced when composites are estimated as linear combinations over ≥ 2 indicators even in methods such as the basic PLS-PM algorithm, where error variances are not directly estimated for observed variables. Suppose that variables X1 and X2 are continuous. The variance of the sum C = X1 + X2 can be expressed as

$$ s_C^2 = s_{X_1}^2 + s_{X_2}^2 + 2\,\mathrm{cov}_{X_1 X_2} \tag{16.1} $$

which equals the sum of the total variance for each of X1 and X2 and twice their covariance. Each total variance in Equation 16.1 reflects common (shared) variance and unique variance. Measurement error is part of unique variance, but only common variance in X1 and X2 contributes to their covariation. Thus, common variance is counted basically four times in Equation 16.1—twice within the total variances and twice among the covariances—but unique variance is counted only twice within the total variances for a ratio of 4:2, or 2:1.

A general expression for the variance of C summed over n ≥ 2 elements Xi is

$$ s_C^2 = \sum_{i=1}^{n} s_{X_i}^2 + 2 \sum_{i<j} \mathrm{cov}_{X_i X_j} \tag{16.2} $$

Equation 16.2 says that with each additional element in a composite beyond two, common variance increases quadratically while unique variance increases linearly. With n = 3 indicators, for example, common variance is counted 9 times, or 3 times within the total variances and 6 times among the covariances (2 × 3(2)/2 = 6). But unique variance is counted just 3 times within the total variances for a ratio of 9:3, or 3:1, for the overall contribution of common versus unique variance. Exercise 1 asks you to demonstrate that the ratio of common variance versus unique variance contributed to a composite based on summing n = 4 elements is 4:1. The facts just stated explain how composite reliability increases geometrically with its number of elements (Rigdon, 2012).

Equations 16.1 and 16.2 describe composites with unit-weighted (i.e., 1.0) elements. Differential weights can change the relative contribution of common variance versus unique variance. For example, the variance of C = w1X1 + w2X2 is

$$ s_C^2 = w_1^2\, s_{X_1}^2 + w_2^2\, s_{X_2}^2 + 2\, w_1 w_2\, \mathrm{cov}_{X_1 X_2} \tag{16.3} $$

where w1 and w2 are weights. If w1 ≠ w2 and the greater weight is applied to the element, X1 or X2, with the most reliable scores, then the effect of underweighting unique variance compared with common variance is exaggerated, which reduces the overall effect of measurement error even more. As Rigdon (2012) noted, the creation of a composite itself partly achieves the goal of controlling for measurement error in observed variables.
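To make the counting argument above tangible, here is a small R simulation (ours, not from the sources cited in this box) that builds unit-weighted composites from n = 2, 3, and 4 parallel indicators, each with 50% common variance; the proportion of composite variance that is common should approach n/(n + 1), consistent with the 2:1, 3:1, and 4:1 ratios just described:

set.seed(1)
n_cases <- 1e5
common <- rnorm(n_cases)            # shared (common) component, variance 1
for (n in 2:4) {
  # each indicator = common part + unique part (both with unit variance)
  X <- sapply(1:n, function(i) common + rnorm(n_cases))
  C <- rowSums(X)                   # unit-weighted composite
  # variance in C due to the common component is var(n * common) = n^2
  prop_common <- n^2 * var(common) / var(C)
  cat("n =", n, " proportion of common variance =", round(prop_common, 3), "\n")
}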


method underlies the Bollen–Stine bootstrap as it is more often called in the behavioral sciences (Chapter 9). Schuberth et al. (2018) developed parallel versions for sampling from correlation matrices, which are generally analyzed in composite SEM. The fit statistic is the SRMR, which measures the approximate average absolute discrepancy between the observed and model-implied correlations (Chapter 10). The null hypothesis for bootstrapped significance tests based on the SRMR and related distance measures in CCA is

$$ H_0\!: \mathbf{P} = \mathbf{P}(\boldsymbol{\theta}_z) \tag{16.4} $$

which predicts no difference between the population correlation matrix P (uppercase Greek letter rho) and the predicted correlation matrix P(θz), where θz represents model parameters estimated from standardized scores (i.e., normal deviates).

The Henseler et al. (2014) and Schuberth et al. (2018) technique of CCA also includes the features summarized next: Unlike the basic PLS-PM algorithm, estimation in CCA does not rely solely on OLS or 2SLS estimators. This is because other estimators, such as full-information methods for composites based on generalized structured component analysis (Hwang & Takane, 2015) or ML estimators when composite models are represented in HO specification and analyzed using standard SEM computer tools, among others, are available in CCA. There are specific identification rules in CCA for counting numbers of observations and free parameters, scaling unmeasured variables, and determining whether the measurement or structural parts of composite models are identified (Henseler et al., 2021, chap. 4). Thus, identification is just as important for composite models analyzed in CCA as it is for reflective models analyzed in CFA.

Hair et al. (2020) also used the term "confirmatory composite analysis," but it refers to sets of criteria, analysis strategies, and rules of thumb for statistical measures of model quality applied to measurement or structural parts of a composite model within the PLS-PM framework. As discussed later in the chapter, there are methods in the PLS-PM algorithm to weakly simulate reflective measurement through selection of options for weighting composites. But reflective measurement models analyzed in techniques such as CFA are quite different from their simulated counterparts in the PLS-PM algorithm. In contrast, it is composite models for emergent variables that are analyzed in CCA by Henseler et al. (2014) and Schuberth et al. (2018). There is less emphasis on global fit testing in the methods described by Hair et al. (2020), but there is a greater emphasis on evaluating the predictive performance of models, such as out-of-sample prediction for new cases. Hubona et al. (2021) referred to the suite of Hair et al. (2020) methods and statistics as PLS-CCA to distinguish it from CCA by Henseler et al. (2014) and Schuberth et al. (2018).

SPECIAL COMPUTER TOOLS

Three modern computer programs that can analyze a wide range of composite models with many of the options and advances just described are presented next. Two are commercial products, but the third option is free:

1. The R package cSEM (composite SEM) (Rademaker & Schuberth, 2022) analyzes linear, nonlinear, multiple-group, and hierarchical models with multiple estimators, including derivatives of the PLS-PM algorithm and methods based on canonical correlation analysis, principal component analysis, and generalized structured component analysis, among other options.2 It can estimate error correlations within a block of indicators using consistent methods when composites approximate reflective constructs. Nonparametric tests of global model fit are available, and the technique of CCA is supported (see Henseler, 2021, chap. 4). Models are specified using lavaan syntax. A different R package with comparable features is seminr (Ray et al., 2022).

2. The commercial program ADANCO (Advanced Composite Modeling) (Henseler, 2022) has a graphical user interface.3 It runs on Windows and Apple (macOS) platform computers. The user specifies the model by drawing it on screen without the need for syntax. The ADANCO program analyzes a wide range of composite models, including hierarchical models and models with mediators or moderators. Multiple estimators, including variations of the PLS-PM algorithm and methods based on principal components or canonical correlation, among others, are available plus nonparametric tests of global model fit. The technique of CCA is supported (see Henseler, 2021, chap. 8). The program can be downloaded in a 1-year trial version for a nominal cost.

2 https://m-e-rademaker.github.io/cSEM/
3 https://www.composite-modeling.com/


3. A popular commercial program with a graphical user interface that runs on Windows or Apple (macOS) platform computers, SmartPLS (Ringle et al., 2022) analyzes models with estimators based on derivatives of the PLS-PM algorithm or other methods, including least squares methods based on total scores.4 Mediation, moderation, multiple-group, hierarchical, and nonlinear models can be analyzed. The techniques of PLS-CTA and PLS-CCA are available plus methods that assess out-of-sample prediction and/or detect unobserved heterogeneity that could potentially limit the generalizability of the results (Hair et al., 2022, chap. 8). There are two versions: Professional, which requires the purchase of a license, and Student, which is free but features more limited capabilities and is restricted to datasets with N ≤ 100 cases. A future version of SmartPLS will include capabilities for traditional (covariance-based) SEM. This development will support the ongoing integration of composite SEM and traditional SEM. (The HO specification is another such development.)

4 https://www.smartpls.com/
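As a flavor of how such analyses look in code, here is a hypothetical cSEM sketch for a composite model like the one developed later in this chapter; dat stands for a raw data frame of indicator scores, and all construct and indicator names are our placeholders rather than names from any script cited in this chapter:

library(cSEM)

model <- '
  # composites (emergent variables) are defined with <~
  Accult  <~ accscale + genstat + perclife
  SES     <~ educ + income
  Stress  <~ interpersonal + job
  Depress <~ scl90d           # single-indicator composite
  # structural (inner) model
  Stress  ~ Accult
  Depress ~ Stress + SES
'
out <- csem(.data = dat, .model = model)   # default weighting: PLS-PM
summarize(out)   # weights, loadings, and path coefficients
testOMF(out)     # bootstrap-based test of overall model fit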

FIGURE 16.1. Partial structural regression model of stress and depression as a function of acculturation and socioeconomic status. The error variance for the single indicator of depression is fixed to equal 1.369.


MOTIVATING EXAMPLE

A partial structural regression (SR) model that is a subset of a larger model analyzed by Shen and Takeuchi (2001, p. 406) for effects of acculturation and SES on stress and depression is introduced next. The participants were 983 native-born Chinese Americans (5.5% of the sample) and immigrants of Chinese descent (94.5% of the sample) living in Southern California, United States (U.S.). Their mean age was 38.7 years, a total of 56% were men and 44% were women, and all participants were employed.

Presented in Figure 16.1 is the model analyzed in this example. Its structural part represents the hypotheses that depression is directly affected by stress and SES and that acculturation indirectly affects depression through stress. In the measurement part of the model, indicators of acculturation include a self-report measure about daily language use, patterns of social contact, and participation in cultural activities, where higher scores indicate greater adaptation to U.S. culture. The reliability coefficient for the self-report scale is rXX = .97. Other acculturation indicators included the percentage of lifetime spent in the United States and generation status, or the length of time the respondent's family resided in the United States. Both variables just mentioned were expected to share unique variance (see the error covariance in the figure). This is because, except for first-generation immigrants, most participants beyond the first generation would have spent nearly all of their lives in the United States (Shen & Takeuchi, 2001).

Indicators of SES in Figure 16.1 include annual family income and level of education. Participants completed two measures about stress, one about interpersonal (relationship-oriented) stressors and the other about job-related stressors. Higher scores indicate greater levels of stress in each area. The single indicator for depression is the Symptom Checklist-90 Depression scale (SCL90D), where higher scores indicate greater levels of clinical depression (Derogatis et al., 1976). Its reliability coefficient is rXX = .90. Given a sample variance of 3.70² for the SCL90D, the error variance for this single indicator is fixed to equal (1 – .90) 3.70², or 1.369 (see Figure 16.1). Exercise 2 asks you to verify that Figure 16.1 is identified, and Exercise 3 asks you to show that df M = 16.

The data for this analysis are summarized in Table 16.1. These data are based on, but not exactly equivalent to, data analyzed by Shen and Takeuchi (2001) for their larger model. Their original data matrix is ill-scaled because the ratio of the largest over the smallest variance is nearly 200. There are Heywood cases that can crop up in analyses of the original data for the acculturation scale indicator (see Figure 16.1); specifically, its error variance can be negative. I believe this happens because the error variance for this indicator is nearly zero, given rXX = .97, and negative error variances are within the limits of sampling error for a population variance that is close to zero. To avoid technical problems in this pedagogical analysis, I generated the correlations in Table 16.1 to match the basic patterns of associations in the original data matrix, and standard deviations in the table correspond to values based on rescaling the original variables to reduce heteroscedasticity (Kline, 2011, p. 279). Also reported in Table 16.1 are variance inflation factor (VIF) values for each set of multiple indicators. Given VIF ≤ 2.20 for all variables,

TABLE 16.1. Input Data (Correlations, Standard Deviations) for Analyses of Models of Stress and Depression as Functions of Acculturation and Socioeconomic Status

Indicator                 1      2      3      4      5      6      7      8    VIF
Acculturation
1. Acculturation scale    —                                                     1.93
2. Generation status     .44     —                                              1.43
3. Percent life U.S.     .69    .54     —                                       2.20
SES
4. Education             .21    .08    .16     —                                1.04
5. Income                .23    .15    .19    .19     —                         1.04
Stress
6. Interpersonal         .12    .08    .08    .08   –.03     —                  1.17
7. Job                   .09    .06    .04    .01   –.02    .38     —           1.17
Depression
8. SCL90D                .03    .02   –.02   –.07   –.11    .37    .46     —     —

SD                      3.60   3.30   2.45   3.27   3.44   2.99   3.58   3.70    —

Note. N = 983; VIF, variance inflation factor computed for each set of multiple indicators; SCL90D, Symptom Checklist-90 Depression scale.
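Block-wise VIF values like those in Table 16.1 can be reproduced from the correlations alone; a small R sketch follows (ours; the book's own computations are in analysis 3 of Table 16.2):

# VIF for each indicator within its block: 1 / (1 - R^2), where R^2 comes
# from regressing that indicator on the other indicators in the same set
vif_block <- function(R) {
  sapply(seq_len(ncol(R)), function(i) {
    r  <- R[i, -i, drop = FALSE]
    r2 <- as.numeric(r %*% solve(R[-i, -i, drop = FALSE]) %*% t(r))
    1 / (1 - r2)
  })
}
R_accult <- matrix(c(1,   .44, .69,
                     .44, 1,   .54,
                     .69, .54, 1), nrow = 3, byrow = TRUE)
round(vif_block(R_accult), 2)   # 1.93, 1.43, 2.20, matching Table 16.1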


TABLE 16.2. Analyses, Methods, Script and Data Files, and Packages in R for Models of Stress and Depression as a Function of Acculturation and Socioeconomic Status

Analysis                                                 Script files              R packages
1. Partial SR model with reflective measurement          shen-partial-sr.r         lavaan, semTools
   component
2. Zero-error variance model and partially reduced       shen-ses-composite.r      lavaan
   form model for SES only
3. Generate standardized scores for cases based on       shen-generate-scores.r    semTools, lavaan, psych
   sample correlations, compute VIF values
4. CCA of a composite model in the PLS-PM algorithm      shen-cca-ols.r            cSEM, semTools, lavaan, psych
   with the OLS estimator
5. CCA of a composite model with the ML estimator        shen-cca-ml.r             lavaan
   for HO specification

Note. Output files for all analyses have the same names except the extension is ".out." The file with standardized scores generated in analysis 3 is shen.csv. CCA, confirmatory composite analysis; PLS-PM, partial least squares path modeling; OLS, ordinary least squares; ML, maximum likelihood; HO, Henseler–Ogasawara.

there is relatively little concern about extreme multicollinearity within each of the three sets of multiple indicators (there is a single indicator for depression).

Listed in Table 16.2 are the syntax and output files for fitting Figure 16.1 to the data in Table 16.1 in lavaan. The syntax, output, and data files for all analyses in the table can be downloaded from this book's website. To save space in this example, the model in analysis 1 was not analyzed over two steps, measurement and structural. As we shall see, the fit of the initial model is satisfactory, so there is no respecification phase. The analysis converged to an admissible solution with the default ML estimator. You should verify in the output file that Figure 16.1 has acceptable fit to the data. For example, the model passes the exact-fit test, or chiML(16) = 21.341, p = .166. The power of this test in the MacCallum–RMSEA method exceeds .99, so it is very likely over random samples that a model without perfect fit would be detected.5 Values of no other global fit statistic indicate deficient overall fit. Although two standardized residuals are significant at the .05 level, the corresponding absolute correlation residuals are ≤ .047. Thus, the significant standardized residuals in this analysis seem more attributable to the relatively large sample size than to absolute magnitudes of discrepancies between observed and predicted correlations.

5 Parameters: e0 = 0, e1 = .05, df M = 19, N = 983, a = .05.
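Checks like those just described are one-liners in lavaan; assuming fit1 stands for the fitted model object from analysis 1, something like the following would reproduce them:

# selected global fit statistics for the fitted lavaan model `fit1`
fitMeasures(fit1, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))
resid(fit1, type = "cor")   # correlation residuals
lavResiduals(fit1)          # adds standardized residuals and summaries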
Figure 16.1 has acceptable fit to the data. For exam- and education are specified as effect indicators that
ple, the model passes the exact-fit test, or chiML(16) = are caused by a reflective SES factor, but that logic is
21.341, p = .166. The power of this test in the MacCal- flawed: Income and education, among other unmea-
lum–RMSEA method exceeds .99, so it is very likely sured variables, cause SES, not the reverse.
over random samples that a model without perfect fit Unfortunately, it is impossible to analyze a model
would be detected.5 Values of no other global fit statistic such as Figure 16.2(a) where income and education are
indicate deficient overall fit. Although two standardized specified as cause indicators of a formative SES con-
residuals are significant at the .05 level, the correspond- struct with a disturbance. The reason is identification:
ing absolute correlation residuals are ≤ .047. Thus, the Although the formative SES factor is scaled by the ref-
erence variable method, Figure 16.2(a) is not identified,
5 Parameters: e0 = 0, e1 = .05, df M = 19, N = 983, a = .05. and MacCallum and Browne (1993) explained why:



FIGURE 16.2. Structural regression models with causal indicators for a latent socioeconomic status composite with a disturbance (a). Zero-error variance model with composite indicators for an SES composite with no disturbance (b). A partially reduced form model with no SES composite (c). Model (a) is not identified; models (b) and (c) are equivalent.

Unless a formative factor with cause indicators only has direct effects on at least two other variables in the model, its disturbance is not identified. This requirement is the 2+ emitted paths rule. Bollen and Davis (2009, pp. 503–504) described additional requirements for models that satisfy the 2+ emitted paths rule, but they are only sufficient, not necessary, conditions. Wait, there's more: In models with multiple indirect pathways from reflective factors to other such factors that pass through formative factors with disturbances, some path coefficients in the structural model may be underidentified (e.g., MacCallum & Browne, 1993, p. 537).

In Figure 16.2(a), the disturbance variance for the latent SES composite is underidentified because this formative factor with a disturbance emits a single direct effect. An option to identify this parameter is to add a direct effect from SES to stress. Now the formative SES factor would emit two direct effects (on stress and depression), which would satisfy the 2+ emitted paths rule. But the hypothesis that SES directly causes stress is not part of the original model in Figure 16.1, where SES has no direct effect on stress. A second option to identify the disturbance variance for SES in Figure 16.2 is to add effect indicators that are caused by SES, not the reverse. Possible examples include measures of perceived family wealth or financial security that are expected to depend on a common SES factor. An SES factor with two cause indicators and two effect indicators would be identified as a stand-alone multiple-indicators and multiple-causes (MIMIC) model (e.g., Figure 13.3), and the presence of at least two effect indicators would provide a testable implication: The two effect indicators are uncorrelated after controlling for their common SES factor, if their error terms are independent.

But it is not always possible to add variables, such as effect indicators for SES in this case, to a model after the data are collected. A third option to identify Figure 16.2(a) is to fix the disturbance variance for SES to zero, which drops this parameter from the model. The result would be Figure 16.2(b) with a formative SES factor with two composite indicators, income and education, but no disturbance. The SES "factor" is now just a linear combination of its two indicators, which is designated in the figure by the special hexagon symbol for a composite. That composite is scaled using the reference variable method by fixing the unstandardized path coefficient for the direct effect of education to 1.0. With no disturbance variance to estimate for SES, the 2+ emitted paths rule no longer applies, so Figure 16.2(b) is identified even though the SES composite emits a single direct effect. Bollen and Davis (2009) described models like Figure 16.2(b) as a zero-error variance model because a composite has no error term.

A disadvantage of the model in Figure 16.2(b) is that a single coefficient for the SES composite estimates the combined or weighted effect of education and income together on depression, while controlling for the direct effect of stress. Although it is possible, in theory, to compute separate indirect effects of education and income on depression through the SES composite, the interpretation of such effects would be problematic. This is because the SES composite in the figure represents the combined effect of both its indicators as a set; that is, it does not control for the effect of one variable versus the other on depression.

MacCallum and Browne (1993) noted that a composite in a zero-error variance model that emits a single path such as SES in Figure 16.2(b) can actually be deleted from the model with no change in overall fit. All causal paths that passed through the composite in the original model are replaced by direct effects in the new model with no composite. That new model, a partially reduced form model (Bollen & Davis, 2009), is an equivalent version of the original model with the composite. An example is the model in Figure 16.2(c), which has no SES composite. Instead, the depression factor is directly regressed on education and income, or the composite indicators of SES in Figure 16.2(b). The two models just mentioned are equivalent because any equation for a variable in Figure 16.2(b) that includes the composite, such as

SES → Depression

can be substituted with an equation where the composite is replaced by its indicators, or education and income, in Figure 16.2(c), with no change in model fit.

Table 16.2 for analysis 2 lists script and output files that fit each of Figures 16.2(b)–16.2(c) to the data in Table 16.1. Specification of the model in Figure 16.2(b), the zero-error variance model, features a special lavaan symbol for formative measurement, "<~," which is used in the command

SES <~ 1*educ + income

to regress the SES composite on education and income and to scale the composite. By default, the disturbance variance for SES is fixed to equal zero, which defines SES as a composite. Releasing this constraint would specify the formative SES factor with a freely estimated disturbance variance in Figure 16.2(a), but that model is not identified.
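Putting the pieces together, a minimal lavaan sketch of the zero-error variance model in Figure 16.2(b) fitted to the Table 16.1 summary data might look as follows; except for educ and income, which appear in the command above, the indicator names are our placeholders, and the book's actual script (analysis 2 in Table 16.2) may differ in details such as default covariances:

library(lavaan)

# correlations (lower triangle) and SDs from Table 16.1, N = 983
lower <- '
 1
 .44 1
 .69 .54 1
 .21 .08 .16 1
 .23 .15 .19  .19 1
 .12 .08 .08  .08 -.03 1
 .09 .06 .04  .01 -.02 .38 1
 .03 .02 -.02 -.07 -.11 .37 .46 1 '
covmat <- getCov(lower,
                 sds = c(3.60, 3.30, 2.45, 3.27, 3.44, 2.99, 3.58, 3.70),
                 names = c("accscale", "genstat", "perclife", "educ",
                           "income", "interpersonal", "job", "scl90d"))

model_b <- '
  Accult =~ accscale + genstat + perclife   # reflective factor
  genstat ~~ perclife                       # error covariance
  Stress =~ interpersonal + job             # reflective factor
  SES <~ 1*educ + income                    # composite, zero error variance
  Depress =~ 1*scl90d                       # single indicator
  scl90d ~~ 1.369*scl90d                    # fixed: (1 - .90) x 3.70^2 = 1.369
  Stress ~ Accult
  Depress ~ Stress + SES
'
fit_b <- sem(model_b, sample.cov = covmat, sample.nobs = 983)
summary(fit_b, standardized = TRUE, fit.measures = TRUE)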


In the output file for this analysis (see Table 16.2), you should verify that

1. Figure 16.2(b), the zero-error variance model, and Figure 16.2(c), the partially reduced form model, are equivalent models with identical global fit statistics and residuals.

2. The exact-fit test is passed, or chiML(15) = 22.294, p = .100, and the largest absolute correlation residual is .063 (i.e., both global and local fit are reasonable).

3. Values of R² for indicators of acculturation, stress, and depression and for the endogenous stress and depression factors are also identical for both models.

4. There is a single path coefficient for the direct effect of the SES composite on depression in Figure 16.2(b), but there are two coefficients, one each for the separate effects of education and income on depression in Figure 16.2(c), all controlling for stress. Their unstandardized values are very similar, or about –.095, which indicates that depression is lower when income or education is higher, again controlling for stress.

Let's continue our critique of the specification for reflective measurement in Figures 16.1–16.2 for the other two factors with multiple indicators, stress and acculturation. Both interpersonal and job-related stress are represented as outcomes of a reflective stress construct, but that logic is peculiar, too. Crosswell and Lockwood (2020) differentiated between exposures to stressful events, or stressors, and the responses to those events. Stressors are discrete events with the potential to alter or unsettle usual psychological functioning. In contrast, stress responses are the cognitive, behavioral, or physiological reactions elicited by stressful events. Thus, problems in relationships or at work are stressors, not stress reactions; that is, they are not effect indicators under this definition.

The specification of percent of lifetime and generation status in Figure 16.1 as effect indicators is also puzzling. This is because these variables are historical facts determined by family history, or by how long respondents lived in a particular country, not by any underlying reflective construct. At best, acculturation might be a MIMIC factor with two causal indicators, percent of lifetime and generation status, and a single effect indicator, or scores on the self-report acculturation scale. This MIMIC factor in a respecified version of Figure 16.1 would emit two paths, one for its effect indicator and the other for its direct effect on stress; thus, its disturbance variance would be identified. But without a second effect indicator for acculturation, there is no testable implication for the assumption of local independence, which requires at least a pair of effect indicators for the same common factor with uncorrelated error terms. Thus, the option to specify acculturation as a MIMIC factor is not pursued further. Next, we consider a composite-based alternative to the common factor model in Figure 16.1.

ALTERNATIVE COMPOSITE MODEL

Presented in Figure 16.3(a) is a composite model for emergent variables in the Shen and Takeuchi (2001) data set. Optional diagram elements are represented with dashed lines. The model assumes that (1) indicators for the same composite may covary, but (2) indicators of different composites, such as education for SES and the SCL90D for depression, are allowed to covary only through their respective composites. This is a model-wide proportionality constraint on correlations between indicators for different composites (Dijkstra, 2017). Note that the error correlation between the percent of life and generation status indicators in the common factor model of Figure 16.1 is "absorbed" by the correlation between the same two variables in the composite model of Figure 16.3(a). This is because any unique variance represented as shared by this pair of variables in Figure 16.1 is part of their overall correlation in Figure 16.3(a). The more compact diagram in Figure 16.3(b) features symbolism from the graphical user interface of the ADANCO computer program. Correlations among indicators for the same composite are assumed, but are not explicitly represented in the compact diagram. Likewise, the correlation between the exogenous composites, acculturation and SES, is assumed but not shown in the compact diagram.

You may have noticed in Figure 16.3 the absence of graphical symbols for indicator error terms or disturbances for the endogenous stress and depression composites, which are specified as outcomes of the acculturation and stress composites. This is because error or disturbance variances are not free parameters in the


FIGURE 16.3. Composite model of stress and depression as a function of acculturation and socioeconomic status as specified in the partial least squares path modeling (PLS-PM) algorithm (a). Compact graphical version (b). Dashed lines represent optional graphical elements. Dominant indicators are in boldface.


original PLS-PM algorithm. Thus, their values cannot be fixed, freed, or constrained in the analysis. Error correlations can be estimated in the PLSc method when composites estimate reflective constructs, but the idea of overlapping error variances does not really apply in a model for emergent variables like the one in Figure 16.3.

Outer Model

The measurement part of Figure 16.3 is also called the outer model or synthesis model.6 The symbols wi estimate the contribution of each indicator to its respective composite. Methods to generate weights for the outer model in the PLS-PM algorithm are described in the next section. Weights for single indicators are an exception—their weights are 1.0, or unit weights. For example, the weight for the single indicator of depression in the figure is w8 = 1.0, which says that indicator and composite are identical.

Weights in the PLS-PM algorithm are usually applied to standardized variables—raw scores converted to normal deviates—which all have the same variance, or 1.0. An advantage is that weights for standardized variables are not influenced by differences in raw score metrics over indicators for the same composite. A disadvantage is that the basis for standardization—the standard deviations in a particular data set—is not invariant to sampling error (i.e., those values change over samples). Weights are usually scaled by the computer so that all composites are standardized, too. In this way, variances of composites are fixed to 1.0, which scales them (i.e., a unit variance identification [UVI] constraint).

Analyzing standardized composites can give rise to sign indeterminacy, which means that absolute values for weights can be uniquely determined but not their signs. As a result, weights can generally be multiplied by –1.0 without affecting model fit, and it can happen in composite analyses that indicators have weights with signs opposite to what is expected (Henseler, 2021). In CFA, common factors are oriented in the reference variable method by fixing a single unstandardized loading to 1.0. An analogous method in composite analyses involves the specification of dominant indicators. This means that the researcher selects one indicator (i.e., the dominant indicator) that should correlate positively with its composite. If the estimated correlation between the composite and the dominant indicator is negative, then the direction of the composite is reversed by multiplying its scores by –1.0. For example, the acculturation scale is the dominant indicator for the acculturation composite in Figure 16.3. This specification guarantees that higher scores on both the scale and its composite indicate greater adoption of United States cultural elements. By default, the single indicator for the depression composite in the figure is the dominant indicator. In cases of high multicollinearity, it can happen that an indicator has a positive correlation with its composite, but at the same time a negative weight. Due to standardization, correlations between observed variables and composites equal standardized factor loadings.7
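The mechanics of standardization, unit-variance scaling, and orientation by a dominant indicator are easy to demonstrate; in the following R sketch, the data and weights are arbitrary placeholders rather than PLS-PM estimates:

set.seed(2)
# three standardized indicators for one block (illustrative data only)
X <- scale(matrix(rnorm(300), ncol = 3))
w <- c(.5, .3, .4)              # placeholder weights
C <- as.vector(X %*% w)
C <- C / sd(C)                  # rescale the composite to unit variance (UVI)
# orientation: indicator 1 is treated as the dominant indicator
if (cor(C, X[, 1]) < 0) C <- -C
# indicator-composite correlations play the role of standardized loadings
round(cor(X, C), 2)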
the next section. Weights for single indicators are an
exception—their weights are 1.0, or unit weights. For
Inner Model
example, the weight for the single indicator of depres-
sion in the figure is w8 = 1.0, which says that indicator The inner model is the structural part of Figure 16.3. It
and composite are identical. represents hypotheses about causal effects or noncausal
Weights in the PLS-PM algorithm are usually associations among emergent variables estimated by
applied to standardized variables—raw scores con- their respective sets of composites. Its parameters
verted to normal deviates—which all have the same include the correlation between the exogenous compos-
variance, or 1.0. An advantage is that weights for stan- ites, acculturation and SES, and the path coefficients
dardized variables are not influenced by differences in p1–p3 for direct effects on endogenous composites,
raw score metrics over indicators for the same compos- stress and depression. For standardized variables, coef-
ite. A disadvantage is that the basis for standardiza- ficient p1 is interpreted as a Pearson correlation because
tion—the standard deviations in a particular data set— there are no causes of stress other than acculturation,
is not invariant to sampling error (i.e., those values but coefficients p2 and p3 take the form of standardized
change over samples). Weights are usually scaled by the partial regression coefficients because depression has
computer so that all composites are standardized, too. two correlated causes, stress and SES.
In this way, variances of composites are fixed to 1.0,
which scales them (i.e., a unit variance identification
Model Parameters
[UVI] constraint).
and Degrees of Freedom
Analyzing standardized composites can give rise to
sign indeterminacy, which means that absolute values
The number of observations for a composite model
for weights can be uniquely determined but not their
analyzed equals the number of unique elements in the
signs. As a result, weights can generally be multiplied
sample correlation matrix in lower diagonal form and
by –1.0 without affecting model fit, and it can happen
excluding the 1.0s in the main diagonal. This number
in composite analyses that indicators have weights with
can be derived using a simple rule:
signs opposite to what is expected (Henseler, 2021).
In CFA, common factors are oriented in the refer-
RULE 16.1 If v is the number of observed variables
ence variable method by fixing a single unstandard-
in the model, the number of observations equals
ized loading to 1.0. An analogous method in composite
v(v – 1)/2 for a composite model
analyses involves the specification of dominant indi-
cators. This means that the researcher selects one indi-
In Figure 16.3, there are v = 8 indicators in the model,
cator (i.e., the dominant indicator) that should correlate
so the number of observations is 8(7)/2, or 28.
positively with its composite. If the estimated correla-

6 J. Henseler, personal communication, January 16, 2022.
7 J. Henseler, personal communication, January 23, 2022.
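To make the scaling and orientation conventions just described concrete, here is a minimal R sketch with made-up data (the indicator matrix and starting weights are hypothetical, not part of the chapter's analyses). It rescales a set of weights so that the composite has unit variance (the UVI constraint) and then shows the sign indeterminacy discussed above: flipping the signs of all weights leaves the magnitudes of the indicator–composite correlations unchanged, which is why a dominant indicator is needed to orient the composite.

set.seed(1)
X <- scale(matrix(rnorm(300), nrow = 100, ncol = 3))  # standardized block of 3 indicators
w <- c(.6, .3, .4)                                    # arbitrary starting weights
S <- cov(X)
w <- w / sqrt(drop(t(w) %*% S %*% w))                 # rescale: composite variance = 1.0 (UVI)
comp <- drop(X %*% w)
var(comp)                                             # 1.0 within rounding error
cor(X, comp)                                          # indicator-composite correlations
cor(X, -comp)                                         # same magnitudes, reversed signs under -w
# sign rule: keep the solution in which the dominant (first) indicator correlates positively
if (cor(X[, 1], comp) < 0) { w <- -w; comp <- -comp }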


The parameters of Figure 16.3 are defined next (Rademaker, 2022):

RULE 16.2 Parameters of a composite model where all observed variables contribute to proxies for emergent variables include:
1. the number of paths (including covariances) in the inner model;
2. the number of weights (wi) in each block of indicators minus one; and
3. the number of unique correlations (i.e., observations) in each block

If we apply Rule 16.2 to Figure 16.3(a), there are 4 paths in the inner model; a total of

(3 – 1) + (2 – 1) + (2 – 1) + (1 – 1) = 4

freely estimated weights across the four blocks of indicators in the outer model; and a total of

3(2)/2 + 2(1)/2 + 2(1)/2 + 1(0)/2 = 5

unique correlations over the four blocks for a total of 4 + 4 + 5 = 13 parameters altogether. With 28 observations available to estimate 13 parameters with no constraints, the model degrees of freedom are dfM = 28 – 13 = 15. Henseler (2021, chap. 8) described more extensive rules for computing dfM in models where some, but not all, composites are proxies for reflective constructs.

Identification Requirements

Three general requirements for identification must be satisfied by composite models like Figure 16.3:

RULE 16.3 Necessary but insufficient requirements for identification of composite models are
1. dfM ≥ 0;
2. every composite must be scaled; and
3. the orientation of every composite must be determined

The second requirement just stated about scaling is met when the computer automatically fixes the variances of composites to 1.0. The third requirement is the sign rule (Henseler, 2021). It can be met by specifying a dominant indicator for composites with multiple indicators, such as education for SES in Figure 16.3. A rule for the inner model is stated next:

RULE 16.4 The inner part of a composite model is identified if
1. every composite has at least one nonzero path with another variable or all but one of its weights are fixed; and
2. the inner model is recursive

The first part of Rule 16.4 concerns the fact that "stand-alone" composites that emit no paths are not identified unless all weights for their indicators, except for one, are fixed in advance; otherwise, a composite must be part of a nomological network, or connected with other variables in the model, to be identified (Henseler, 2021). The second part of Rule 16.4 concerns recursive structural models, which are identified. Nonrecursive inner models can also be analyzed in composite SEM, if they are identified (Chapter 19).

Figure 16.3 satisfies each of the three parts of Rule 16.3 because dfM = 15 (i.e., ≥ 0), every composite is scaled when its variance is fixed to 1.0 in the analysis, and each composite is oriented through specification of a dominant indicator. These two features, scaling and orientation, also identify the outer model in the figure. Rule 16.4 is also met because each composite of the inner model in the figure is connected with at least one other composite, and the whole inner model is recursive. Henseler (2021, chap. 4) described extended identification rules for models where some composites approximate latent variables in reflective measurement models.
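The counting rules just stated are easy to verify in a few lines of R. The sketch below simply tallies Rules 16.1 and 16.2 for Figure 16.3; the block sizes and path count are taken from the text, and nothing here depends on data.

blocks <- c(3, 2, 2, 1)   # indicators per composite: acculturation, SES, stress, depression
paths  <- 4               # inner-model paths, including the exogenous covariance
v <- sum(blocks)                          # 8 observed variables
observations <- v * (v - 1) / 2           # Rule 16.1: 8(7)/2 = 28
weights <- sum(blocks - 1)                # free weights: (3-1)+(2-1)+(2-1)+(1-1) = 4
within  <- sum(blocks * (blocks - 1) / 2) # unique within-block correlations: 3+1+1+0 = 5
df_M <- observations - (paths + weights + within)  # 28 - 13 = 15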


PARTIAL LEAST SQUARES PATH MODELING ALGORITHM

Outlined in Table 16.3 are the four basic steps in the original PLS-PM algorithm (Rigdon, 2013). At initialization (Step 1), weights in each block of composite indicators are fixed to equality, such as w1 = w2 = w3 in Figure 16.3 for the indicators of acculturation, and the weights are scaled to standardize each composite. These starting values for weights are the initial values in Step 2, which proceeds through iterations alternating between outer estimation of composites as weighted combinations of their indicators in the outer model and inner estimation of composites as weighted combinations of other composites based on the structural part of the model.

TABLE 16.3. Analysis Steps in the Partial Least Squares Path Modeling Algorithm

Step  Name                          Description
1     Initialization                Starting values for weights wi (equal within blocks)
2     Iterative steps               Continue until stopping criterion is reached (change in outer weights is less than predefined minimum)
2a    Outer estimation              Express each composite as a weighted combination of its indicators
2b    Estimate inner weights        Estimate inner weights eij for each pair of adjacent composites in the inner model
2c    Inner estimation              Regress each composite on others adjacent to it based on the eij weights from Step 2b
2d    Estimate weights              Update weights wi
3     After convergence
3a    Calculate composite scores    Use wi estimates in final iteration
3b    Estimate path coefficients    OLS regression of each endogenous composite on its presumed cause(s), given the inner model
4     Estimate standard errors or
      confidence intervals          Nonparametric bootstrapping

The inner weights eij mentioned in Table 16.3 only play a role during the iterative phase of the PLS-PM algorithm, and their values are not typically printed by the computer as part of the output. Estimates of wi and eij are updated within each iteration so that (1) the composites remain standardized and (2) prediction of endogenous composites is optimal (i.e., R2 is maximized).

Iterative estimation at Step 2 in the PLS-PM algorithm terminates when the stopping criterion is reached—see Table 16.3. That criterion is a predefined minimum change in values of weights wi between two successive iterations. If this change is less than the criterion value, iterative estimation has converged to the final set of estimates for the weights. Convergence failure can happen in the PLS-PM algorithm, but perhaps only in cases rarely encountered in practice (Henseler, 2010). Failure of iterative estimation is a much more frequent problem in CFA.

After convergence, two things happen at Step 3: Scores for each composite are computed based on the final set of wi values (Step 3a; Table 16.3). Unlike indeterminant factor scores based on common factors in CFA, there is only a single, unique way to compute weights in the PLS-PM algorithm. Some computer programs also generate loadings for the outer model. A loading estimates the Pearson correlation between an indicator and its composite. Loadings can be expressed as transformations of the outer weights based on the sample correlation matrix, and vice versa (e.g., Henseler, 2021, p. 93). In contrast, wi values are not generally interpreted as correlations; instead, they are scaled to generate standardized scores on the composites. Next (Step 3b), path coefficients pi are estimated in a series of regression analyses, given the pattern of correlations or direct effects specified for the inner model. Because the PLS-PM algorithm makes no distributional assumptions, it relies on nonparametric bootstrapping in Step 4 to estimate standard errors or confidence intervals for weights, loadings, correlations, and path coefficients.
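The loop below is a bare-bones R sketch of the iterative phase in Table 16.3 for just two composites connected by a single path, under Mode A outer estimation and factorial inner weighting. The data, block sizes, and tolerance are all invented; a real PLS-PM program handles many details omitted here.

set.seed(2)
X1 <- scale(matrix(rnorm(200), 100, 2))   # block 1 indicators (standardized, toy data)
X2 <- scale(matrix(rnorm(200), 100, 2))   # block 2 indicators
w1 <- rep(1, 2); w2 <- rep(1, 2)          # Step 1: equal starting weights within blocks
std <- function(x) drop(scale(x))         # rescale a composite to unit variance
for (i in 1:100) {
  c1 <- std(X1 %*% w1); c2 <- std(X2 %*% w2)                # Step 2a: outer estimation
  e12 <- cor(c1, c2)                                        # Step 2b: factorial inner weight
  z1 <- std(e12 * c2); z2 <- std(e12 * c1)                  # Step 2c: inner estimation
  w1_new <- drop(cor(X1, z1)); w2_new <- drop(cor(X2, z2))  # Step 2d: Mode A weight update
  if (max(abs(c(w1_new - w1, w2_new - w2))) < 1e-8) break   # stopping criterion
  w1 <- w1_new; w2 <- w2_new
}
c1 <- std(X1 %*% w1); c2 <- std(X2 %*% w2)  # Step 3a: final composite scores
p1 <- coef(lm(c2 ~ c1))[2]                  # Step 3b: path coefficient by OLS regression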


Options for Weights

Computer programs for PLS-PM analyses generally offer a choice of methods for calculating weights wi for the outer model and inner weights eij at Step 2b (Table 16.3). The choice of inner weights is not generally as important as the choice of weights for the indicators. For example, in perhaps most analyses with second-order constructs, the results should be very similar over different options for the inner weights (Noonan & Wold, 1982). Perhaps the most general choice is factorial (factor) weighting, where the inner weights are the estimated correlations between each pair of composites. They apply in both recursive and nonrecursive models, they take account of correlations between explanatory composites like SES and acculturation in Figure 16.3, and they are a good general choice when correlations within blocks of indicators for the same composite are relatively low. Esposito Vinzi et al. (2010), Henseler (2021), and Hair et al. (2022) describe additional options for inner weights.

There are two general types of weights for composite indicators, user-defined and empirical. User-specified weights or fixed weights are predefined values established before data analysis. An example is the specification of unit weights (1.0) for multiple indicators of the same composite, which equalizes their contributions to the composite. Unit weights are typically rescaled in the analysis so that the composite is standardized. This means that weights within each block of composite indicators will be equal, but their values in the final iteration may not equal 1.0. Unit weights are not affected by sampling error, and they might be preferred in very small samples with relatively low R2 values or low correlations among indicators, such as < .20 for both (Becker et al., 2013).

Three options for generating empirical weights or unknown weights estimated in the data (Esposito Vinzi et al., 2010; Henseler, 2021) are described next, and a small numerical contrast of the first two follows the list:

1. The option PLS Mode A generates correlation weights, which are generated as bivariate correlations between composites and their indicators, but they ignore intercorrelations among indicators for the same composite. Conceptually, Mode A weights are generated by regressing each indicator on its associated composite; that is, an indicator is modeled as the criterion and its predictor is the composite. Correlation weights do not maximize in-sample prediction, and they are not consistent estimators when the population model is reflective (Henseler, 2021). The R2 values generated by correlation weights may be the closest to population values compared with other weighting schemes for the outer model. Correlation weights might be preferred in small- to medium-size samples (e.g., N = 100–500) or when R2 values are low to medium (e.g., R2 = .20–.60), especially when indicators for the same composite are appreciably collinear (Becker et al., 2013).

2. The method PLS Mode B generates regression weights, which are basically partial regression coefficients that control for intercorrelations among indicators of the same composite. Conceptually, Mode B weights are generated by regressing each composite on all its indicators, which models the composite as the criterion with the indicators as predictors. Because regression weights take account of all intercorrelations within a block of indicators, their values maximize in-sample prediction at the cost of lower performance in out-of-sample prediction relative to correlation weights. In small samples, R2 values generated by regression weights can overestimate the corresponding population values by appreciable amounts (Becker et al., 2013). Regression weights can also be adversely affected by strong collinearity within blocks of indicators. For example, it can happen that some indicator weights are negative even though bivariate correlations between a composite and its indicators are all positive. This result is not a computer error, but it can complicate interpretation; specifically, if indicators represent quantities such as prices of goods or frequency counts, then negative coefficients are not very meaningful. Regression weights generated in PLS Mode B are consistent estimators for composite models like Figure 16.3 (Dijkstra, 2017). Regression weights generally perform as well as correlation weights in larger samples, such as N > 500 or so, or when R2 > .60 or so, but regression weights might be preferred when correlations among indicators are relatively low (Becker et al., 2013).

3. The method PLS Mode BNNLS, implemented in the R package cSEM, is an advanced option that avoids negative weights when all bivariate correlations among indicators and their composite are positive. It is based on nonnegative least squares (NNLS), a type of constrained least squares estimation that generates positive regression weights for sets of variables that are positively correlated. Dijkstra and Henseler (2011) applied NNLS to construct best-fitting proper indices (BFPI) in composite analyses that (a) take into account nonlinear relations between blocks of variables and (b) generate indicator weights that are sign-restricted (i.e., > 0) in predefined ways.
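The numerical contrast promised above: for one hypothetical block of standardized indicators and a provisional composite, Mode A weights are bivariate correlations, whereas Mode B weights come from a multiple regression of the composite on all indicators in the block. Everything here is invented for illustration.

set.seed(3)
X <- scale(matrix(rnorm(300), 100, 3))   # one block of three indicators (made-up data)
comp <- drop(scale(X %*% c(1, 1, 1)))    # provisional composite for illustration
modeA <- drop(cor(X, comp))              # correlation weights: collinearity ignored
modeB <- coef(lm(comp ~ X))[-1]          # regression weights: collinearity controlled
# With strong collinearity in X, some modeB weights can turn negative even though
# every bivariate correlation in modeA is positive.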


You should know that PLS Mode A and Mode B are merely different weighting schemes that generate, respectively, correlation weights versus regression weights for the outer model (Rigdon, 2013). In both modes, composites as proxies for concepts are analyzed. Unfortunately, Mode A is too often equated with "reflective measurement" and Mode B with "formative measurement" in some widely cited publications about the PLS-PM method published roughly before 2015—see Henseler (2021, p. 96) for an extensive list. It is also true that specification of Mode A is represented in the graphical user interface of some computer tools, such as SmartPLS, with arrows that point from circles, which symbolize concepts, to rectangles that represent indicators, a visual display suggesting reflective measurement (e.g., Figure 13.2(a) but without error terms). In contrast, specification of Mode B in SmartPLS is represented graphically in the same way except that the arrows are reversed; that is, they point from rectangles to circles, which suggests formative measurement (e.g., Figure 13.2(b) but with no disturbance).

But conflating Mode A with reflective measurement and Mode B with formative measurement in the PLS-PM algorithm is not justified for the reasons summarized next: Model equations are not simultaneously estimated in Mode A as they are when common factor models are analyzed in CFA (Rigdon, 2013). It is total variance, not common variance, that is analyzed in both Mode A and Mode B. That is, neither weighting scheme removes unique indicator variance from the data matrix. In contrast, unique variance is removed from the data matrix when common factor models are analyzed in CFA. The real difference between Mode A and Mode B in the PLS-PM algorithm is the choice between, respectively, analyzing correlation-weighted composites versus regression-weighted composites (Rigdon, 2016). The specification of Mode A in PLSc, a consistent estimator when the population model is reflective, provides a closer simulation of reflective measurement due to disattenuation for measurement error.

PLS-PM ANALYSIS OF THE COMPOSITE MODEL

For analysis 3 in Table 16.2, standardized scores for N = 983 cases that exactly matched the correlation matrix for the Shen and Takeuchi (2001) data in Table 16.1 were generated with the R package semTools (Jorgensen et al., 2022). In the same analysis, I show you how to compute the VIF for each block of multiple indicators (3 sets in total) in standard regression analyses. In analysis 4, the composite model in Figure 16.3 was fitted to the generated data just described in the R package cSEM. The method is the basic PLS-PM algorithm, the estimator is OLS, and PLS Mode A weights were specified to enhance the possibility that the results generalize to other samples. The use of Mode A is not viewed in this analysis as simulating reflective measurement.

The cSEM package automatically checks for solution admissibility, including whether estimation converged, all absolute standardized factor loadings are < 1.0, and all model-implied correlation matrices are at least positive semidefinite, of which positive definite matrices (Chapter 4) are a subset, except that a positive semidefinite matrix has at least one eigenvalue that is zero while all the rest are positive (i.e., there is no matrix inverse). No solution admissibility problems were detected by cSEM.8

The value of the SRMR for the whole model is .025. In the Schuberth et al. (2018) nonparametric model fit test, the computer transforms the data so that they exactly match the model-implied correlation matrix. Next, bootstrapped samples are selected from the transformed data, say, 1,000 times, and the 95th percentile in the bootstrapped sampling distribution for the SRMR is determined. In this analysis, the 95th percentile for the SRMR is .038. Because the observed value of the SRMR, or .025, is less than the critical value, or .038, the model passes the global fit test at the .05 level. The cSEM program also computes other global fit statistics that measure the distance between the sample and predicted correlation matrices, but these test statistics are not described here—see Henseler (2021, chap. 6) for more information.

8 I also fitted the same model to the data in SmartPLS 3 and ADANCO 2.3 with comparable analysis options. Parameter estimates are equal within very slight rounding error over all three programs, and results of bootstrapped tests for overall model fit in cSEM and ADANCO are likewise the same (this method was not available in SmartPLS 3).
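The logic of the bootstrap fit test just described can be sketched in a few lines of R; the cSEM package automates all of it, so the following is only illustrative. Here R_hat stands for a model-implied correlation matrix, and fit_fun is a placeholder for whatever routine refits the model to a data set and returns its implied matrix.

srmr <- function(S, Sigma) {
  d <- S[lower.tri(S)] - Sigma[lower.tri(Sigma)]  # discrepancies among unique correlations
  sqrt(mean(d^2))
}
boot_srmr_test <- function(X, R_hat, fit_fun, B = 1000) {
  S <- cor(X)
  # transform the standardized data so its correlation matrix equals R_hat exactly
  Xt <- scale(X) %*% solve(chol(S)) %*% chol(R_hat)
  stat <- numeric(B)
  for (b in 1:B) {
    Xb <- Xt[sample(nrow(Xt), replace = TRUE), ]
    stat[b] <- srmr(cor(Xb), fit_fun(Xb))  # refit the model in each bootstrap sample
  }
  c(observed = srmr(S, R_hat), crit95 = quantile(stat, .95))
}

The key step is the transformation by solve(chol(S)) %*% chol(R_hat), which forces the correlation matrix of the transformed scores to match the model-implied matrix before resampling.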


The cSEM package can optionally print values of parametric global fit statistics of the kind often reported in covariance-based SEM. Examples include a model chi-square with dfM, the RMSEA, and the CFI, among others. But you should know that the suitability of these parametric global fit statistics for a composite model is unknown; thus, researchers should avoid interpreting them in the same way as in CFA analyses with continuous indicators. For example, the chi-square for the composite model in this analysis is based on the difference between the sample and predicted correlation matrices, not the corresponding covariance matrices. The standard model chi-square test in covariance-based SEM assumes unstandardized variables, so the p value for the chi-square statistic in this analysis with standardized variables could be meaningless.

It is just as critical in composite SEM to evaluate local model fit by inspecting the residuals as is done in CFA and related techniques. The cSEM package generates the model-implied correlation matrix as an option, but it does not print correlation residuals, or differences between the sample and predicted matrices. In the script file for analysis 4 in Table 16.2, I show you how to subtract the predicted correlation matrix computed in cSEM from the sample correlation matrix, which generates the correlation residuals. None of these residuals exceeded .10 in absolute value, and their values ranged from –.053 to .052 (see the output file), so local fit seems acceptable. Values of parameter estimates and their bootstrap standard errors appear reasonable, and high multicollinearity within blocks of indicators for the same composite is not a problem for these data (Table 16.1). Given all these results, the composite model in Figure 16.3 is retained. Parameter estimates for this analysis are reported below.

HENSELER–OGASAWARA SPECIFICATION AND ML ANALYSIS

A limitation of the analysis just described is that the composite model in Figure 16.3 was fitted to a correlation matrix, not a covariance matrix, which may complicate the generalizability of the results to samples with different variances on the original variables. A second drawback is that special software is needed (i.e., cSEM), but that issue per se is not a major obstacle. Presented in Figure 16.4 is the HO specification of the composite model in Figure 16.3 that permits the model to be fitted to a covariance matrix with the ML method or other estimators (GLS, WLS, etc.) using a standard SEM computer tool. The number of free parameters and the value of dfM are unaffected by HO specification. For example, we determined that dfM = 15 for the composite model in Figure 16.3; thus, dfM = 15 for the HO specification of the same model, too.

The model diagram for HO specification is presented in Figure 16.4. This diagram might seem confusing at first glance. For example, arrows point from the substantive composites for acculturation, SES, stress, and depression to their indicators, which does not show, in an intuitive way, that these composites are weighted combinations of their indicators. Instead, relations between substantive composites and their indicators are represented in the figure as loadings instead of weights, which parallels the specification of common factor models.

Another feature of Figure 16.4 is the presence of a second set of composites labeled Ex1–Ex4, which represent excrescent variables. They are linear combinations that (1) capture remaining indicator variance not extracted by their corresponding substantive composites and (2) are unrelated to all other variables in the model, including each other (Schuberth, 2021). For example, excrescent variables Ex1 and Ex2 in Figure 16.4 account for variation among the three indicators of acculturation that does not overlap with their substantive composite. The number of excrescent variables required for each set of indicators is one less than the number of variables in that set. Thus, a single excrescent variable is needed for each pair of indicators for the stress and SES composites—respectively, Ex3 and Ex4—and none is required for the single indicator of depression. Excrescent variables are needed for correct model parameterization, but analysis results for them have no meaningful interpretation. Although not explicitly shown in the diagram, the error variances for all indicators are fixed to zero except for the single indicator of depression, which is fixed to equal 1.369, just as in the partial SR model in Figure 16.1.

What is more straightforward is that the HO specification depicted in Figure 16.4 can be directly translated into the syntax for a standard SEM tool—Mplus, EQS, Amos, LISREL, lavaan, among others—and fitted to the data without first standardizing the raw scores. Model parameters are counted in the conventional way, too, as the variances and covariances of exogenous variables and direct effects on endogenous variables when means are not analyzed (Rule 7.1). For Figure 16.4, these include

1. 8 variances (4 for excrescent variables [Ex1–Ex4], 2 for exogenous substantive composites [acculturation, SES], and 2 for endogenous composites [stress, depression]);
2. 1 covariance between exogenous composites;
3. 9 loadings of endogenous variables (4 for indicators of substantive composites, 5 for indicators of excrescent variables); and
4. 3 path coefficients for direct effects on endogenous composites,

which altogether equals 21.
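To preview how such a specification looks in software, here is a schematic lavaan-style sketch of the HO pattern for a single block (the three acculturation indicators). The variable names are hypothetical, the echelon pattern of zero loadings shown is one common way to identify the excrescent variables, and the chapter's actual syntax files should be consulted for the full model.

library(lavaan)
ho_block <- '
  # emergent variable spans the block; dominant indicator listed first, loading fixed to 1
  Acc =~ 1*acc_scale + percent_us + generation
  # excrescent variables absorb the remaining block variance (echelon pattern of 0 loadings)
  Ex1 =~ 0*acc_scale + NA*percent_us + generation
  Ex2 =~ 0*acc_scale + 0*percent_us + NA*generation
  # indicator error variances fixed to zero
  acc_scale ~~ 0*acc_scale
  percent_us ~~ 0*percent_us
  generation ~~ 0*generation
  # unit variances; excrescent variables unrelated to all other variables
  Acc ~~ 1*Acc
  Ex1 ~~ 1*Ex1
  Ex2 ~~ 1*Ex2
  Ex1 ~~ 0*Acc
  Ex2 ~~ 0*Acc
  Ex1 ~~ 0*Ex2
'
# fit <- sem(ho_block, sample.cov = S, sample.nobs = 983)  # within the full model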


[Figure 16.4 appears here.]

FIGURE 16.4. Composite model of stress and depression as a function of acculturation and socioeconomic status represented in Henseler–Ogasawara specification with emergent variables that approximate formative concepts and excrescent variables (Ex) that account for all remaining variation after extraction of emergent variables. Error terms for all indicators are fixed to equal zero except for the single indicator of depression, for which error variance is fixed to 1.369. Variances for all emergent and excrescent variables are fixed to equal 1.0.

When fitted to a covariance matrix for 8 variables, there are 8(9)/2 = 36 observations, so dfM = 36 – 21 = 15 for Figure 16.4, which is the same value as for the composite model in Figure 16.3.

Reported in Table 16.2 for analysis 5 are the syntax and output files for fitting Figure 16.4 to a covariance matrix assembled from the data in Table 16.1 in lavaan. The estimator is default ML. The analysis converged to an admissible solution, although it was necessary to specify start values of zero for loadings of the percent of life and generation status indicators on the acculturation composite—see the syntax file for this analysis. Values of selected global fit statistics are reported next:

χ²ML(15) = 19.334, p = .199
RMSEA = .017, 90% CI [0, .037]
CFI = .997, SRMR = .017


The model passes the chi-square test for its fit to the sample covariance matrix, and values for approximate fit indexes are not troublesome. None of the standardized residuals for differences between observed and predicted covariances is significant at the .05 level, and the largest absolute correlation residual is .051—see the output file for analysis 5 in Table 16.2. Given all these results, the model in Figure 16.4 is retained.

Listed in the second column of Table 16.4 are OLS estimates of loadings for the indicators in Figure 16.3 analyzed with the PLS-PM method in the cSEM package (analysis 4, Table 16.2). All variables are standardized in this analysis, so the unstandardized and standardized solutions are the same. The third column of the table lists ML estimates for the same indicators; in Figure 16.4 they are analyzed in lavaan with the default ML estimator (analysis 5, Table 16.2). Distinct unstandardized and standardized solutions are available in this analysis with covariances. Over both sets of results, all indicators have relatively high correlations with their respective composites. For example, the range of correlations for composite indicators of acculturation is .750–.895 in OLS estimation, and the range of correlations for the same indicators of acculturation is .755–.991 in ML estimation. Unstandardized loadings are available only for the ML estimator, and these results are consistent with expectations. For example, education is the reference indicator for the SES composite (see Figure 16.4). Its unstandardized loading is fixed to 1.0, and there is no standard error. Only the unstandardized estimates in Table 16.5 would be directly comparable over samples with appreciably unequal variances on the same measures.

Listed in the second column of Table 16.5 are the OLS estimates for structural (inner model) parameters (i.e., path coefficients) of Figure 16.3, and in the third column of the table are the corresponding ML estimates for Figure 16.4. Only the ML estimates feature distinct unstandardized and standardized solutions. Values of standardized results are similar across both estimators, so only ML estimates are described next. Stress is expected to increase by .122 standard deviations for an increase in acculturation of one standard deviation. Thus, respondents of Chinese descent living in the United States who reported greater adaptation to that culture reported higher levels of stress.

TABLE 16.4. Ordinary Least Squares (OLS) and Maximum Likelihood (ML) Estimates of Loadings for Composite Indicators in a Model of Stress and Depression as Functions of Acculturation and Socioeconomic Status

Indicator                 OLS a           ML b
Acculturation
  Acculturation scale     .895 (.040)     1.000 (—) .991
  Generation status       .750 (.070)     .519 (.051) .755
  Percent life in U.S.    .858 (.042)     .507 (.086) .547
SES
  Education               .644 (.169)     1.000 (—) .728
  Income                  .874 (.167)     1.173 (.196) .811
Stress
  Interpersonal           .794 (.020)     1.000 (—) .728
  Job                     .864 (.013)     1.440 (.122) .899
Depression
  SCL90D                  1.000 (—)       1.000 (—) .948

a Standardized (bootstrap standard error). b Unstandardized (standard error) standardized.


TABLE 16.5. Ordinary Least Squares (OLS) and Maximum Likelihood (ML) Estimates of Structural Parameters for a Model of Stress and Depression as Functions of Acculturation and Socioeconomic Status

Parameter                   OLS a           ML b
Direct effects
  Acculturation → Stress    .117 (.028)     .076 (.020) .122
  SES → Depression          –.121 (.032)    –.190 (.046) –.129
  Stress → Depression       .504 (.023)     .839 (.064) .535
Covariance
  Acculturation ↔ SES       .275 (.045)     2.447 (.367) .289
Disturbance variances
  Stress                    —               1.000 (—) .728
  Depression                —               1.440 (.122) .899

Note. Values of R2 for stress and depression in OLS estimation are, respectively, .014 and .268, and the corresponding values in ML estimation are, respectively, .015 and .298.
a Standardized (bootstrap standard error). b Unstandardized (standard error) standardized.

Depression is estimated to decrease by .129 standard deviations, given an increase in SES of 1 standard deviation while controlling for stress. When holding SES constant, depression should increase by .535 standard deviations, given an increase in stress of 1 standard deviation. Exercise 4 asks you to compare standardized estimates for the indirect effect of acculturation on depression through stress in the output for analysis 4 of Figure 16.3 (i.e., OLS) with the corresponding estimates in the output for analysis 5 of Figure 16.4 (i.e., ML).

To sum up and offer perspective, I believe that estimates for the partial SR model in Figure 16.1 are suspect because they assume reflective measurement, which is generally implausible for the reasons explained. But the composite SEM method of CCA offers a meaningful alternative in this case. One approach fits a composite model to a correlation matrix in the PLS-PM algorithm with the added capabilities to test global model fit and evaluate local fit through the inspection of residuals. The second approach fits a composite model specified so that it can be analyzed with standard SEM computer tools and fitted to a covariance matrix. Of the two methods, the second approach just mentioned based on HO specification is quite accessible to researchers familiar mainly with standard SEM software. It also offers a bridge between the analysis of composites as proxies for emergent variables in formative measurement models and the analysis of common factors as proxies for latent variables in reflective models. For example, both types of proxies can be analyzed in the same model, given a substantive basis to do so—see Henseler (2021) for examples. A family reunion indeed.

SUMMARY

Rigdon et al.'s (2017) perspective on the comparison of SEM techniques that model concepts as common factors versus composites provides an excellent summary for this chapter: There is no single modeling technique that works in all situations, so the question, "Which one is best?" is pointless. The choice between the two approaches may also be of secondary importance compared with the decisions about research design, sampling, measurement, and data integrity. This is because no statistical technique, however complex or sophisticated, can fix up bad data. Avoid falling into mechanistic, ritualized, or cookie-cutter use of either approach that relies on unsubstantiated "golden rules" about cut-off values for fit statistics or effect sizes. There are relatively few occasions when a single statistical model is consistent with the data to the exclusion of all other


models. This is the lesson of equivalent models but now extended to multiple alternative techniques such as conducting SEM with common factors versus composites. Choose the method with the strongest theoretical justification. For example, a research problem grounded in classical measurement theory would point to traditional SEM as the likely best choice, but work based on other assumptions about measurement, such as synthesis theory, is more consistent with composite SEM. In more exploratory research where the strict assumptions of traditional SEM are unlikely to hold, composite SEM techniques allow for greater flexibility. Neither technique is a silver bullet, or a magic cure for complex research problems. Above all, focus on the actual phenomenon to be studied and "don't let the modeling get in the way of the learning" (Rigdon et al., 2017, p. 12). Part IV deals with special problems and more advanced analyses. Best practices are covered in the last chapter, which may be the most important one in the book.

LEARN MORE

Henseler (2021) gives a modern perspective on composite-based SEM, the book by Latan and Noonan (2017) deals with advanced applications of PLS-PM, and Rigdon et al. (2017) compare composite-based and common-factor-based approaches to statistical modeling and offer suggestions for applied researchers about their use.

Henseler, J. (2021). Composite-based structural equation modeling: Analyzing latent and emergent variables. Guilford Press.

Latan, H., & Noonan, R. (Eds.). (2017). Partial least squares path modeling: Basic concepts, methodological issues and applications. Springer.

Rigdon, E. E., Sarstedt, M., & Ringle, C. M. (2017). On comparing results from CB-SEM and PLS-SEM. Journal of Research and Management, 39(3), 4–16.

EXERCISES

1. Verify that the relative contribution of common variance versus unique variance is 4:1 for a composite with n = 4 equally weighted elements (see Topic Box 16.1).

2. Prove that Figure 16.1 is identified.

3. Show that dfM = 16 for Figure 16.1.

4. Compare the standardized estimates for the indirect effect of acculturation on depression through stress in analysis 4 for Figure 16.3 with those for the same effect in analysis 5 for Figure 16.4 (see Table 16.2).

Part IV

Advanced Techniques

17

Analyses in Small Samples

Analysis of complex statistical models in small samples is challenging when using multivariate statistical techniques, and SEM is no exception. One reason is that most estimation and inference methods in SEM are asymptotic, or they assume large random samples. Applied in small samples, such as N < 200, iterative methods can fail to converge, solutions can be inadmissible due to Heywood cases or other anomalous results with no meaningful interpretation, or parameter estimates can be very biased. Sometimes researchers who are reluctant to use SEM in smaller samples opt to use simpler techniques, but small-sample bias can be even worse in procedures like multiple regression or manifest-variable path analysis with no control for measurement error (Rosseel, 2020). Discussed in this chapter are ways to cope when applying SEM in small samples. Suggestions are offered for specification, parameter estimation, and evaluation of model fit. Some of these options have been available for some time, but others, including special estimation strategies and test statistics for small samples, are newer and perhaps less familiar to applied researchers. Their application is demonstrated in the analysis of a common factor model in a small sample.

SUGGESTIONS FOR ANALYZING COMMON FACTOR MODELS

Iterative estimation failure or inadmissible solutions are more likely in analyses of reflective measurement models where some factors have only two indicators or the sample size is less than 100–150 cases or so (Boomsma, 1985; Gerbing & Anderson, 1987). The options for analyzing models under these conditions are listed next, followed by a discussion of each:

1. Use indicators with good psychometric characteristics that will also have standardized loadings > .70 or so for continuous indicators. Such models are less susceptible to Heywood cases (Wothke, 1993). This is good advice for any sample size, but it is especially so in small samples.

2. Imposing equality constraints on the unstandardized loadings of indicators for the same factor may help to prevent inadmissible solutions (Marsh & Hau, 1999). This tactic makes more sense if the indicators share the same metric; otherwise, fixing loadings for indicators of the same factor to nonzero constants that reflect differences in their standard deviations is an option.

3. Consider local estimation methods that analyze a single equation at a time such as model-implied instrumental variables using two-stage least squares (MIIV-2SLS) (Bollen, Fisher, Giordano, et al., 2022) (Chapter 14). Because the MIIV-2SLS estimator is not iterative, there are no convergence problems. The method does not rely on normality assumptions, and it might better isolate effects of specification error than simultaneous methods such as ML.

4. Estimation for the measurement and structural parts of the model is decoupled in the structural-after-measurement (SAM) approach (Rosseel & Loh, 2021). First, parameters in the measurement model are


estimated. Next, structural parameters are estimated while keeping fixed the measurement parameters from Step 1. This two-step estimation may better isolate specification error in the measurement model from estimation in the structural model compared with simultaneous methods.1

5. It is generally easier to test more complex models in smaller samples using composite SEM. This is because composite-based methods can produce results for complex models that would require prohibitively large samples in traditional SEM (Willaby et al., 2015). But results in composite SEM can be very biased in small samples, too, so they are not foolproof (Kock & Hadaya, 2018).

6. Analyzing item-level data in small samples is challenging. This is because models with items as indicators can be relatively large when analyzed with methods for ordinal, not continuous, data. An alternative is parceling, which involves making aggregates of two or more items, or parcels, by summing or averaging the items. Next, the parcels are specified as indicators, which simplifies the model. Finally, estimation methods for continuous, not categorical, data are used. Parceling also tends to reduce error variance in the analysis, but its assumptions are quite restrictive (Rioux et al., 2020). There are also many decision points, or researcher degrees of freedom, where results in parceling can be appreciably changed at each point (Sterba & Rights, 2023).

Structural-After-Measurement Approach

The basic idea behind the SAM approach, or estimation of structural parameters after measurement parameters, is not new. For example, Burt (1976) described limited-information estimation of measurement parameters in the first analysis step that are held constant in the second step when estimating structural parameters as a way to avoid interpretational confounding. In the PLS-PM algorithm, the weights for the outer (measurement) model are estimated before the parameters of the inner (structural) model (Table 16.3). In factor score regression (FSR), common factors in the measurement model are replaced by factor scores, which means that variables in the structural model are observed (i.e., they are weighted linear combinations of the indicators). After correcting the covariance matrix of factor scores so that it more closely estimates the covariance matrix for latent variables, the structural model is then estimated with standard regression techniques (Devlieger et al., 2016).

In Step 1 of the SAM method implemented in lavaan (Rosseel et al., 2023), by default the computer estimates a separate single-factor model for each block of indicators in the measurement model. Thus, misspecification in one block does not affect estimates of factor loadings or error variances in another block. If a block has insufficient indicators for a single-factor model without constraints (< 3 indicators) or if there are error covariances between indicators in different blocks, the researcher can specify a single measurement block that includes two or more common factors and all their indicators.

In Step 2, the option for global SAM holds measurement parameters constant while the computer estimates structural parameters. Option local SAM means that estimates of measurement parameters in Step 1 are combined with summary statistics for the indicators to estimate means and covariances for latent variables, which are used in Step 2 to estimate structural parameters. In both options, standard errors are adjusted for estimation over two steps. In computer simulations, local SAM outperformed both global SAM and conventional simultaneous estimators in small samples (Rosseel & Loh, 2021).
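A minimal sketch of a SAM analysis follows, assuming a recent lavaan version that provides the sam() function for this method; the model syntax and data object are hypothetical placeholders.

library(lavaan)
model <- '
  F1 =~ x1 + x2 + x3   # measurement blocks estimated first (Step 1)
  F2 =~ y1 + y2 + y3
  F2 ~ F1              # structural part estimated second (Step 2)
'
fit_local  <- sam(model, data = dat, sam.method = "local")   # local SAM
fit_global <- sam(model, data = dat, sam.method = "global")  # global SAM
summary(fit_local)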
Parceling in Small Samples

Suppose that a 120-item questionnaire consists of three nonoverlapping sets of 40 items each. Each set of items is expected to measure a distinct domain. The sample size is N = 150. A model analyzed in categorical CFA would have 3 factors each with 40 indicators (items) for a total of 120. The analysis of such a large model in a small sample could lead to imprecise estimates or technical problems in the analysis. To cope, the researcher divides each set of 40 items into 4 parcels with 10 items each. Items in each parcel are summed.2 Next, the total scores replace the individual items as indicators in a 3-factor, 12-indicator (4 parcels/factor) CFA model.

1 Two-step modeling is a model evaluation strategy to detect specification error in the measurement model at Step 1 before the full model is analyzed at Step 2 (Chapter 15). In contrast, two-step estimation is an estimation strategy that assumes the measurement model is based on substantive theory and fits the data (Rosseel & Loh, 2021).

2 Little (2013) recommended averaging item scores as opposed to summing them, if the number of items per parcel is unequal; otherwise, analyzing total scores over parcels is fine.


If distributions of all parcels are normal, then default ML could be used to analyze the data; otherwise, a robust ML estimator is an option.

Rioux et al. (2020) described other potential benefits to parceling in addition to reducing the number of indicators: Total scores over item sets generally have more numerous, smaller in width, and more equal intervals that better approximate a continuous measurement scale than responses to individual items, which may be coded as integers that represent discrete points along a Likert scale (e.g., 0, 1, 2 for, respectively, disagree, neutral, agree). Total scores tend to be more precise (reliable) than responses to individual items. Ratios of common-to-unique variances and levels of communality are generally higher for parcels than for items, and these characteristics of parcels can help to avoid estimation problems in small samples. Analyzing parcels can reduce the potential impact of correlated errors or loadings of indicators on multiple factors that arise due to sampling error. For example, when items with correlated errors are aggregated, the effect of this correlation may be reduced and more completely isolated in the error term for the parcel.

But there are significant drawbacks, too. One is that there are many ways to parcel items, including random assignment and grouping items based on rational grounds, among others, and the choice can affect the results. Parceling is not recommended if unidimensionality of items within each parcel cannot be assumed. This is because analyzing total scores for sets of items that are really multidimensional can seriously distort the results so that a misspecified model could nevertheless fit the data. The most straightforward option is to use well-established measures with item sets that are known in advance to be unidimensional, but these ideal conditions are relatively infrequent. An alternative is to empirically check unidimensionality by fitting a series of single-factor CFA models for the items within each parcel. If the model is retained, then aggregating the items that make up a parcel is supported. There are two problems with such empirical checks: (1) Data generated by multidimensional population models can be consistent with a single-factor model by sampling error. (2) Respecifying parcels made by adding or dropping items capitalizes on sampling error, especially if a CFA model with modified parcels is analyzed using the same data—see Matsunaga (2008), Rioux et al. (2020), and Sterba and Rights (2023) for more information.

ANALYSIS OF A COMMON FACTOR MODEL IN A SMALL SAMPLE

In Chapter 14, we fitted the two-factor CFA model in Figure 17.1 to the data in Table 14.6 from Sabatelli and Bartle-Haring (2003), who administered measures of family-of-origin experiences and marital adjustment in a sample of N = 103 women. The solution was inadmissible due to a Heywood case; specifically, the error variance for the intimacy variable is negative, –39.892; see the output file for analysis 3 in Table 14.1.

Listed in Table 17.1 are two reanalyses of the model in Figure 17.1, both of which generate admissible solutions. Analysis 1 features the imposition of a constraint on the unstandardized loadings for both indicators of marital adjustment. The reference variable is marital problems, but also constraining the loading for intimacy to a = 1.0 (see the figure) is not ideal because these two variables do not have the same metric.

[Figure 17.1 appears here.]

FIGURE 17.1. Two-factor model of family-of-origin experiences and marital adjustment analyzed in a small sample (N = 103).


TABLE 17.1. Analyses, Script Files, and Packages in R for a Two-Factor Model of Family-of-Origin Experiences and Marital Adjustment

Analysis                                              Script files             R packages
1. Two-factor model, proportionality-constrained      sabatelli-loadings-ml.r  lavaan, semTools
   loadings, ML estimation, alternative test
   statistics based on correlation residuals
2. Two-factor model, 2SLS estimation with             sabatelli-miiv-2sls.r    MIIVsem, lavaan
   model-implied instruments

Note. Output files have the same names except the extension is ".out."

For example, the standard deviations for intimacy and problems are, respectively, 22.749 and 32.936 (Table 14.6), and their ratio is 22.749/32.936 = .691. Thus, the loading for intimacy is fixed in analysis 1 to a = .691 to reflect the proportionate difference in their standard deviations. The estimator is default ML. In analysis 2, Figure 17.1 is fitted to the data with no constraints in the MIIV-2SLS estimator of the R package MIIVsem (Fisher et al., 2021). How the method locates model-implied instruments is explained in Topic Box 17.1.

Analysis 1 with constrained loadings for the two indicators of marital adjustment in Figure 17.1 converged to an admissible solution. Exercise 1 asks you to verify that dfM = 5 when the loading for intimacy is fixed to equal a nonzero constant. It is no surprise that the power of the chi-square test estimated in the MacCallum–RMSEA method is only .11, given the small sample size, or N = 103.3 Thus, the likelihood is only about 10% over random samples that a population model without perfect fit would be detected by the chi-square test. The minimum sample size required for power ≥ .90 is about 1,320 cases, or over 10 times the actual sample size (N = 103). Reported next are selected values of global fit statistics (see the output file for analysis 1, Table 17.1):

χ²ML(5) = 8.449, p = .133
RMSEA = .082, 90% CI [0, .174]
CFI = .987, SRMR = .045

Although the model passes the chi-square test, its power is very low as noted. Neither the relatively high value for the RMSEA (.082) nor the width of its 90% CI is unexpected for this statistic in small samples. Values for the CFI and SRMR are not obviously problematic. We will reserve judgment on model fit until other results for this analysis (i.e., the residuals) are considered. Exercise 2 involves rerunning the analysis just described, except to fix the loadings for both indicators of marital adjustment to 1.0.

The estimator for analysis 2 in Table 17.1 is MIIV-2SLS. No global fit statistics are computed by the MIIVsem package, but Bollen (2019, p. 17) describes options for overidentification tests of the whole model, including the vanishing tetrad test (Chapter 14). Instead, the MIIVsem package computes the Sargan test for each set of model-implied instruments in Table 17.2. Sargan test statistics approximate central chi-square distributions with degrees of freedom that equal the number of instruments minus one; thus, df = 2 for each result in the table. The null hypothesis is that each set of multiple instruments is uncorrelated with the error term for the equation. Failure to reject the null hypothesis for the Sargan test suggests that the equations in Table 17.2 are consistent with the model in Figure 17.1. See Topic Box 17.1 for more information.

Listed in the second column of Table 17.3 are the parameter estimates with constrained factor loadings in default ML estimation (analysis 1), and reported in the third column are results for the MIIV-2SLS estimator (analysis 2). Note that (1) standardized estimates are not computed in the version of the MIIVsem package used in this analysis. Also, (2) standard errors are reported in Table 17.3 for MIIV-2SLS estimates only for indicators that are not reference variables in Figure 17.1. The MIIVsem package can optionally generate bootstrap standard errors, but nonparametric bootstrapping does not apply when the input data are summary statistics (Table 14.6) instead of a raw data file. Bootstrap standard errors in small samples can be very inaccurate, though, so their absence here may not be a great loss.

3 Other parameters: ε0 = 0, ε1 = .05, dfM = 5, α = .05.
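For readers who want to reproduce the power figures just cited, the semTools package implements the MacCallum–RMSEA method through its findRMSEApower() and findRMSEAsamplesize() functions; the sketch below uses the parameter values from the footnote, and results should be close to the values reported in the text.

library(semTools)
findRMSEApower(rmsea0 = 0, rmseaA = .05, df = 5, n = 103, alpha = .05)  # about .11
findRMSEAsamplesize(rmsea0 = 0, rmseaA = .05, df = 5, power = .90)      # about 1,320 cases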


TOPIC BOX 17.1

Model‑Implied Instrumental Variables


The R package MIIVsem automatically respecifies Figure 17.1 to replace each factor with its reference variable minus the error term (i.e., latent-to-observed transformation, or L2O; Bollen, Fisher, Giordano, et al., 2021). For example, equations for both indicators of the marital adjustment factor are listed next:

Problems = Marital + EPr
Intimacy = a × Marital + EIn    (17.1)

where a is the unstandardized loading for intimacy, which is a free parameter (the loading for problems is fixed to 1.0), and EPr and EIn are the indicator error terms. The marital factor is replaced by its scaling indicator minus the corresponding error term, or

Marital = Problems – EPr    (17.2)

Next, substituting Equation 17.2 for the factor in Equation 17.1 generates the L2O respecification for intimacy, or

Intimacy = a × Problems + EIn – a × EPr    (17.3)

The predictor in Equation 17.3, marital problems, is correlated with the composite error term for intimacy (i.e., EIn – a × EPr), which rules out OLS regression but not 2SLS regression with instruments. Figure 17.1 implies a total of three instruments for Equation 17.3, or all indicators of the family-of-origin (FOE) factor. You can apply the tracing rule to Figure 17.1 to verify that FOE indicators are all independent of errors for marital adjustment indicators, or both EPr and EIn in Equation 17.3. Listed in Table 17.2 are the model-implied instruments for each indicator in the model that is not a reference variable. The MIIVsem package reports the Sargan test for each set of multiple instruments in the table. A significant result suggests that at least one model-implied instrument fails to meet the requirements for an instrumental variable, which suggests a specification error. Applied in a small sample, though, the Sargan test may have relatively low power to detect errors. See Jin (2022) for other cautions about the Sargan test.

TABLE 17.2. Model-Implied Instruments for Nonscaling Indicators in a Two-Factor Model of Family-of-Origin Experiences and Marital Adjustment and Sargan Test Results
Criterion Predictor Model-implied instruments Sargan df p
Intimacy Problems Father, Mother, Father-Mother 4.980 2 .083

Mother Father Intimacy, Problems, Father-Mother 1.763 2 .414

Father-Mother Father Intimacy, Problems, Mother 3.590 2 .166

Note. All results computed by MIIVsem; N = 103.
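In practice, the instrument search and the 2SLS estimation are each a single function call. The sketch below assumes MIIVsem's lavaan-style model syntax, with hypothetical variable names matching Figure 17.1 and S standing for the sample covariance matrix in Table 14.6.

library(MIIVsem)
model <- '
  Family  =~ father + mother + father_mother
  Marital =~ problems + intimacy
'
miivs(model)                                            # list model-implied instruments (Table 17.2)
fit <- miive(model, sample.cov = S, sample.nobs = 103)  # MIIV-2SLS estimates with Sargan tests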


Unstandardized factor loadings in Table 17.3 are generally similar over the two estimators, ML and 2SLS. There are somewhat greater differences over the two sets of results for error variances and factor variances. This is especially true for the intimacy factor, for which the ML error variance estimate (106.214) is about twice as large as the 2SLS estimate (50.871). The estimated factor correlation in the ML solution is .469, so better marital adjustment is associated with higher levels of respect and acceptance between dyads in early family experiences. For Exercise 3, you are asked to compute the factor correlation from the factor covariance of 157.495 in the 2SLS estimates in the table.

Reported in Table 17.4 are the correlation residuals for each estimator. Residuals in the top part of the table are from ML estimation, and results in the bottom part of the table are from 2SLS estimation. An absolute correlation residual in ML estimation, shown in boldface in the table, just exceeds .10 for the problems and father indicators of, respectively, marital adjustment and FOE. Specifically, the model overestimates the correlation in this pair of indicators by about .10 (the sample value is .265; Table 14.6). The corresponding standardized residual for this indicator pair, –3.994 (see the output file for analysis 1, Table 17.1), is significant, too. These are relatively poor results for such a small model. The computer was unable to calculate all possible standardized residuals, which is not surprising in a small sample. None of the normalized residuals is significant, but these tests are even less powerful than standardized residuals.

TABLE 17.3. Maximum Likelihood (ML) and Two-Stage Least Squares (2SLS) Estimates for Analyses of a Two-Factor Model of Family-of-Origin Experiences and Marital Adjustment

Parameter                ML a                       2SLS b
Factor loadings
Marital adjustment
  Problems               1.000 (—) .840             1.000 (—)
  Intimacy               .691 (—) .885              .805 (.155)
Family of origin
  Father                 1.000 (—) .929             1.000 (—)
  Mother                 .935 (.091) .850           .899 (.089)
  Father-Mother          .821 (.100) .710           .787 (.099)
Error variances
  Problems               335.923 (80.080) .294      422.427 (—)
  Intimacy               106.214 (34.373) .216      50.871 (—)
  Father                 24.457 (11.060) .138       20.856 (—)
  Mother                 51.489 (11.730) .278       54.195 (—)
  Father-Mother          101.712 (16.070) .497      103.301 (—)
Factor variances and covariance
  Marital                806.040 (132.946) 1.000    702.393 (—)
  Family                 153.095 (26.669) 1.000     158.501 (—)
  Marital ↔ Family       164.822 (42.788) .469      157.495 (—)

Note. Loading for intimacy fixed to .691 in the ML results; N = 103.
a Estimate (standard error) standardized. b Estimate (standard error).
bEstimate (standard error).


TABLE 17.4. Correlation Residuals for Maximum Likelihood (ML) and Two-Stage Least Squares (2SLS) Analyses of a Two-Factor Model of Family-of-Origin Experiences and Marital Adjustment
Variable 1 2 3 4 5
ML
1. Problems 0
2. Intimacy –.004 0
3. Father –.101 .036 0
4. Mother .030 .048 .002 0
5. Father-Mother .035 .056 .003 –.016 0

2SLS
1. Problems 0
2. Intimacy –.010 0
3. Father –.086 .001 0
4. Mother –.086 .001 .003 0
5. Father-Mother –.086 .001 .003 .002 0

The MIIVsem package does not compute either model-implied correlations for indicators or correlation residuals. To obtain the correlation residuals for the 2SLS estimator in Table 17.4, I used lavaan to specify the model in Figure 17.1 but where all unstandardized parameters are fixed to equal their 2SLS counterparts in Table 17.3. Next, I fitted the model with fixed parameters as just described to the covariance matrix in Table 14.6. The predicted correlation matrix in this analysis is based on the 2SLS parameter estimates, and getting the correlation residuals is as easy as specifying this output option in lavaan—see the syntax file for analysis 2, Table 17.1. These residuals from the lavaan analysis are reported in the bottom part of Table 17.4 for the 2SLS solution. None of the absolute correlation residuals based on 2SLS results exceeds .10, including the residual for the indicator pair problems and father. In terms of local fit, I prefer the results for the 2SLS estimator over those for the ML estimator in this example.
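The general strategy is easy to sketch in lavaan syntax. The fragment below is a minimal sketch, not the book's actual syntax file: the indicator names and the object fo.cov (standing in for the covariance matrix in Table 14.6) are hypothetical, and the fixed values are the 2SLS estimates from Table 17.3:

```r
library(lavaan)

# Fix every unstandardized parameter to its 2SLS estimate (Table 17.3),
# so the fitted model has no free parameters
model2sls <- '
  marital =~ 1*problems + .805*intimacy
  family  =~ 1*father + .899*mother + .787*fathermother
  problems ~~ 422.427*problems
  intimacy ~~ 50.871*intimacy
  father ~~ 20.856*father
  mother ~~ 54.195*mother
  fathermother ~~ 103.301*fathermother
  marital ~~ 702.393*marital
  family  ~~ 158.501*family
  marital ~~ 157.495*family
'
fit2sls <- cfa(model2sls, sample.cov = fo.cov, sample.nobs = 103)

# Correlation residuals based on the fixed (2SLS) solution
lavResiduals(fit2sls, type = "cor.bollen")
```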
CONTROLLING MEASUREMENT ERROR IN MANIFEST-VARIABLE PATH MODELS

Failure to control for measurement error in both standard regression analysis and manifest-variable path analysis can lead to parameter estimates biased in directions and magnitudes that can be difficult to predict and also to inflation of Type I error rates in significance testing (Cole & Preacher, 2014; Westfall & Yarkoni, 2016). An option is single-indicator respecification, which moves the effects of measurement error from the structural model to the measurement model (e.g., Figures 15.4–15.6). The method requires specification of the score reliability for a single indicator. There are four basic options (Savalei, 2018):

1. Estimate reliability in the data (i.e., the researcher's sample).

2. Specify reliability based on results in other samples, such as the normative sample described in a test manual.

3. Fix reliabilities to arbitrary-but-common values in the research literature, such as .70 or .80.

4. Specify reliability based on prior knowledge or experience with a measure about the proportion of total variance due to measurement error. Ideally, this choice is preregistered as part of the analysis plan.

A drawback of Options 1 and 2 just listed is that empirical reliability coefficients are subject to sampling error, which can add variability to parameter estimates based on single-indicator respecification (Oberski & Satorra, 2013). Empirical estimates based on the alpha coefficient—the most widely reported type of reliability coefficient—can be problematic, too.


This is because if the assumptions of unidimensionality and tau-equivalence are implausible, then alpha is not an accurate reliability measure—see the Psychometrics Primer on this book's website.

In computer simulations for models with a single indirect effect among three conceptual variables, Savalei (2018) compared three versions of single-indicator (SI) respecification in path models: SIa, based on the alpha coefficient estimated in the data, and two variations based on fixed reliabilities of either .70 or .80—respectively, SI7 and SI8—with analyses of a multiple-indicator common factor model in small samples (N = 30–200). Single indicators in the path models were composites of the corresponding blocks of indicators in the common factor model. Other simulation variables included the magnitude of an indirect effect, the size of the direct effect of the causal variable on the outcome variable, and three levels of population composite reliability (rXX = .65, .75, .85) selected so that the SI7 and SI8 models were never exactly correct.

Savalei (2018) reported that bias in estimates for path models with no control for measurement error was substantial and ranged from about 10% to 40% when reliability is low. Results for fixed-reliability models SI7 and SI8 suggested appreciably lower amounts of bias when reliabilities are, respectively, rXX = .65 or .75 and rXX = .75 or .85 (i.e., the specified reliability is close to the population value). But if the SI7 or SI8 models underestimated reliability, such as SI7 (i.e., reliability is .70) when rXX = .85, then the degree of bias could be relatively extreme, exceeding 80% in some conditions. The power of fixed-reliability SI methods under the same conditions was also lower compared with SIa. In larger samples (e.g., N = 200), where mainly bias, and not sampling error in reliability estimates, affects estimate accuracy, the SIa method was optimal. The fixed-reliability methods SI7 and SI8 performed nearly as well, but only if the "guess" (.70 or .80) about reliability on which each method is based is close to the true value. None of the methods studied by Savalei (2018) had sufficient power to detect smaller direct effects even when N = 200. Overall, Savalei (2018) recommended the use of fixed-reliability SI methods in small samples.
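To make the single-indicator method concrete, here is a minimal lavaan sketch, assuming a path model with single indicators x and y in a hypothetical data frame d and a priori reliabilities of .80 for both measures (all names and values here are illustrative, not from Savalei, 2018):

```r
library(lavaan)

rxx <- .80  # assumed score reliability for x
ryy <- .80  # assumed score reliability for y

# Fix each indicator's error variance to (1 - reliability) * observed variance,
# which moves measurement error out of the structural model
model <- paste0('
  Fx =~ 1*x
  x ~~ ', (1 - rxx) * var(d$x), '*x
  Fy =~ 1*y
  y ~~ ', (1 - ryy) * var(d$y), '*y
  Fy ~ Fx
')
fit <- sem(model, data = d)
summary(fit)
```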
ADJUSTED TEST STATISTICS FOR SMALL SAMPLES

Just as there are corrected model chi-squares that account for nonnormality (e.g., the Satorra–Bentler scaled chi-square; Equation 10.7), there are also versions that reduce the tendency of the uncorrected chi-square (Equations 10.1–10.2) to reject too many true models in small samples, especially when the number of observed variables exceeds 30 or so. Deng et al. (2018) reviewed about a dozen alternative test statistics based on corrections by Bartlett (1950), Swain (1975) (cited in Deng et al., 2018), or Yuan (2005), among others, that account for sample size, model size, or distributional characteristics. Jiang and Yuan (2017) described more recent corrected chi-squares for small samples and nonnormal distributions.

Rosseel (2020) reviewed computer simulation studies of adjusted model chi-squares for small samples published since 2007. Although some corrected test statistics performed better than others over variations in sample size or model size for normal or nonnormal distributions, no single version worked best in all situations. This assessment concurs with the observation by Jiang and Yuan (2017, p. 493) that "there does not exist a test statistic that performs universally well" in small samples and that more work in this area is needed. Given all these results, Rosseel (2020) suggested that perhaps the use of the chi-square test in small-sample analyses should be abandoned.

Alternatives include the use of confidence intervals or tests of exact model fit or close model fit based on correlation residuals described by Maydeu-Olivares et al. (2018). The test statistics are standardized measures of model misfit (Maydeu-Olivares, 2017a). They include the SRMR, or the square root of the average squared covariance residual, where each residual is standardized by the product of the standard deviations for the corresponding pair of measured variables, and the correlation root mean residual (CRMR), or the square root of the average squared correlation residual. Values for the SRMR and CRMR are not always equal for the same model and data, but their values should be similar. Unlike Schuberth et al.'s (2018) bootstrapped SRMR-based global fit test for composite models (Chapter 16), the Maydeu-Olivares et al. (2018) tests rely on unbiased estimators of the population SRMR and CRMR (i.e., there is no bootstrapping). In computer simulations by Maydeu-Olivares et al. (2018), their tests performed reasonably well in small samples of N = 100–200 for a model with two common factors (i.e., a relatively small model).

The confidence intervals and tests just described are implemented in lavaan, and the results are accessed through the lavResiduals() function. The CRMR is the test statistic for Bollen-type correlation residuals, and the test statistic for Bentler-type correlation residuals is the SRMR (see Topic Box 9.3 for a summary of the two types of correlation residuals just mentioned).


The method also generates a test statistic for each correlation residual (Maydeu-Olivares, 2017a). These test statistics are normal deviates (i.e., z), but I would not refer to them as "standardized residuals," which in simultaneous estimation methods are test statistics for covariance residuals, not correlation residuals.

Test statistics in the Maydeu-Olivares et al. (2018) method for both types of correlation residuals available in lavaan are included in the output file for analysis 1 of Figure 17.1 with the default ML estimator (Table 17.1). For Bollen-type residuals, CRMR = .044 with an estimated standard error of .015. The result for the exact-fit test is p = .307, so the null hypothesis that the population and model-implied correlations are equal (Equation 16.1) is retained at the .05 level. The unbiased version of the CRMR, designated as "ucrmr" in the output, equals .023, and the 90% confidence interval is [–.023, .071], which includes zero within its bounds. The standard chi-square test of the exact-fit hypothesis that the population and model-implied covariances are equal (Equation 9.3)—or chiML(5) = 8.449, p = .133—points to a similar conclusion at a somewhat more extreme level of statistical significance. Again, the small sample size in this analysis (103 cases) limits the power of these global fit tests. Exercise 4 asks you to describe the results of the Maydeu-Olivares et al. (2018) tests based on Bentler-type correlation residuals for the same model and data.
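In lavaan, these residual-based statistics can be requested from a fitted model object. A minimal sketch, assuming fit is the lavaan object for the model in Figure 17.1:

```r
# Bollen-type correlation residuals; the summary output includes the CRMR,
# its standard error, the unbiased estimate ("ucrmr"), and its 90% CI
lavResiduals(fit, type = "cor.bollen")

# Bentler-type correlation residuals, summarized by the SRMR (Exercise 4)
lavResiduals(fit, type = "cor.bentler")
```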
It also selects predictors for removal, but the elastic net
method is preferred when (1) there is extreme multicol-
BAYESIAN METHODS AND REGULARIZED SEM

Two more advanced methods are mentioned next, but it is beyond the scope of this chapter to describe them in detail; see the sources cited next for more information. Bayesian estimation of structural equation models in small samples involves the specification of informative priors, which are probability distributions for values of a target parameter based on specific knowledge, such as the results from previous empirical studies. In contrast, weakly informative priors represent some, but not all, current knowledge, and such probability distributions are typically wider (more variable, less precise) than those for informative priors.4 In small samples, the specification of informative priors helps the computer to narrow the range of parameter estimates and in this way is analogous to user-specified starting values in standard (non-Bayesian) ML estimation. In computer simulations, Smid and Rosseel (2020) found that Bayesian estimation with informative priors generally outperformed frequentist estimation methods, such as ML, when analyzing common factor models in small samples. Both frequentist methods and Bayesian estimation with weakly informative priors were also more prone to iteration failure or Heywood cases in such samples. See McNeish (2016) for more information about using Bayesian methods for SEM in small samples.

Penalized likelihood estimation—also called regularization methods—works by penalizing parameters in complex models according to criteria specified by the researcher. If the penalty exceeds the criterion value during estimation, the corresponding parameter is automatically removed, which simplifies the model. The goal is to avoid overfitting the model. Three basic regularization methods in regression analysis are the ridge, lasso, and elastic net. The ridge method and lasso (least absolute shrinkage and selection operator) method minimize both the residual sum of squares and a penalty factor that controls the amount of shrinkage. Penalized coefficients in ridge regression never equal zero, but lasso regression drives some coefficients to equal zero, which removes the corresponding predictor from the equation. The elastic net method incorporates penalties from both the ridge and lasso methods. It also selects predictors for removal, but the elastic net method is preferred when (1) there is extreme multicollinearity among predictors or (2) there are more predictors than cases (Zou & Hastie, 2005).

Jacobucci et al. (2016) described regularized structural equation modeling (regularized SEM) as extending the ridge, lasso, and related penalties to the analysis of structural equation models. It adds penalties to model parameters specified by the researcher and offers an alternative strategy for model selection. For example, instead of selecting among a discrete number of alternative models in trimming or building (Chapter 11), the method imposes a sequence of penalties for specific parameters that allows models to be compared on a more or less continuous scale. Penalties can also be gradually increased as a further safeguard against overfitting. An advantage of regularized SEM is its flexibility: Penalties can be added (or not) to any part of a structural equation model. It also supports more exploratory analyses through reduction of model complexity.

4 I avoided use of the more controversial term "noninformative priors," which are typically based on minimal prior knowledge instead of absolutely none (Chakraborty & Ghosh, 2012).


Estimates based on regularized methods are biased toward zero (they are too small in absolute value), but reducing model complexity generally decreases error variance, which could be a reasonable trade-off in small samples (Liang & Jacobucci, 2020). The R package regsem (Jacobucci, 2021) implements regularized SEM. Models are specified and fitted to the data in lavaan before regularization methods are applied by regsem, as in the sketch that follows.
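A minimal sketch of that workflow, with a hypothetical model string and data frame d; the argument names follow the regsem documentation as best recalled, so check ?regsem before use:

```r
library(lavaan)
library(regsem)

# Step 1: specify and fit the model in lavaan as usual
fit <- sem(model, data = d)   # model and d are placeholders

# Step 2: apply a lasso penalty to the factor loadings and inspect
# how estimates change for a given penalty value (lambda)
reg <- regsem(fit, lambda = .05, type = "lasso", pars_pen = "loadings")
summary(reg)

# cv_regsem() refits the model over a grid of lambda values and helps
# select the penalty by fit criteria such as the BIC
cv <- cv_regsem(fit, type = "lasso", pars_pen = "loadings")
```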

SUMMARY

Deng et al. (2018) cautioned that blind application of SEM methods intended for large samples can generate misleading results in small samples or fail outright due to technical problems in the analysis. Thus, researchers who intend to analyze structural equation models in small samples ought to have a set of coping skills. Suggestions for analyzing common factor models include the imposition of constraints on loadings for indicators of the same factor; estimating measurement parameters before structural parameters; the use of noniterative, single-equation estimators with model-implied instruments; and analyzing parcels instead of items as indicators. Applying the method of single-indicator respecification with a priori values for score reliability might reduce bias due to measurement error when analyzing manifest-variable path models in small samples. Other possibilities available to researchers without advanced quantitative training include replacement of the model chi-square test with versions based on correlation residuals that may perform better in small samples. Above all, specify models that are as parsimonious as possible while still representing core hypotheses when sample size is not large. As Cudeck and Henly (1991, p. 513) suggested, "It may be better to use a simple model in a small sample rather than one that perhaps is more realistically complex but that cannot be accurately estimated."

LEARN MORE

Deng et al. (2018) describe issues and possible remedies for SEM analyses in small samples with large numbers of variables, Marcoulides et al. (2023) discuss the analysis of large models in small samples and suggest ways to cope, and the edited volume by van de Schoot and Miočević (2020) is a valuable resource for researchers who apply complex statistical techniques, including SEM and multilevel modeling, in small samples.

Deng, L., Yang, M., & Marcoulides, K. M. (2018). Structural equation modeling with many variables: A systematic review of issues and developments. Frontiers in Psychology, 9, Article 580.

Marcoulides, K. M., Yuan, K. H., & Deng, L. (2023). Structural equation modeling with small samples and many variables. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 525–542). Guilford Press.

van de Schoot, R., & Miočević, M. (Eds.). (2020). Small sample size solutions: A guide for applied researchers. Routledge.

EXERCISES

1. Show that dfM = 5 for Figure 17.1 when unstandardized loadings for the problems and intimacy variables are fixed to equal, respectively, 1.0 and .691.

2. Rerun analysis 1 in Table 17.1 but constrain the loadings for both indicators of marital adjustment, problems and intimacy, to 1.0. Comment on the results.

3. In Table 17.3 for the MIIV-2SLS estimates, compute the factor correlation.

4. For analysis 1 in Table 17.1, describe results for the SRMR-based confidence intervals and significance tests for the Bentler-type residuals.



18

Categorical Confirmatory Factor Analysis

In basic CFA models, the indicators are continuous variables that are generally analyzed using methods for
continuous data, such as default ML for normal distributions or robust ML for nonnormal distributions (Chap-
ter 14). In categorical CFA, the indicators are ordered categorical variables with (1) an intrinsic
ordering among two or more response categories, but where (2) the intervals between those categories are
not assumed to be equal (i.e., the data are ordinal). An example is when individual Likert items (Likert,
1932) are specified as the indicators in a reflective measurement model. Such items have a discrete number
of response categories. In the detailed example for this chapter, the items involve the frequency of symptoms
for depression experienced in the prior week. The response options for all items, and the corresponding
numerical values that represent each option in the raw data file, are listed next:

0 = < 1 day, 1 = 1–2 days, 2 = 3–4 days, 3 = 5–7 days (18.1)

The numerical scale just listed (0–3) can distinguish only four levels of frequency per week, and it would
be hard to argue that intervals between adjacent categories are all equal. Also, these particular numerical
values (i.e., 0, 1, 2, 3) are arbitrary with no objective or theoretical basis. This is because alternative coding
schemes, such as (1, 3, 5, 7) for the same response options, would work just as well as any other set of four
numbers in either ascending or descending order where the distance between successive categories is the
same. Accordingly, means, variances, and covariances among Likert items are also arbitrary. The estimation
methods for continuous endogenous variables generally analyze covariance matrices, but such summary
statistics for Likert items are generally meaningless for the reasons just explained. Another problem is that
covariances include Pearson correlations (i.e., covXY = rXY SDX SDY ), and such correlations assume continu-
ous variables.

BASIC ESTIMATION OPTIONS FOR CATEGORICAL DATA

Results from computer simulation studies generally indicate that use of estimators for continuous variables to analyze measurement models with Likert items as indicators is potentially problematic. For example, Rhemtulla et al. (2012) reported that for Likert items with fewer than 5 response categories, the robust ML method for continuous variables generated biased estimates of factor correlations, and it generated standard errors that were generally too small. This distortion generally inflates the rate of Type I errors in significance tests of parameter estimates. Values of model test statistics, such as mean-and-variance adjusted chi-squares that correct for nonnormality (Chapter 10), were generally too high when (1) population response distributions were markedly skewed, or asymmetrical, and (2) patterns of asymmetry alternated over items in the same model. Under these conditions, the chi-square test rejected too many correctly specified models.


Otherwise, the adjusted chi-square under robust ML was more accurate in detecting misspecified models except in small samples, such as N < 100 or so, where power is low.

The results from Rhemtulla et al. (2012) just summarized are generally consistent with those reported by Bernstein and Teng (1989), who in computer simulations categorized continuous indicators in population factor models to form ordered-categorical indicators analyzed in sample models. They found that when there is a single factor in the population but indicators have few categories, such as 2–4, one-factor models tend to be rejected too often. That is, categorization can spuriously suggest the presence of multiple factors. Li (2016) reported that the ML method substantially underestimates factor loadings when ordinal indicators have only 4 response categories. There is also appreciable attenuation in psychometric precision in responses to personality questionnaire items for Likert items with 2–5 response options compared with items with at least 6 options (Simms et al., 2019).

But when Likert items have at least 5–7 response categories, the accuracy of the ML method with adjusted test statistics was better, especially when population response distributions were approximately symmetrical (Rhemtulla et al., 2012). In this case, the ML option performed as well as or even better than the estimation method for categorical data described in the next section, which represents Likert items as discretized indicators of underlying (latent) and normally distributed response variables that are continuous. The option just mentioned is called continuous/categorical variable methodology (CCVM) by Muthén (1984) and the latent response formulation (LRF) by Koziol et al. (2023), but just the term "CCVM" is used from this point. In this approach, Likert items-as-indicators are not analyzed as though they were continuous variables. But when response distributions for Likert items were very asymmetrical in Rhemtulla et al.'s (2012) simulations, the CCVM method generated more accurate estimates of factor loadings than robust ML, especially when population factor loadings were relatively high. But relative bias in ML estimates was generally less than 10%, again for items with at least 5 response options.

To summarize, there may be relatively little harm in treating data from ordinal indicators as continuous when the number of response categories is ≥ 5 and distributions of item responses are generally symmetrical. An advantage is that models analyzed with methods for continuous variables are simpler than models analyzed with special methods for ordinal data such as in CCVM. Another advantage is that the ML estimator for continuous data generally requires smaller samples compared with methods used for CCVM. But I would not recommend estimation methods for continuous data in CFA when categorical variables have 2–4 categories or when there are ≥ 5 categories but response distributions are very skewed. In such cases, the CCVM approach is a better option.

OVERVIEW OF CONTINUOUS/CATEGORICAL VARIABLE METHODOLOGY

Muthén's (1984) CCVM is a general strategy for analyzing common factor models with any combination of continuous or categorical observed variables as indicators. The CCVM is distinguished by the features listed next and elaborated afterward:

1. Input for the analysis recognizes the presence of categorical variables among the indicators.

2. Parts of the measurement model for categorical variables are specified in a different way compared with continuous indicators. Briefly, continuous indicators are specified as directly regressed on their common factors, such as when analyzing basic CFA models (e.g., Figure 14.1). In contrast, each categorical variable is associated with a theoretical continuous variable estimated in the data, and that estimated continuous variable is regressed on a common factor, not the corresponding observed (categorical) variable.

3. The factor model is fitted to a correlation matrix, not a covariance matrix. But that correlation matrix is not observed; that is, it is not computed with measured variables. Instead, the matrix analyzed is the estimated correlation matrix for the theoretical continuous variables presumed to underlie the categorical data (and perhaps also the continuous data for models with both continuous and categorical indicators).

4. An estimator that makes no assumptions about indicator distributions is generally used. Corrected test statistics and a method to generate robust standard errors for parameter estimates can also be generated.


LATENT RESPONSE VARIABLES AND THRESHOLDS

In CCVM, each observed categorical variable is associated with a latent response variable, or the underlying amount of a continuous and normally distributed continuum required to respond in a certain way on the corresponding indicator. This amount, or threshold, is the point on the latent response variable where one response option is given (e.g., strongly agree), if the threshold is exceeded. It is also the point where the next lowest response option is given (e.g., agree), if the threshold is not exceeded. Dichotomous (binary) items, such as those coded as 0 for "false" or "incorrect" or as 1 for "true" or "correct," have a single threshold that marks the point on the underlying dimension where the observed response shifts from false/incorrect to true/correct. The number of thresholds for polytomous items with three or more response categories equals the number of categories minus one.

Suppose that item X has the 3-point Likert scale listed next:

1 = disagree, 2 = neutral, 3 = agree

The scale just described is considered to be a crude categorization of X*, the underlying latent continuous response variable. Item X has two threshold parameters, τ1 and τ2 (lowercase Greek letter tau). When X* has a mean of zero and a variance of 1.0, thresholds are values of the normal deviate z that divide a normal distribution into ordered categories and thus relate discrete responses on X to continuous X* values. Specifically, the data generating process is considered to be as follows:

X = 1, if X* ≤ τ1;
X = 2, if τ1 < X* ≤ τ2;    (18.2)
X = 3, if X* > τ2.

In other words, an observed response of "1" (disagree) is predicted if the level of X* is less than or equal to that of τ1 in standard deviation units. For levels of X* greater than τ1 but less than or equal to τ2, the predicted response is "2" (neutral), and X* > τ2 corresponds to a response of "3" (agree) on item X. An example follows.

Presented in Figure 18.1(a) is the histogram of responses to a hypothetical item X with a 3-point Likert scale. Cumulative probabilities over the three categories are also shown in the figure. Thresholds are based on the cumulative response probabilities. For example, the cumulative probability for endorsing "1" (disagree) is .25, so τ1 = –.674, which is the value of the normal deviate z that falls at the 25th percentile in a normal curve. You should verify this statement with an online normal distribution calculator.1 The same result also says that responses to item X are expected to shift from "1" (disagree) to "2" (neutral) when the level of X* increases to about two-thirds of a standard deviation below the mean. Exercise 1 asks you to interpret the result τ2 = .253 in Figure 18.1(b), given the cumulative probability of .60 over the response categories of "1" (disagree) and "2" (neutral) together.
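These values, and the generating process in Equation 18.2, are easy to verify directly in R; the simulated sample below is purely illustrative:

```r
# Thresholds are normal deviates at the cumulative response proportions
qnorm(.25)  # tau1 = -0.674
qnorm(.60)  # tau2 =  0.253

# Mimic Equation 18.2: discretize a standard normal X* at the thresholds
set.seed(1)
xstar <- rnorm(10000)
x <- cut(xstar, breaks = c(-Inf, -.674, .253, Inf), labels = c(1, 2, 3))
prop.table(table(x))  # roughly .25, .35, and .40 of responses
```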
When analyzing Likert items as indicators in CCVM, there is no requirement that all items have the same number of response categories. This is because it is not a special problem to analyze items with different response scales (e.g., some items are true–false, others have ≥ 3 response categories). The total number of thresholds for each item is just one less than the number of its response options.

POLYCHORIC CORRELATIONS

For a set of categorical indicators, estimated thresholds and sample cross-tabulations of item responses are used by the computer to estimate the matrix of Pearson correlations between the latent response variables. These estimated correlations are called polychoric correlations.2 Modern computer programs generally use simultaneous methods to estimate the whole polychoric correlation matrix. Pairwise estimation of Pearson correlations is an older method that required less computer memory, but this method is less relevant today given the availability of personal computers with relatively large memory capacities. Fortunately, modern computers can simultaneously estimate the whole correlation matrix in many, if not most, applied studies with ordinal data.

1 https://onlinestatbook.com/2/calculators/normal_dist.html

2 For dichotomous items where all cross-tabulations are 2 × 2 contingency tables, the polychoric correlation is called the tetrachoric correlation, but I use the more general term "polychoric correlation," which includes ordinal items with two or more response categories.


[Figure 18.1 appears here. Panel (a): cumulative probability histogram for Likert-scale item X, with cumulative proportions .25 (disagree), .60 (disagree or neutral), and 1.00 (all categories). Panel (b): the latent response variable X* on a standard normal scale with thresholds τ1 = –.674 and τ2 = .253.]

FIGURE 18.1. Histogram for observed responses on hypothetical item X with a 3-point Likert scale (a) and the corresponding latent response variable X* with thresholds, τ (b).

Polychoric correlations are typically estimated under the assumption that the joint distribution of the continuous latent response variables is multivariate normal, which implies that their univariate distributions are normal and that their associations with each other are strictly linear (Chapter 4). Homoscedasticity is also assumed; that is, the latent response variables have equal variances. Some computer tools, such as PRELIS in LISREL, offer methods to evaluate the normality assumption when estimating polychoric correlations for each pair of ordinal indicators (Jöreskog & Sörbom, 2021). There is evidence that estimates of polychoric correlations are generally robust to moderate nonnormality, such as when absolute values of standardized skew and kurtosis do not exceed, respectively, 1.25 and 3.75. But they are not robust to extreme nonnormality, such as when absolute standardized skew and kurtosis exceed, respectively, 5.0 and 50.0 (Flora & Curran, 2004).
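In lavaan, the polychoric correlation matrix itself can be inspected with the lavCor() function. A minimal sketch, assuming a data frame d with hypothetical ordinal items x1 through x5:

```r
library(lavaan)

# Estimate polychoric correlations among the latent response variables
# by declaring the items as ordered categorical
pc <- lavCor(d[, c("x1", "x2", "x3", "x4", "x5")],
             ordered = c("x1", "x2", "x3", "x4", "x5"))
pc
```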


MEASUREMENT MODEL AND DIAGRAM

When all observed variables are categorical, the measurement model analyzed in CCVM consists of the latent response variables as the continuous indicators of the common factor(s). Both the number of common factors and the correspondence between factors and their indicators, the latent response variables, are specified by the researcher just as in "regular" CFA when analyzing continuous indicators only. Relations between latent response variables and common factors are assumed to be linear. However, relations between latent response variables and their observed counterparts are generally nonlinear. An example follows.

Presented in Figure 18.2 is the diagram for the single-factor CFA model analyzed for the detailed example described later in this chapter. The observed variables X1–X5 are items on a self-report questionnaire. The response scale for all items is the 4-point Likert scale defined in Equation 18.1. Thus, each item has a total of 4 – 1, or 3, thresholds that are estimated in the analysis. Each item in the diagram is associated with a latent response variable X* by the graphical symbol for a unidirectional "zigzag" that represents the set of threshold parameters for each item (e.g., Edwards et al., 2012, p. 198). For example, the notation τ11–τ13 in the figure represents the three thresholds for item X1, and so on for each of the remaining items. This threshold structure represents the presumed causal effects of latent response variables on observed variables (items). Specifically, when the level on a latent response variable X* exceeds a threshold, the observed response on X is expected to shift to the next response category (e.g., Equation 18.2).

The common factor in Figure 18.2 is designated as A, and its indicators are the five latent response variables, each of which has an error term because they are all endogenous. Unlike error terms for continuous indicators, which represent unique variance as the sum of specific variance and random measurement error, the error terms for the latent response variables represent only specific variance not shared with the common factor. This is because only observed, not latent, variables are affected by measurement error. The error terms in the figure are specified as independent, but correlated errors for latent response variables can be estimated within the limits of identification requirements for doing so with observed variables (Appendix 14.A), and with a solid rationale, too.

[Figure 18.2 appears here: items X1–X5, each linked through its thresholds (τ11–τ13 through τ51–τ53) to a latent response variable X1*–X5*, with all five latent response variables loading on the common factor A.]

FIGURE 18.2. Single-factor model for five Likert-scale items (X), each with four response categories shown with item thresholds (τ), latent response variables (X*), and compact graphical symbolism for error terms. In delta scaling, the total variance for each X* is fixed to equal 1.0. In theta scaling, the error variance for each X* is fixed to equal 1.0.

METHODS TO SCALE LATENT RESPONSE VARIABLES

Latent response variables X* in Figure 18.2 require scales or metrics so that the computer can estimate their parameters. There are two general methods: In delta scaling (parameterization), the total variance of each latent response variable is fixed to 1.0. This metric is consistent with that of the polychoric correlations, which assume variances of 1.0 for the latent response variables. In the standardized solution of delta scaling, where all common factor variances are fixed to 1.0, the factor loadings estimate the amount of standard deviation change in a latent response variable, given a change of one standard deviation in the common factor. Also, thresholds are normal deviates that correspond to cumulative areas under the curve to the left of a particular response category (Finney & DiStefano, 2013). These interpretations are familiar and straightforward.

Specification of theta scaling (parameterization) instead of delta scaling does not change model fit in a single-sample analysis. The main difference is that in theta scaling the error variance (not the total variance) of each latent response variable is fixed to 1.0. This metric for the error variance is consistent with the scaling in probit regression.


In the unstandardized solution, factor loadings estimate the amount of change in probit (normal deviate) units in the latent response variable for every 1-point change in the common factor. Thresholds are predicted normal deviates between the next lowest and next highest response categories where the latent response variable is not standardized (its total variance is not 1.0). Interpretation of unstandardized estimates in theta scaling can be more challenging, given that the total variance of each latent response variable is not 1.0. Fortunately, the completely standardized solution in theta scaling is identical to the corresponding solution in delta scaling, which is easier to interpret than the unstandardized solution in theta scaling.

ESTIMATORS, ADJUSTED TEST STATISTICS, AND ROBUST STANDARD ERRORS

In the upcoming detailed example, the computer fits the model in Figure 18.2 to the polychoric correlation matrix for the latent response variables. In lavaan, the researcher must specify an estimator, which implies corresponding default methods for computing an adjusted (scaled) model test statistic and generating robust standard errors for parameter estimates (Chapter 10). It may be possible to select a different combination, such as a researcher-specified method for computing robust standard errors—see Rosseel et al. (2023).

One option for the estimator is fully weighted least squares (WLS), which requires no distributional assumptions for any observed variable. Thus, it can be applied to the analysis of continuous or categorical data in CFA (and in other kinds of SEM analyses, too). The weight matrix for the WLS estimator is the matrix of variances and covariances of the estimated polychoric correlations and also the thresholds, depending on the particular computer program (Finney & DiStefano, 2013). This matrix is also the asymptotic covariance matrix for the parameters just mentioned. A problem with the WLS estimator is that the size of its weight matrix can be so large that the computer may be unable to derive its inverse (Chapter 9). For example, if 250 items in the same questionnaire are analyzed in the WLS method, the weight matrix represents all the latent response variables linked with those items and perhaps all item thresholds, too. Another challenge is that very large samples are generally needed for the WLS estimator, which may preclude its use in many applied studies.

An alternative estimator is diagonal weighted least squares or diagonally weighted least squares (DWLS), which is intended to reduce the computational complexity of the WLS estimator and avoid the necessity for very large samples. Specifically, the weight matrix for DWLS consists of just the diagonal in the full weight matrix for WLS; that is, the off-diagonal elements in the DWLS weight matrix are constrained to equal zero. Using only the diagonal elements from the full WLS weight matrix would by itself result in biased values of model test statistics and standard errors, but information from the full WLS weight matrix can be summarized in a different, less computationally complex way than in the WLS estimator to generate robust standard errors and corrected model test statistics. These simpler computational methods do not affect the parameter estimates, only their standard errors and the value of the model chi-square (Finney & DiStefano, 2013). This parallels the relation between default ML and robust ML for continuous data: Both methods generate the same parameter estimates, but not generally the same standard errors or values of the model chi-square, given nonnormal distributions.

The combination of the DWLS estimator with methods to compute robust standard errors and scaled test statistics is called robust DWLS or just robust WLS. The keywords in lavaan for specifying an estimator or the combination of an estimator with robust computation of test statistics or standard errors (Rosseel et al., 2023) are summarized next, followed by a brief syntax sketch:

1. Option "DWLS" specifies the DWLS estimator. There are separate options to request post-estimation adjustment of the model chi-square or standard errors.

2. Option "WLSMV" specifies the combination of the DWLS estimator with robust standard errors and a mean-and-variance adjusted model chi-square that generally follows central chi-square distributions with correct means and variances in large samples.

3. Option "WLSM" also specifies the DWLS estimator with robust standard errors, but the model chi-square is only mean adjusted.

Option 2 just listed may be somewhat more accurate than Option 3 over random samples, but the difference may not be striking (Savalei & Rhemtulla, 2013).
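A minimal lavaan sketch of these options, assuming a data frame d with hypothetical ordinal items x1 through x5 for a single factor A:

```r
library(lavaan)

model <- 'A =~ x1 + x2 + x3 + x4 + x5'

fit <- cfa(model,
           data = d,
           ordered = c("x1", "x2", "x3", "x4", "x5"),  # items are ordinal
           estimator = "WLSMV",          # DWLS + robust SEs + scaled chi-square
           parameterization = "delta")   # or "theta" to fix error variances to 1.0
summary(fit, fit.measures = TRUE, standardized = TRUE)
```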


MODELS WITH CONTINUOUS AND ORDINAL INDICATORS

Models with indicators of mixed types, continuous or ordinal, can also be analyzed in the CCVM approach. As noted by Li (2021), however, relatively little is known about the impact of mixed scale types on the accuracy of results from extant estimation methods. In computer simulation studies of CFA models and structural regression (SR) models with indicators of mixed scale type, Li (2021) compared the performance of the WLSMV method with that of the robust ML method in generating parameter estimates, corrected test statistics, and robust standard errors over conditions with varying sample sizes (N = 200, 500, 1,000), numbers of Likert response categories (2–7), and categorical response distribution shapes (symmetry, slight asymmetry).

The WLSMV method performed generally better in Li's (2021) computer simulations than the robust ML method across most study conditions and, specifically, required a smaller sample size (N = 200) to recover reasonably accurate estimates of model parameters and adjusted test statistics, but estimates of standard errors were not stable unless the sample size was larger (N = 500). The robust ML method tended to substantially underestimate factor loadings for ordinal indicators and overestimate the model chi-square, but bias in estimated path coefficients and standard errors was relatively small. For researchers concerned mainly with estimating structural parameters and less concerned with significance testing outcomes, the robust ML method is a practical alternative to the WLSMV method for indicators with at least five or so response categories.
such as the RMSEA and CFI computed for ordinal 1986). The raw data are available from the website for
data based on the scaled model chi-square in the robust the Inter-university Consortium for Political and Social
DWLS method actually estimate a different quantity Research (ICPSR).3 Through permission of the ICPSR,
than their counterparts in the ML method for con- these raw data are also available on the website for this
tinuous data. One reason is that the ML fit function book.
is related to the difference between the observed and Listed in Table 18.1 are the syntax, data, and output
model-implied correlation matrix, but the DWLS fit files for estimating the model in Figure 18.2 with five
function concerns the difference between the estimated ordinal indicators for a single common factor (depres-
Pearson correlations for latent variables presumed to sion) with the raw data (item responses) for N = 2,004
underlie categorical observed variables and those pre- respondents in lavaan. The scaling method for the
dicted by the model. The two fit functions perfectly latent response variables is delta, and option WLSMV
coincide only when the model is correct and all residu- specified the DWLS estimator with a mean- and vari-
als are zero in very large samples. ance-adjusted chi-square with robust standard errors.
Savalei (2021) described two alternative methods The depression common factor (labeled as factor A
to compute modified versions of the RMSEA and CFI
for categorical data. One proposed solution is to apply 3 https://www.icpsr.umich.edu/web/pages/


TABLE 18.1. Analyses, Script and Data Files, and Packages in R for a Single-Factor Model of Depression with Ordinal Indicators

Analysis                                        Script file
1. Single-factor model for ordinal data,       radloff-categorical-cfa.r
   option "WLSMV"

Note. The output file has the same name except the extension is ".out." The external data file is "radloff.csv." Option "WLSMV" specifies the diagonally weighted least squares estimator (DWLS) with mean- and variance-adjusted test statistics and robust standard errors. The lavaan package was used for the analysis.

The number of observations in this analysis with v = 5 items includes 5(4)/2, or 10, polychoric correlations plus 15 thresholds (3 per item), for a total of 25. The total number of free model parameters is 20, including 15 item thresholds, 4 factor loadings for the latent response variables, and the variance of the common depression factor (see the figure), so dfM = 25 – 20 = 5. In delta scaling, the error variances of the latent response variables are not free parameters (they are fixed to equal 1.0). Because the number of sample versus estimated thresholds is the same (15), each observed threshold will equal its model-implied counterpart; that is, all threshold residuals will be zero.

Estimation in lavaan converged to an admissible solution. Values of selected robust fit statistics are listed next:

chiWLSMV(5) = 17.895, p = .003
RMSEA = .036, 90% CI [.019, .055]
CFI = .994, SRMR = .022

The exact-fit test is failed, so the model is tentatively rejected. The sample size in this analysis is relatively large (N = 2,004), so whether the failed exact-fit test is due more to sample size versus poor local fit (i.e., the residuals) is not yet known. Results on other global fit statistics do not seem grossly problematic, but note that interpretative guidelines for approximate fit indexes developed for models with continuous endogenous variables may not generalize to methods for ordinal data (Xia & Yang, 2019).

Next we consider alternative global significance tests for the categorical CFA model analyzed in this example. These tests are based on the SRMR, which is printed by the lavResiduals() function for Bentler-type correlation residuals (Topic Box 9.3; Chapter 17). In Monte Carlo simulations, Shi et al. (2020) evaluated the accuracy of confidence intervals and significance tests for the close-fit hypothesis based on the SRMR and RMSEA in categorical CFA. Simulation conditions for population two-factor models were based on varying numbers of items (10, 20, or 30) with either 2 or 5 response categories for four different sample sizes (N = 100, 200, 500, 1,000) and five different levels of misspecification based on factor intercorrelations. For unbiased forms of both approximate fit indexes, sample mean values were generally close to the corresponding population values. Confidence intervals and p values for close-fit tests were generally accurate for the SRMR in larger samples (N ≥ 500), but results for the RMSEA indicated appreciably lower accuracy, especially for larger models with ≥ 20 items or for models with greater misspecification. Results for the sample (not bias-corrected) RMSEA were even worse, including in the largest sample size (N = 1,000). In general, confidence intervals and tests of close fit for the RMSEA were inaccurate except for models with fewer items (10) and relatively smaller amounts of misspecification.

For Bentler-type residuals—see the output file for analysis 1, Table 18.1—sample SRMR = .022 with an estimated standard error of .005. The result for the exact-fit test is p = .002, so the model is tentatively rejected, which is the same conclusion as for the scaled chi-square test of exact fit. The value of the unbiased SRMR is .019, and the 90% interval is [.010, .029], so the model passes the close-fit test, assuming the population SRMR is ≤ .050, with a p value that is close to 1.0. Remember that passing a close-fit test does not warrant ignoring a failed exact-fit test (Appendix 10.A). In a moment, we will inspect the correlation residuals for this analysis, but I can tell you now that no obvious local fit problems are indicated, so the single-factor model with ordinal indicators in Figure 18.2 is retained.

Parameter estimates for delta scaling are reported in Table 18.2. The unstandardized factor loadings estimate the amount of change in each latent response variable, given a 1-point change in their common factor, or A (depression) in Figure 18.2. Each standardized loading estimates the Pearson correlation between the depression factor and each latent response variable.

These same coefficients also estimate the amount of change in standard deviation units in the latent response variables, given a change of one full standard deviation in the depression factor. The squares of these coefficients indicate the proportions of explained variance (R2), but these values concern the latent response variables, not the original items (observed ordinal variables). Exercise 2 asks you to respecify the model in Figure 18.2 so that the depression factor is scaled using the effects coding method instead of the reference variable method (Chapter 14). The note for Table 18.2 reports the thresholds (3 per item) with standard errors. These values are just the descriptive statistics for each of the five items because the threshold structure in this analysis has no degrees of freedom (df = 0). Exercise 3 asks you to demonstrate that the thresholds for item X1 generated in lavaan correspond to the cumulative response proportions for this item listed at the beginning of the output file for this analysis.
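That correspondence is easy to verify directly. A minimal sketch, assuming the X1 responses (coded 0–3) are in column x1 of a hypothetical data frame d:

```r
# Thresholds are normal deviates at the cumulative response proportions
p <- cumsum(prop.table(table(d$x1)))  # cumulative proportions for categories 0-3
qnorm(p[1:3])  # should reproduce the X1 thresholds in Table 18.2:
               # .772, 1.420, and 1.874
```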
Listed in the top part of Table 18.3 are the Bentler-type correlation residuals, or differences between the estimated (sample) polychoric correlations and values predicted by the model for the latent response variables; reported in the bottom part of the table are test statistics (z) for the correlation residuals. A total of five correlation residuals, ranging in value from –.046 to .041, differ significantly from zero at the .05 level. No other absolute residuals exceed .046. Thus, these significant test results seem more attributable to sample size, which is reasonably large in this analysis (N = 2,004) for a relatively simple model, than to marked disagreement between model and data at the level of the latent response variables. See Koziol (2023) for additional examples of analyzing dichotomous and ordinal indicators in categorical CFA.

TABLE 18.2. Diagonally Weighted Least Squares Parameter Estimates with Robust Standard Errors for a Single-Factor Model of Depression with Ordinal Indicators

                     Unstandardized          Standardized
Parameter           Estimate     SE         Estimate     SE
Factor loadings
  A → X1*            1.000       —           .609       .028
  A → X2*            1.070      .065         .651       .029
  A → X3*            1.285      .065         .782       .020
  A → X4*            1.004      .056         .611       .023
  A → X5*            1.266      .065         .771       .021
Error variances
  X1*                 .630       —           .630       .034
  X2*                 .576       —           .576       .038
  X3*                 .388       —           .388       .032
  X4*                 .627       —           .627       .028
  X5*                 .406       —           .406       .033
Factor variance
  Depression          .370      .034        1.000        —

Note. Latent response variable (X*) scaling is delta. Item thresholds with standard errors in parentheses: X1, .772 (.031), 1.420 (.041), 1.874 (.056); X2, 1.044 (.034), 1.543 (.044), 1.874 (.056); X3, .541 (.030), 1.152 (.036), 1.503 (.043); X4, .288 (.028), 1.000 (.034), 1.500 (.043); X5, .558 (.030), 1.252 (.038), 1.712 (.049).

OTHER ESTIMATION OPTIONS FOR CATEGORICAL CFA

Additional methods for analyzing models with ordinal indicators are discussed next. The EQS program (Bentler & Wu, 2008) fits models with categorical or continuous indicators to correlation structures using a two-stage method by Lee et al. (1995).


TABLE 18.3. Correlation Residuals and Test Statistics for a Single-Factor Model of Depression with Ordinal Indicators

Variable          1        2        3        4       5
Correlation residuals
X1*                0
X2*               .041     0
X3*              –.005   –.029     0
X4*               .030    .020   –.024     0
X5*              –.046   –.013    .024   –.005    0

Test statistics (z)
X1*                0
X2*              2.043     0
X3*              –.391  –2.102     0
X4*              1.699   1.046  –2.112     0
X5*             –3.545   –.918   3.610   –.503    0

Note. Correlation residuals are the differences between estimated (i.e., sample) and model-implied polychoric correlations.

In the first stage, a special form of ML estimation is used to approximate correlations between the latent response variables. In the second stage, an asymptotic covariance matrix is computed, and the model is analyzed with a form of the generalized least squares (GLS) estimator (Chapter 9) designated as arbitrary GLS (AGLS). The Amos program (Arbuckle, 2021) takes a Bayesian approach to the analysis of ordinal data. It generates posterior distributions of parameter estimates and provides the user with different kinds of graphical displays about the precision of the estimates, but knowledge of Bayesian methods for ordinal data is required—see Taylor (2019) for more information.

Another option is a special form of the full-information maximum likelihood (FIML) estimator for noncontinuous variables that is available in SEM computer tools such as Mplus (Muthén & Muthén, 1998–2017) and OpenMx (Neale et al., 2016). It does not fit the model to a correlation structure over two steps as in the robust WLS method. Instead, it directly analyzes the raw data and estimates the latent response variables using methods for numerical integration. In nontechnical language, this means that the computer attempts to estimate likelihoods of the data within the joint multivariate normal distribution for all v latent response variables, where v is the number of indicators. The joint distribution is defined by a probability integral, which is a construct in calculus that represents areas, volumes, or portions of a probability distribution defined by its integral. The probability distribution for all latent response variables is defined by its v-tuple probability integral, where v is the number of such variables. It is challenging enough to get computers to solve probability integrals for even a single dimension, much less over multiple dimensions, so approximate methods are generally used.

Computer analysis in FIML estimation for categorical data relies on simulated random sampling from estimated joint probability distributions for the latent response variables. One such algorithm is the Markov Chain Monte Carlo (MCMC) method, which is also used in multiple imputation (Chapter 4) to draw random samples from theoretical probability distributions. Another method is adaptive quadrature, which divides a probability integral into progressively smaller intervals until a stopping criterion is met for approximating the integral over each interval. More subdivisions are required for more complex probability integrals, especially if there are sharp peaks or cusps in regions of the distribution (Heath, 2018). Computational requirements greatly increase as the number of dimensions increases, to the point where even modern personal computers with fast processors and large memory capacities may be insufficient for bigger models.


Another drawback is a reduction in information about model fit in the output. For example, the Mplus program prints a small number of fit statistics, such as the AIC and BIC, for models with ordinal indicators estimated in the FIML method, and no residuals are available, depending on analysis options. Very large sample sizes are needed for the method, so it is not a practical alternative to robust DWLS in perhaps many, if not most, applications of categorical CFA.

The special FIML method for noncontinuous variables can potentially extract all available information from an incomplete data file without deleting cases and with no imputation. When data are missing at random (MAR), a full information method will generate unbiased estimates (Chapter 9). Some computer tools, such as Mplus, support multiple imputation in the robust DWLS method for ordinal data (Asparouhov & Muthén, 2021); otherwise, options may be limited to listwise or pairwise deletion, which assumes the data are missing completely at random (MCAR). In computer simulations, Pritikin et al. (2018) compared the performance of FIML, robust DWLS with listwise deletion, and robust DWLS with multiple imputation for models with both ordinal and continuous indicators when the data loss pattern is MAR. The robust DWLS estimator with listwise deletion was clearly outperformed by the other two methods. There was a smaller advantage for FIML over robust WLS with multiple imputation, but whether that advantage translates to a meaningful difference in real data sets is unknown.

ITEM RESPONSE THEORY AND CFA

Both CFA and methods for item response theory (IRT) (see the Psychometrics Primer on this book's website) can analyze data at the level of items or scales (total or other summary scores over items). There are other points of contact between the two methods: It is possible to rescale estimates in categorical CFA, such as factor loadings and thresholds, to equal item discriminations or item difficulties in two-parameter IRT models, and vice versa (Wirth & Edwards, 2007). They share some of the same estimators, such as FIML, and assumptions about relations between categorical items and latent response variables. The Mplus program can estimate the parameters of three- or four-parameter logistic IRT models, including parameters for guessing, or the probabilities that respondents with low levels of ability could get items correct simply by guessing at the answers. It can also analyze items where scoring is based on partial credit instead of dichotomous scoring as correct versus incorrect. In the past, computer tools for IRT were generally limited to the analysis of unidimensional models, but modern IRT software can estimate measurement models with multiple factors just as in CFA.

I believe that categorical CFA and IRT differ more in the context of their application and the potential range of analytical capabilities than in their underlying statistical models. The technique of CFA is usually applied in sample sizes like N = 200–500 that would be considered too small for some of the "industrial strength" applications of IRT. One difference is the capability in IRT to develop tailored tests, or subsets of items generated by the computer that optimally assess an examinee based on the correctness of their previous answers. If the examinee fails initial items, the computer presents easier ones. Testing stops when more difficult items are consistently failed. A reliability coefficient can be estimated for each person, given the particular items administered. In contrast, CFA generally analyzes static measurement models fitted to data from the whole sample, not just individual cases. Very large samples and numbers of test items are required for tailored testing. See Koziol et al. (2023) for more information about IRT models and analyses for categorical data.

SUMMARY

There are special methods for estimating structural equation models with ordinal data, such as items with Likert response scales. When the number of response categories is relatively small, such as < 5, a good practical choice is the robust DWLS method, which fits measurement models with threshold structures to polychoric correlation matrices estimated for latent response variables presumed to underlie the observed variables, plus the option to generate scaled test statistics and robust standard errors. A reasonable alternative for ordinal indicators with at least five response options and nearly symmetrical histograms is robust ML for continuous variables, which may require smaller sample sizes compared with the robust DWLS estimator. Using robust DWLS with multiple imputation for incomplete data files is a better general option than the robust DWLS method with case deletion.


that may be useful in categorical CFA. Full information ML estimation for categorical data based on methods that simulate random sampling from multivariate probability distributions is an alternative to robust DWLS, but very large samples are needed.

LEARN MORE

Finney and DiStefano (2013) give clear descriptions of estimation options for ordered-categorical data, Rhemtulla et al. (2012) compare robust WLS and robust ML for continuous variables as alternatives for models with ordinal indicators, and Shi et al. (2020) describe confidence intervals and test statistics based on correlation residuals for ordinal data.

Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 439–492). IAP.

Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354–373.

Shi, D., Maydeu-Olivares, A., & Rosseel, Y. (2020). Assessing fit in ordinal factor analysis models: SRMR vs. RMSEA. Structural Equation Modeling, 27(1), 1–15.
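Where a concrete starting point helps, the robust DWLS approach summarized above can be requested in lavaan roughly as follows. This is a minimal sketch, not an example from this book: the data frame dat and the Likert items d1–d4 are hypothetical.

library(lavaan)

# One-factor CFA for four hypothetical ordinal (Likert) items
model <- 'depression =~ d1 + d2 + d3 + d4'

fit <- cfa(model,
           data      = dat,                       # hypothetical data frame
           ordered   = c("d1", "d2", "d3", "d4"), # treat items as ordinal
           estimator = "WLSMV",                   # robust DWLS: scaled test
                                                  # statistic, robust SEs
           missing   = "pairwise")                # pairwise deletion (MCAR)

summary(fit, fit.measures = TRUE)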

EXERCISES

1. Interpret τ2 = .253 in Figure 18.1(b), given the information in Figure 18.1(a).

2. Rerun analysis 1 in Table 18.1 but scale the depression factor using the effects coding method instead of the reference variable model. Compare the results for factor loadings in the respecified model as just described with those in the original model.

3. For analysis 1, show that the thresholds for item X1 in Table 18.2 generated in lavaan match the cumulative proportions of responses over the four scale response categories (Equation 18.1) listed at the beginning of the output file for this example.



19

Nonrecursive Models with Causal Loops

The assumptions of recursive structural models—whether comprised of single indicators or proxies (common
factors, composites)—that all causal effects are unidirectional and that disturbances are independent simplify
the analysis. For example, recursive structural models are identified (Rules 15.1, 16.4). Partially recursive
models with unidirectional effects and bow-free patterns of disturbance correlations restricted to pairs of
endogenous variables with no direct effects between them (e.g., Figure 7.4(c)) are also identified (Rule 7.4)
and, thus, can be treated in the analysis just like recursive models. The occurrence of Heywood cases or
failure of iterative estimation is also less likely for recursive models.
But the same assumptions about recursive models that ease the analytical burden are also restrictive.
For example, neither causal loops nor correlated disturbances in a bow pattern, which occur between pairs
of endogenous variables with direct effects between them (e.g., Figure 7.4(d)), can be represented in recur-
sive models. These kinds of effects can be specified in nonrecursive structural models, but they have special
identification requirements that involve instrumental variables (e.g., Figure 6.1(f)). Nonrecursive models with
causal loops (e.g., Figure 7.4(b)) have additional special assumptions about reciprocal causation that are
difficult, if not generally impossible, to verify in the analysis. How to deal with the challenges just mentioned is the focus of this chapter, starting next with a discussion of causal loops.

CAUSAL LOOPS

Causal loops in nonrecursive models represent the hypothesis of feedback effects or reciprocal causation among ≥ 2 endogenous variables. In a direct feedback loop, for example, two variables Y1 and Y2 are specified as causes and effects of each other, or Y1 ⇄ Y2. An indirect feedback loop involves ≥ 3 variables connected by direct effects that eventually lead back to earlier variables. In model diagrams, an indirect feedback loop among variables Y1–Y3 would be represented as a "triangle" with direct effects that connect them in the order specified by the researcher. An example shown with compact symbolism for disturbances is presented next:

Y1 → Y2 → Y3 → Y1

Because each variable in the feedback loop just illustrated is involved in an indirect effect, such as Y2 in the indirect pathway Y1 → Y2 → Y3, feedback is indirect.

You should know that feedback loops in nonrecursive models are estimated with data from cross-sectional designs, not longitudinal designs. That is, variables in a causal loop are measured at the same


occasion. In contrast, reciprocal effects in longitudinal designs are estimated by (1) measuring variables hypothesized to mutually influence each other at ≥ 2 different points in time and (2) specifying cross-lag direct effects between those variables over time. For example, a causal loop between Y1 and Y2 based on a cross-sectional design is represented as a nonparametric model without disturbances or other variables in Figure 19.1(a). In the panel model of Figure 19.1(b), the same two variables are each measured in a longitudinal design at two different points in time, where the second characters in the double subscripts represent the measurement occasion (e.g., Y12 is variable Y1 measured at the second time, and so on). The two cross-lag direct effects are

Y11 → Y22 and Y21 → Y12

which represent, respectively, prediction of Y1 and Y2 each from prior values of the other variable. In contrast, the autoregressive paths in Figure 19.1(b), or

Y11 → Y12 and Y21 → Y22

represent the prediction of each variable from prior values of the same variable. Cross-lag and autoregressive effects are each estimated controlling for the other. For example, the coefficient for the cross-lag effect of Y2 at time 1 on Y1 at time 2 in Figure 19.1(b) is estimated controlling for the autoregressive effect of Y1 on itself over time, and vice versa. A panel model may be recursive or nonrecursive depending on its pattern of disturbances—see Little (2013, chap. 6) or Newsom (2015, chap. 5) for examples.

FIGURE 19.1. Reciprocal causal effects between endogenous variables Y1 and Y2 represented in nonparametric models with a direct feedback loop based on a cross-sectional design (a) and cross-lag direct effects based on a longitudinal design (panel model) (b) shown without disturbances or other variables.

Panel models for longitudinal data offer potential advantages over models with feedback loops for cross-sectional data. One is temporal precedence, or the explicit representation of a causal lag that corresponds to the measurement occasions. Specifying the correct lag is critical in longitudinal designs: If the interval between measurements is shorter than the true causal lag, then there is insufficient time for a cause to affect the outcome. Conversely, that interval should not be so long that causal effects have already dissipated. An option when the true causal lag is unknown or cannot be reasonably approximated is multiphase longitudinal designs, where different measurement occasions, some shorter but others longer, are specified for different subsets of participants (Taris & Kompier, 2014). For example, the variables Y11 and Y22 in Figure 19.1(b) could be measured 2, 3, or 4 months apart for three different subgroups, but the specific lags should be appropriate to the variables under study. There is no guarantee that any of the multiple lags in a multiphase study is actually correct.

But the absence of temporal precedence is not always a liability when estimating reciprocal causation. Finkel (1995) argued that the lag for some causal effects is so short that it would be impractical to measure them over time. Examples are reciprocal effects of the moods of spouses on each other. Although the causal lags in this example are not zero, they may be so short as to be virtually synchronous. If so, then representing such effects as direct feedback loops in nonrecursive models that are analyzed with cross-sectional data would be more defensible. Indeed, if causal lags are very short, it could be more appropriate to estimate such effects in cross-sectional designs even when panel data are available (Wong & Law, 1999). Thus, longitudinal data collected according to a specific measurement schedule are not automatically superior to cross-sectional data when true causal lags are very short.

Disturbances of variables that make up a feedback loop are often assumed to covary. This specification makes sense because if the variables mutually cause each other, then it seems plausible that they may share unmeasured causes. Some of the error in predicting one variable in a feedback loop, such as Y1 in Figure 19.1(a), may be due to the other variable in that loop, or Y2. The presence of disturbance covariances in nonrecursive models can also assist in the evaluation of whether such


models are identified. In statistical examples by Wong and Law (1999), estimating disturbance covariances improved the chances of detecting true unidirectional effects between variables specified as mutually causing each other in direct feedback loops. As for any kind of model specification, the inclusion of correlated errors for endogenous variables in feedback loops should have substantive justification. The reason is to avoid specifying correlated disturbances for structural models or correlated errors for measurement models simply to improve model fit.

ASSUMPTIONS OF CAUSAL LOOPS

Data from cross-sectional designs give only a snapshot, or an isolated observation, of an ongoing dynamic process. Therefore, estimation of reciprocal effects in a feedback loop with cross-sectional data requires the assumption of equilibrium, or any changes in the system underlying a presumed feedback relation have already manifested their effects and the system is in a steady state. That is, the values of the estimates of the direct effects in a causal loop do not depend on the specific time point when data are collected. Kaplan et al. (2001) reminded us that there is generally no statistical way to directly evaluate the equilibrium assumption; instead, it must be substantively argued. But this assumption is not often acknowledged in studies where feedback effects are estimated in cross-sectional designs. This is unfortunate because the results of computer simulation studies by Kaplan et al. (2001) indicated that violation of the equilibrium assumption can lead to severely biased estimates. Another assumption is that of stationarity, or the requirement that the basic causal structure does not change over time. Both assumptions just described, equilibrium and stationarity, are very demanding (i.e., probably unrealistic).

IDENTIFICATION REQUIREMENTS

It is more difficult to establish whether a nonrecursive structural model is identified compared with recursive models. Fortunately, there are some straightforward ways to determine whether some, but not all, types of nonrecursive models are identified. Some of these methods are graphical and are relatively easy to apply in smaller models, but they are prone to error when dealing with larger models with many variables. Recall that there are also empirical checks for whether a structural equation model is identified (Chapter 14), but they are not foolproof (i.e., passing an empirical check does not prove identification). The general principles and practical suggestions for dealing with identification for nonrecursive models are summarized next:

1. Endogenous variables in causal loops require instrumental variables depending on their patterns of disturbance covariances.

2. Adding exogenous variables is one way to remedy an identification problem of a nonrecursive model. But this can typically be done only before the data are collected. Thus, it is critical to evaluate a nonrecursive structural model for identification right after it is specified and before the data are collected.

3. A common beginner's mistake is to specify a complex nonrecursive model of ambiguous identification status and then attempt to analyze it. If the analysis fails (likely), it may not be clear what caused the problem. Begin instead with a simpler model that is a subset of the target model and for which the application of heuristics can prove identification. If the analysis fails, the problem is not identification; otherwise, add parameters to the simpler model one at a time. If the analysis fails after adding a particular effect, try a different order. If these analyses also fail at the same point, then adding the corresponding parameter may cause underidentification. If no combination of adding effects to a basic identified model gets you to the target, think about how to respecify the original model in order to identify it, yet still respect your hypotheses.

Graphical Identification Rules

Graphical methods and special computer tools for determining whether specific causal effects in directed acyclic graphs are identified (Chapter 6) can be applied to recursive structural models (Chapter 8). But extensions of these methods to directed cyclic graphs with causal loops (i.e., nonrecursive models) are not as well developed and can be challenging to use—see Wang et al. (2016). Rigdon (1995) devised a set of necessary and sufficient graphical rules for evaluating the identification status of nonrecursive models where the endogenous variables can be partitioned into recursively related blocks. This block classification method


requires no data and thus can be applied when the model is specified. It is also useful for structural models with observed variables only or with proxies for conceptual variables. The method is not automated (at least to my knowledge), so it is applied by hand to model diagrams, which can be more challenging for bigger models.

In Rigdon's (1995) method, each block in the model diagram contains either one or two endogenous variables. Blocks with two variables are reserved for those involved in direct feedback loops or with disturbance covariances in a bow pattern (i.e., there is a single direct effect between that variable pair). Blocks with a single endogenous variable have strictly recursive relations to all other variables in the model. Exogenous variables are ignored when partitioning the endogenous variables. Any single-variable block is identified because it is recursive. Any two-variable block falls into one of eight patterns described by Rigdon (1995, p. 370), some of which correspond to identified blocks, but other patterns describe blocks that are not identified.

Shown in Figure 19.2 are the abstractions from Rigdon's (1995) types that represent minimum required specifications for identifying a two-variable block, Y1 and Y2, with instruments that could be either exogenous or endogenous variables. All other specifications beyond these minimum requirements are irrelevant (i.e., they have no bearing on whether the two-variable block is identified or not identified). All other variables in the model can be ignored, including the ones with direct effects on both Y1 and Y2. This is because such variables cannot be instruments (Chapter 6). Symbols for individual observed variables (squares) are represented in Figure 19.2, but the same requirements apply to nonrecursive structural models with common factors or composites.

FIGURE 19.2. Minimum required specifications for identifying blocks of two endogenous variables, Y1 and Y2, based on Rigdon's (1995) graphical block classification method: (a) bow pattern; (b) direct feedback loop, no disturbance covariance; (c) direct feedback loop, with disturbance covariance. Unlabeled variables are instruments; xor, exclusive or (an instrument for either Y1 or Y2, but not both).

Depicted in Figure 19.2(a) is a bow-pattern disturbance covariance. This two-variable block is identified if the causal variable Y1 has an instrument. This specification also provides a way to identify the direct effect of a putative causal variable on an outcome where the two variables share a disturbance covariance, which is a type of endogeneity (e.g., Figure 6.1(f)). Shown in Figures 19.2(b) and 19.2(c) are blocks of two endogenous variables that share a direct feedback loop, where it is assumed that both direct effects that make up the loop are free parameters. Figure 19.2(b), with a direct feedback loop between Y1 and Y2, is identified if (1) there is one instrument (either Z1 or Z2, but not both) and (2) there is no disturbance covariance (it is fixed to equal zero). But if the disturbance covariance is also a free parameter, then separate instruments are needed for both endogenous variables, such as Z1 for Y1 and Z2 for Y2 as represented in Figure 19.2(c).

Let's try some examples. We still assume that all direct effects in feedback loops, either direct or indirect, are free parameters and that all disturbance covariances are free parameters, too. Represented in Figure 19.3(a) is the smallest identified model with all possible disturbance covariances (1 in total) for estimating direct feedback. The sole block of endogenous variables is the two-variable block Y1 ⇄ Y2. The instrument for Y1 is X1, and the instrument for Y2 is X2. With separate instruments
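The identification requirement in Figure 19.2(c) can be made concrete with a small simulation. The sketch below (not from this book) generates data for a direct feedback loop Y1 ⇄ Y2 with a disturbance covariance and one instrument per endogenous variable, using the reduced form Y = (I − B)⁻¹(ΓX + e), and then fits the model in lavaan; all names and population values are arbitrary.

library(lavaan)
set.seed(1)

n <- 1000
B <- matrix(c(0,  .3,                      # .3: effect of Y2 on Y1
              .4, 0), 2, 2, byrow = TRUE)  # .4: effect of Y1 on Y2
G <- diag(c(.5, .6))                       # X1 -> Y1 only, X2 -> Y2 only
X <- cbind(rnorm(n), rnorm(n))
E <- MASS::mvrnorm(n, c(0, 0),             # correlated disturbances
                   matrix(c(1, .2, .2, 1), 2, 2))
Y <- t(solve(diag(2) - B) %*% (G %*% t(X) + t(E)))  # reduced form
d <- data.frame(y1 = Y[, 1], y2 = Y[, 2], x1 = X[, 1], x2 = X[, 2])

model <- '
  y1 ~ y2 + x1   # x1 is the instrument for y1
  y2 ~ y1 + x2   # x2 is the instrument for y2
  y1 ~~ y2       # free disturbance covariance
'
fit <- sem(model, data = d)
coef(fit)        # estimates should be near .3, .5, .4, .6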


for each of Y1 and Y2, the model is identified (see Figure 19.2(c)). Figure 19.3(b) represents the smallest identified model with all possible disturbance correlations (3 in total) for estimating indirect feedback among three endogenous variables, Y1–Y3. All direct effects between each pair are unidirectional, such as Y1 → Y2, but they connect Y1–Y3 in a nonrecursive (circular) way. The endogenous variables in each block share a disturbance covariance, which matches the pattern in Figure 19.2(a). The instruments for each two-variable block of endogenous variables in Figure 19.3(b) are listed next:

1. X1 is the instrument for Y1 in the direct effect Y1 → Y2;
2. X2 is the instrument for Y2 in the direct effect Y2 → Y3; and
3. X3 is the instrument for Y3 in the direct effect Y3 → Y1.

Presented in Figure 19.3(c) is the smallest identified block recursive model with a pair of direct feedback loops for two different sets of endogenous variables, Y1 and Y2 in the first loop from the left and Y3 and Y4 in the second loop, and all possible disturbance covariances in each block (i.e., 1), or 2 in total for the whole model. Direct effects within each block of two endogenous variables are bidirectional, such as Y1 ⇄ Y2 in the first loop, but effects between the blocks are recursive, including the direct effects

Y1 → Y3 and Y2 → Y4

with independent disturbances across the two blocks. Block recursive models are considered nonrecursive despite their name. In the figure, the instruments for Y1 and Y2 in the first block of endogenous variables are, respectively, the exogenous variables X1 and X2. The instruments for Y3 and Y4 in the second block are, respectively, the endogenous variables Y1 and Y2.

FIGURE 19.3. Basic identified nonrecursive models with all possible disturbance covariances for a single direct feedback loop (a) and for indirect feedback loops with three variables (b). A block recursive model with two recursively related blocks of direct feedback loops and all possible disturbance covariances within each block (c). The whole model is nonrecursive.

For all examples in Figure 19.3, the instruments are defined by the model's specifications; that is, they are model-implied instruments. Specifically, the coefficient


for the direct effect of the instrument on the outcome variable in a pair of endogenous variables in Rigdon's (1995) block classification method is fixed to zero. The model must also imply no correlation between an instrument and the disturbance of the outcome variable for the same pair (Rule 6.5). In Figure 19.3(a), for instance, variable X1 has a direct effect on Y1 but has no direct effect on Y2, and there is no model-implied association between X1 and the disturbance of Y2. Therefore, variable X1 is a proper instrument for Y1 in its direct feedback relation with Y2 (i.e., Y1 ⇄ Y2). Kenny (2012) reminded us that the zero path between the instrument and the outcome in a nonrecursive block of two variables should be established by theory, not statistical analysis. For example, do not regress Y2 on X1 and X2 in Figure 19.3(a) and select the instrument by seeing which variables have coefficients that are not statistically significant.

Even if nonrecursive models like the ones in Figure 19.3 are identified, their analysis can still be foiled by empirical underidentification. This can occur when the values of coefficients for key paths in the model are close to zero, which effectively drops them out of the model and renders it practically underidentified. Suppose that the coefficient for the path X2 → Y2 in Figure 19.3(c) is about zero. The virtual absence of this path means that Y2 has no instrument, which violates the requirement that each endogenous variable in a direct feedback loop with a disturbance covariance has its own instrument (Figure 19.2(c)). Extreme multicollinearity can also lead to empirical underidentification. For example, if the absolute correlation between exogenous variables X1 and X2 in Figure 19.3(c) is nearly 1.0, they are basically the same variable, which means that the model's requirement for two instruments is virtually failed.

RESPECIFICATION OF NONRECURSIVE MODELS THAT ARE NOT IDENTIFIED

Suppose that Figure 19.4(a) is specified to faithfully reflect the hypotheses of a particular theory but, unfortunately, the model is not identified. Applying Rigdon's (1995) method, we can spot the problem right away: There is a pair of endogenous variables, Y1 and Y2, in a direct feedback loop, their disturbance covariance is a free parameter, but exogenous variable X1 has direct effects on both Y1 and Y2, so there is no instrument for either endogenous variable. Thus, the parameters of the causal loop in the figure—the 2 direct effects, 2 disturbance variances, and 1 disturbance covariance—are underidentified. There is no problem with the single-variable block that includes the third endogenous variable, Y3, which is recursively related to all prior variables in the model.

What does the researcher do now? There are basically two options: The first is to add to Figure 19.4(a) two new exogenous variables, X2 and X3, that serve as instruments for Y1 and Y2, the two endogenous variables in the causal loop. For example, specifying that X2 has a direct effect on Y1 but not Y2 generates a model-implied instrument for Y1, but there should be a substantive rationale for this specification. Through similar logic, variable X3 is a model-implied instrument for Y2, if X3 has a direct effect on Y2 but no direct effect on Y1. If justified by theory, the three exogenous variables in a modified version of the figure could be assumed to covary. All the respecifications just described preserve the original direct effects in Figure 19.4(a), including the hypothesis that variable X1 is a common cause of both Y1 and Y2.

If the data are already collected, it may be too late to add exogenous variables to Figure 19.4(a). If so, then the only remaining option is to impose constraints on the original model so that the respecified model is identified. For example, constraining both the disturbance covariance between Y1 and Y2 and a single direct effect of X1 to equal zero in the figure generates an identified model. This is because (1) a direct feedback loop with independent disturbances requires a single instrument (Figure 19.2(b)). Also, (2) fixing just a single direct effect of X1 to zero generates a model-implied instrument for the remaining endogenous variable. For example, the specification

X1 → Y2 = 0

in a modified version of Figure 19.4(a) with no disturbance covariance defines X1 as an instrument for Y1. The respecified model just described is identified, but the original hypotheses in the figure that X1 is a common cause of both Y1 and Y2 and that the pair of endogenous variables just mentioned share at least one common omitted cause are both lost.

Another way to simplify a model with a direct feedback loop is to constrain the coefficients for the two direct effects to equality, which reduces the number of free parameters by 1 because only a single estimate is required for the 2 direct effects. Recall that equality constraints generally apply in the unstandardized


solution only. This means that the two corresponding standardized coefficients could still be unequal. Any equality constraint should be justified by theory, not because it happens to generate a converged, admissible solution. Constraining reciprocal effects to equality assumes that the raw score metrics of the two endogenous variables are the same; otherwise, an equality constraint is harder to justify.

FIGURE 19.4. Nonrecursive models that are not identified (a) or identified (b).

ORDER CONDITION AND RANK CONDITION

Rigdon's (1995) block classification method is straightforward to apply in smaller models such as Figures 19.3 and 19.4(a). In larger models, though, it can be difficult to locate instruments through visual inspection. Two other identification heuristics for nonrecursive models, the order condition and the rank condition, do not require the researcher to find model-implied instruments. Instead, they assume nonrecursive models either with all possible disturbance covariances for the whole model, such as Figures 19.3(a) and 19.3(b), or with all possible disturbance covariances within each causal loop of block recursive models, such as Figure 19.3(c).

Order Condition

The order condition is a necessary but insufficient requirement. This means that if the order condition is failed, the model is not identified. But passing the order condition does not guarantee that the model is actually identified (Berry, 1984; Paxton et al., 2011). It is a counting rule applied to each endogenous variable involved in a causal loop. The order condition is evaluated by tallying the number of variables (except disturbances) that have direct effects on the target endogenous variable versus the number that do not. Let's call the latter excluded variables. The order condition is stated next:


RULE 19.1 The order condition requires that the number of excluded variables for each endogenous variable equals or exceeds the total number of endogenous variables minus 1.

For nonrecursive models with all possible disturbance correlations, the total number of endogenous variables equals that for the whole model. For example, there are three endogenous variables in Figure 19.3(b). This means that at least 3 – 1 = 2 variables must be excluded from the equation of each endogenous variable, which is true here: A total of 3 variables, X2, X3, and Y2, are excluded from the equation for endogenous variable Y1 (the included variables are X1 and Y3), which exceeds the minimum number for this model (2). Because the equations for endogenous variables Y2 and Y3 also meet the order condition—Exercise 1 asks you to verify this claim—the whole model passes the order condition.

For block recursive models, the total number of endogenous variables is counted separately for each block when the order condition is evaluated. In Figure 19.3(c), for instance, there are two blocks of endogenous variables in causal loops. Each block has 2 variables, so to satisfy the order condition, at least 2 – 1 = 1 variable must be excluded from the equation of each endogenous variable in both blocks, which is true here: 1 variable is excluded from each equation for Y1 and Y2 in the first block (e.g., X2 for Y1), and 3 variables are excluded from each equation in the second block (e.g., X1, X2, and Y2 for Y3). Thus, the whole block recursive model in the figure passes the order condition.

Rank Condition

The rank condition is defined in matrix algebra terms (Bollen, 1989, pp. 101–103; Paxton et al., 2011, chap. 3). It is the requirement that the determinant for the matrix of coefficients for variables excluded from the equation of each endogenous variable does not equal zero; otherwise, the matrix cannot be inverted, which means there is no unique solution for the coefficients (i.e., the equation for the target endogenous variable is underidentified). In order for this determinant to be nonzero, the rank of the coefficient matrix should equal or exceed the number of endogenous variables minus 1. The rank is the number of rows (equations) in the coefficient matrix that are linearly independent, which means that no equation is a simple weighted combination of the rest. Two linearly dependent equations have no unique solution; that is, they are underidentified (e.g., Equation 3.2). Thus, the rank condition requires a minimum number of independent rows in the coefficient matrix for each endogenous variable in a nonrecursive model. Unlike the order condition, the rank condition is both necessary and sufficient:

RULE 19.2 The rank condition requires that the rank of the coefficient matrix for each endogenous variable equals or exceeds the total number of endogenous variables minus 1; nonrecursive models that pass this condition are identified.

In models with all possible disturbance covariances, the total number of endogenous variables equals that for the whole model; otherwise, it is the number of endogenous variables in each block of a block recursive model. Berry (1984) devised a method for checking the rank condition that does not require detailed knowledge of matrix algebra (see also Paxton et al., 2011, chap. 3). It is described in Appendix 19.A for Figure 19.3(a). Exercise 2 asks you to use Berry's (1984) method to verify that Figure 19.4(a) is not identified. (This same model failed Rigdon's (1995) graphical block classification method.) Exercise 3 asks you to show that Figure 19.4(b) with no disturbance covariance is identified using the block classification method. For the same model, Exercise 4 asks you to prove that both the order condition and the rank condition are failed. This exercise demonstrates that the order condition is no longer necessary and that the rank condition is no longer sufficient for models without all possible disturbance covariances. Exercise 5 asks you how to respecify Figure 19.4(b) to test the hypothesis that the disturbance covariance is zero.

DETAILED EXAMPLE FOR A NONRECURSIVE PARTIAL SR MODEL

Robbins (2012) collected data on N = 64 countries to test the hypothesis that generalized trust (i.e., the disposition to trust strangers) and institutional quality (i.e., the perception that the state is incorruptible, offers effective legal rights, and protects civil liberties) reciprocally affect each other. In other words, the view that the state is basically fair and universalistic encourages generalized trust among citizenry, which in turn boosts the sense of institutional quality. Higher levels of both generalized trust and institutional quality are associated with greater voter turnout, more civic engagement


by citizens, better economic performance, and less violence (Robbins, 2012; Perry, 2021). The hypotheses just summarized correspond to a virtuous cycle, where desirable events reinforce each other in an ongoing pattern (Fan et al., 2016). In contrast, undesirable events compound each other in a vicious cycle, which leads to escalating detrimental outcomes unless the cycle is interrupted (e.g., melting glaciers reflect less light, which further warms the atmosphere). In a neutral cycle, the outcome of feedback is neither beneficial nor harmful.

The rate of respondents in 64 countries who endorsed the statement "most people can be trusted" in the World Values Survey is the single indicator for generalized trust.1 Robbins (2012) used three measures of institutional quality, legal property rights, rule of law, and corruption, where higher scores indicate, respectively, stronger property rights, greater civil liberties protection, and lower risks of corruption. The property rights indicator is from the Economic Freedom of the World project of the Fraser Institute,2 and the rule of law and corruption indicators are from the Governance Matters III data set of the World Bank Group.3

The instrument for generalized trust is monarchical status, which is represented as a dichotomy coded as "1" for a constitutional or absolute monarchy and coded as "0" for otherwise. The rationale is that monarchies can facilitate generalized trust by projecting a sense of national identity, collective unity, or social stability. The instrument for institutional quality is information technology, which is measured by two World Bank Group indicators: internet bandwidth per person and the number of internet users per 100 people, both measured in natural log units.4 Robbins (2012) hypothesized that information technology as an exogenous force can bolster institutional quality by making states better able to secure the exchange of commodities or enforce laws on a wider scale (assuming such developments are not put to malignant uses). Robbins (2012) hypothesized that (1) a monarchy affects institutional quality only indirectly through its impact on trust, and (2) information technology affects trust only by first directly affecting institutional quality. Results of instrument falsification tests conducted by Robbins (2012, pp. 253–254) are consistent with these hypotheses.

Summarized in Table 19.1 are the input data for indicators of trust, institutional quality, monarchy, and information technology for N = 64 countries. Because the legal property rights and rule of law variables are extremely collinear (r = .933), I excluded the rule of law indicator from this analysis. Listed in Table 19.2 as analysis 1 is the lavaan syntax file that fits the nonrecursive model in Figure 19.5 to the data in Table 19.1. The model has a direct feedback loop between a measured variable, trust, and a common factor, institutional quality. Their disturbances are assumed to covary. The instrument for trust, monarchy, is a measured variable, but the instrument for institutional quality, information technology, is a common factor. The structural part of Figure 19.5 corresponds to Figure 19.2(c) in Rigdon's (1995) block classification method, so it is identified. You should verify that the structural component of Figure 19.5 also passes both the order condition and the rank condition. The measurement part of the model is also identified (Rule 14.1). Exercise 6 asks you to verify that dfM = 5.

The analysis in lavaan with default ML converged to an admissible solution. Values of selected global fit statistics are listed next:

χ²ML(5) = 5.896, p = .316
CRMR = .024, exact-fit test p = .500
RMSEA = .053, 90% CI [0, .188]
CFI = .996; SRMR = .020

The model passes the chi-square test, but its power for N = 64 is < .10 in the MacCallum–RMSEA method. The exact-fit test based on the correlation residuals with the CRMR as the test statistic is also passed but, again, power would be low. The upper bound of the RMSEA is unfavorable (.188), but this result is not surprising in a small sample (N = 64).

The computer was unable to calculate the whole matrix of standardized residuals for the covariance residuals in this analysis. The correlation residuals are reported in the top part of Table 19.3, and their corresponding test statistics (normal deviates, z) are listed in the bottom part of the table (Maydeu-Olivares et al., 2018), where values in boldface are significant at the .05 level. All absolute correlation residuals are < .10. Two are significant at the .05 level—.074 (monarchy, bandwidth) and –.045 (monarchy, internet)—but

1 https://www.worldvaluessurvey.org/wvs.jsp
2 https://www.fraserinstitute.org/studies/economic-freedom
3 https://openknowledge.worldbank.org/handle/10986/17136
4 https://data.worldbank.org/indicator


TABLE 19.1. Input Data (Correlations, Standard Deviations) for Analysis of Nonrecursive Model of Institutional Quality and Generalized Trust as a Function of Monarchical Status and Information Technology
Indicator 1 2 3 4 5 6 7
Institutional quality
1. Rule of law —
2. Legal property rights .933 —
3. Corruption .767 .738 —

Information technology
4. Bandwidth .760 .737 .614 —
5. Internet users .820 .760 .664 .867 —

Single indicators
6. Generalized trust .454 .466 .372 .368 .375 —
7. Monarchy status .469 .480 .380 .457 .351 .498 —

SD 1.008 1.931 1.424 2.585 1.569 .138 .394

Note. Input data are from Robbins (2012), N = 64 countries.
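For readers who want to reproduce the analysis from the summary data, the correlations and standard deviations in Table 19.1 can be assembled into the covariance matrix that lavaan analyzes, dropping the rule-of-law row and column as described in the text. A sketch with assumed variable names:

library(lavaan)

vars <- c("property", "corrupt", "bandwidth", "internet", "trust", "monarchy")
R <- lav_matrix_lower2full(c(
  1,
  .738, 1,
  .737, .614, 1,
  .760, .664, .867, 1,
  .466, .372, .368, .375, 1,
  .480, .380, .457, .351, .498, 1))
dimnames(R) <- list(vars, vars)
sds <- c(1.931, 1.424, 2.585, 1.569, .138, .394)

S <- cor2cov(R, sds)   # covariance matrix for the sample.cov argument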

neither result suggests gross discrepancies between model and data. Given all results about fit just summarized, the model is retained.

Interpretation of standardized parameter estimates is emphasized next. Estimates for the common factors in Figure 19.5 are reported in Table 19.4. The average proportions of explained variance (i.e., AVE) for indicators of the information technology and institutional quality factors are, respectively, .867 and .748, which suggest reasonable convergent validity within each set of indicators. Estimates for the structural model are presented in Table 19.5. The level of information technology covaries positively with monarchical status (i.e., monarchy instead of other types of governments). For every increase in monarchical status of a full standard deviation, the level of generalized trust is expected to increase by .35 standard deviations, controlling for institutional quality.5 The perception of institutional quality is predicted to increase by about .69 standard deviations, given an increase in information technology of a full standard deviation while controlling for trust. Estimates of direct effects for the two variables in the

5 In the unstandardized solution, the coefficient for the direct effect of monarchy on generalized trust is .123. Because monarchy is coded as 1 = absolute/constitutional monarchy versus 0 = otherwise, the level of trust in countries with monarchies is .123 points higher in raw score units than in countries with other types of government, controlling for institutional quality.

TABLE 19.2. Analyses, Script and Output Files, and Packages in R for a Nonrecursive Model of Institutional Quality and Generalized Trust as a Function of Information Technology and Monarchical Status

Analysis                                                      Script file              R package
1. Partial structural regression model with a causal loop    robbins-causal-loop.r    lavaan, semTools
   analyzed with data from N = 64 countries
2. Computation of blocked-error R2's for endogenous          robbins-r-squared.r      lavaan
   variables (trust, quality) in a causal loop

Note. Output files have the same names except the extension is ".out."
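The actual analysis 1 script is robbins-causal-loop.r (Table 19.2); its model syntax could look roughly like the following sketch, which continues the variable names assumed above and is not the book's own code:

model <- '
  # measurement part
  technology =~ bandwidth + internet
  quality    =~ property + corrupt

  # structural part: direct feedback loop
  trust   ~ quality + monarchy     # monarchy: instrument for trust
  quality ~ trust + technology     # technology: instrument for quality

  # disturbance covariance for the two variables in the loop
  trust ~~ quality
'
fit <- sem(model, sample.cov = S, sample.nobs = 64,
           fixed.x = FALSE)   # estimate exogenous variances and covariance
summary(fit, fit.measures = TRUE, standardized = TRUE)

Counting the free parameters of this specification against the 21 observed variances and covariances gives dfM = 5, consistent with the text.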


[Figure 19.5 diagram: Monarchy → Trust, Trust ⇄ Institutional Quality, and Information Technology → Institutional Quality; indicators Bandwidth and Internet for the information technology factor, and Property Rights and Corruption for the institutional quality factor]

FIGURE 19.5. Nonrecursive model of institutional quality and generalized trust as a function of information technology and monarchical status estimated with data for N = 64 countries.

TABLE 19.3. Correlation Residuals and Test Statistics for a Nonrecursive Model of Institutional Quality and Generalized Trust as a Function of Information Technology and Monarchical Status
Variable 1 2 3 4 5 6
Correlation residuals
Legal property rights 0
Corruption 0 0
Bandwidth –.001 –.010 0
Internet users –.003 .018 0 0
Generalized trust .005 –.018 .003 –.002 0
Monarchy status .006 –.021 .074 –.045 0 0

Test statistics (z)


Legal property rights 0
Corruption 0 0
Bandwidth –.091 –.394 0
Internet users –.456 .775 0 0
Generalized trust .317 –.317 .099 –.099 0
Monarchy status .377 –.377 2.083 –2.083 0 0
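Correlation residuals and their normal-deviate test statistics like those in Table 19.3 can be requested directly in lavaan; a one-line sketch, where fit is the fitted analysis 1 model:

# correlation residuals, standardized (z) residuals, and summary statistics
lavResiduals(fit, type = "cor")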


TABLE 19.4. Maximum Likelihood Estimates for Common Factors in a Nonrecursive Model of Institutional Quality and Generalized Trust as a Function of Information Technology and Monarchical Status
Unstandardized Standardized
Parameter Estimate SE Estimate SE
Factor loadings
Information technology
Bandwidth 1.000 — .915 .031
Internet users .628 .053 .947 .028

Institutional quality
Legal property rights 1.000 — .934 .038
Corruption .624 .078 .790 .055

Error variances
Bandwidth 1.065 .345 .162 .057
Internet users .250 .122 .103 .053
Legal property rights .469 .249 .128 .071
Corruption .750 .161 .376 .087

Note. Standardized estimates of error variances are proportions of unexplained variance.
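The AVE values cited in the text are simply the mean squared standardized loadings per factor, which is easy to verify from Table 19.4 (the semTools package also offers an AVE() function that computes this from a fitted model):

mean(c(.915, .947)^2)   # information technology factor: about .867
mean(c(.934, .790)^2)   # institutional quality factor:  about .748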

TABLE 19.5. Maximum Likelihood Estimates for the Structural Part of a Nonrecursive Model of Institutional Quality and Generalized Trust as a Function of Information Technology and Monarchical Status
Unstandardized Standardized
Parameter Estimate SE Estimate SE
Direct effects
Monarchy → Trust .123 .045 .350 .124
Technology → Quality .523 .101 .687 .120
Trust → Quality 5.772 3.003 .442 .225
Quality → Trust .022 .012 .292 .148

Disturbance variances and covariance


Trust .013 .002 .675 .099
Quality .920 .459 .287 .013
Trust ↔ Quality –.068 .046 –.629 .288

Exogenous variances and covariance


Technology 5.513 1.183 1.000 —
Monarchy .153 .027 1.000 —
Technology ↔ Monarchy .384 .130 .419

Note. Standardized estimates of disturbance variances are proportions of unexplained variance.


direct feedback loop of Figure 19.5 are shown in boldface in Table 19.5. For every increase in trust of 1 standard deviation, the perception of institutional quality is expected to increase by about .44 standard deviations, and the level of trust is expected to increase by about .29 standard deviations for every increase in quality of 1 standard deviation. As expected, the directions of both effects are positive, and the relative impact of trust on quality in standardized form is about 1.5 times stronger than the effect in the other direction (.44/.29 = 1.52). The disturbance correlation for trust and quality, or about –.63 (see the table), says that omitted common causes are expected to increase one variable while decreasing the other, controlling for their reciprocal effects and those of monarchy and technology. It is possible to calculate R2 values for each of the endogenous variables in the structural model, trust and quality, but this statistic is not appropriate for nonrecursive models with causal loops for reasons explained in the next section. Topic Box 19.1 deals with effect decomposition in nonrecursive models and the assumption of equilibrium.

TOPIC BOX 19.1

Effect Decomposition in Nonrecursive Models


and the Equilibrium Assumption
Wright’s tracing rules (Chapter 9) do not apply to models with causal loops. Variables in feedback loops
have indirect effects—­and thus total effects—­on themselves, which is apparent in effect compositions
generated by SEM computer tools for nonrecursive models. Consider the direct feedback loop Y1  Y2.
Suppose that the standardized direct effect of Y1 on Y2 is .40 and that the effect in the other direction is
.20. An indirect effect of Y1 on itself corresponds to the sequence

Y1 → Y2 →Y1

which is estimated as .40(.20), or .08. There are additional indirect effects of Y1 on itself through Y2,
however, because cycles of mutual influence in feedback loops are theoretically infinite. The indirect effect

Y1 → Y2 →Y1 → Y2 →Y1

is one of these, and its estimate is .40(.20).40(.20), or .0064. Mathematically, these terms head pretty
quickly to zero, but the total effect of Y1 on itself estimates all possible cycles through Y2. Indirect and total
effects of Y2 on itself are similarly derived.
Calculation of direct, indirect, and total effects between variables in causal loops as just described
assumes equilibrium. Recall that there is no statistical test of whether the equilibrium assumption is tenable
when the data are cross sectional. In computer simulations, Kaplan et al. (2001) found that the stabil‑
ity index, which is printed in the output of some SEM computer tools such as Amos, did not accurately
measure the degree of bias due to lack of equilibrium. It is based on certain mathematical properties of the
matrix of coefficients for direct effects among all endogenous variables in a structural model, not just those
involved in feedback loops. These properties have to do with whether estimates of absolute direct effects
would become infinitely larger over time (i.e., with infinite cycles). If so, the system is said to “explode”
because it may never reach equilibrium, given the observed direct effects among the endogenous vari-
ables. The mathematics of the stability index are complex (Kaplan et al., 2001, pp. 317–322). A standard
interpretation of this index is that values < 1.0 are taken as evidence for equilibrium, but values > 1.0
suggest the lack of equilibrium. But this interpretation is not supported by Kaplan et al.’s (2001) simulation
results, which emphasizes the need to evaluate equilibrium on rational grounds.
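A numeric sketch of the series described in Topic Box 19.1, using the same standardized direct effects (.40 and .20): the total effect of Y1 on itself sums all cycles through Y2 and converges to a closed-form value.

b21 <- .40                      # Y1 -> Y2
b12 <- .20                      # Y2 -> Y1
terms <- (b21 * b12)^(1:10)     # .08, .0064, ... (terms shrink quickly)
sum(terms)                      # about .0870 after 10 cycles
(b21 * b12) / (1 - b21 * b12)   # geometric series limit: .08/.92 = .0870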


At first glance, the representation in Figure 19.5 of single indicators for monarchical status and generalized trust without separate measurement error terms is not ideal. But these variables concern countries, not cases (i.e., people), and the data were taken from publicly accessible sources based on historical and economic records for multiple nations. For example, the monarchy variable in the model corresponds to the distinction between monarchies versus other forms of government, which is arguably straightforward to distinguish. In sensitivity analyses, Robbins (2012) reported that specifying levels of measurement error in monarchy and trust up to .02 (i.e., 2% of the total variance is due to random error) did not appreciably change the results compared with the original analyses with no separate terms for measurement error in these variables.

BLOCKED-ERROR R2 FOR NONRECURSIVE MODELS

For continuous endogenous variables in recursive models, the sample proportion of explained variance, or the squared multiple correlation

R2 = 1.0 − (Disturbance variance / Model-implied variance)    (19.1)

effectively blocks the effect of the corresponding disturbance term by (1) removing its influence from the total (model-implied) variance. It is also assumed that (2) the disturbance variance is independent of all other causes of the same outcome represented in the model. This is the standard assumption in ordinary least squares regression that all unmeasured predictors are uncorrelated with the predictors in the equation. Under this assumption, residuals in regression analysis are computed such that they are independent of the predictors. But for continuous endogenous variables involved in nonrecursive relations, such as causal loops, the R2 statistic is problematic. This is because the disturbances of such variables are probably correlated with at least one of their presumed causes, which violates the least squares requirement that the residuals (disturbances) are uncorrelated with all predictors (causal variables).

Hayduk (2006) described the blocked-error R2 (beR2) for endogenous variables involved in nonrecursive relations. It is calculated by blocking the influence of the disturbance (error) of just the endogenous variable in question. An advantage of beR2 is that it equals the value of R2 for endogenous variables in recursive models or involved in recursive relations only in a model that is otherwise nonrecursive, such as Y3 in Figure 19.4(a). The beR2 statistic is automatically computed by LISREL for nonrecursive models, which is convenient. Hayduk (2006) outlined the method summarized next for calculating beR2 using any SEM computer tool that prints the model-implied (predicted) covariance matrix when all parameters are fixed to equal user-specified values:

1. Estimate the full model as usual and specify that the predicted covariance matrix for all variables, including common factors, is printed in the output. Record the values of the predicted variances for all endogenous variables involved in nonrecursive relations.

2. Specify a blocked-error model such that values of model parameters are fixed to equal their counterparts in the full model (i.e., at Step 1) except for the focal endogenous variable's disturbance variance and disturbance covariance(s), which are all fixed to equal zero. The specification just mentioned blocks all effects of causal variables on the disturbance for the focal endogenous variable. Because all parameters are fixed to equal constants, no model fit statistics may be printed in the output. Generate the predicted covariance matrix, and record the predicted variance for the focal endogenous variable.

3. Calculate beR2 as

beR2 = Model-implied variance (blocked error) / Model-implied variance (full)    (19.2)

which is the ratio of the model variance in the blocked-error model (Step 2) over the corresponding value for the same endogenous variable in the full model (Step 1). Repeat Steps 2–3 for each endogenous variable involved in nonrecursive relations in the full model.

The statistic beR2 is interpreted as the proportion of variance in the focal endogenous variable explained through the actions of causal variables in the model except for the direct effect of the disturbance on that focal variable (Hayduk, 2006). Depending on the model and data, the value of beR2 can be either smaller or larger than that of R2 for the same endogenous variable.

Listed for analysis 2 in Table 19.2 are the script and
Pt4Kline5E.indd 344 3/22/2023 2:05:55 PM


Nonrecursive Models with Causal Loops 345

output files for computing in lavaan values of beR2 for each of the two variables, trust and institutional quality, involved in the causal loop of Figure 19.5. The lavaan program does not automatically calculate beR2, but it does print predicted covariance matrices, which provide the necessary values (i.e., model-implied variances) to generate beR2. Reported in Table 19.6 are the values of R2 and beR2 for each endogenous variable. Also listed in the table are values of disturbance or predicted variances in the full model and in both blocked-error models. For trust, beR2 = .320, which is somewhat lower than R2 = .325 for the same outcome. But for institutional quality, beR2 = .944 exceeds R2 = .713 for this endogenous variable. Two other corrected proportions of explained variance for endogenous variables in nonrecursive relations are described in Topic Box 19.2.

SUMMARY

Causal loops within structural models represent the hypothesis that sets of endogenous variables are causes and outcomes of one another; that is, causation is not simply unidirectional. Compared with recursive models, nonrecursive models are more challenging to analyze. One reason is identification: recursive models are always identified, but certain configurations of paths in nonrecursive models can render some of their parameters underidentified. Fortunately, there are ways to evaluate definitively the identification status of at least some kinds of nonrecursive models. Rigdon's (1995) graphical block classification method can be applied to determine when endogenous variables in nonrecursive relations, such as causal loops, require instruments, and the order condition and rank condition can be applied to nonrecursive models with all possible disturbance covariances or that are block recursive. The order condition is only necessary, but the rank condition is sufficient, which means that meeting it guarantees that the model is identified. Analyses of nonrecursive models are susceptible to empirical underidentification if values of coefficients for paths required for identification are close to zero or one. There are special measures of the proportion of explained variance, such as Hayduk's (2006) blocked-error R2, that remove effects of causal variables from the disturbances of endogenous variables involved in nonrecursive relations. In contrast, the ordinary least squares statistic R2 is for recursive models.

LEARN MORE

Fan et al. (2016) describe the application of SEM in ecological studies including the analysis of models with causal loops. Hayduk (2006) describes alternative corrected R2 effect sizes for outcomes in nonrecursive models, and Paxton et al. (2011) is a valuable resource for specifying and analyzing nonrecursive structural equation models.

Fan, Y., Chen, J., Shirkey, G., John, R., Wu, S. R., Park, H., & Shao, C. (2016). Applications of structural equation modeling (SEM) in ecological studies: An updated review. Ecological Processes, 5(1), Article 19.

Hayduk, L. A. (2006). Blocked-error-R2: A conceptually improved definition of the proportion of explained variance in models containing loops or correlated residuals. Quality & Quantity, 40(4), 629–649.

Paxton, P., Hipp, J. R., & Marquart-Pyatt, S. (2011). Nonrecursive models: Endogeneity, reciprocal relationships, and feedback loops. Sage.

TABLE 19.6. Standard (Ordinary Least Squares) and Blocked-Error Proportions of Explained Variance for Endogenous Variables in a Causal Loop

            Full model                                         Blocked-error models
Outcome     Disturbance variance   Model-implied variance   R2     Model-implied variance   beR2
Trust       .013                   .019                     .325   .006                     .320
Quality     .920                   3.202                    .713   3.022                    .944

Note. Values for R2 and beR2 are from computer output. There is noticeable rounding error in hand calculation at 3-decimal accuracy for the trust variable. beR2, blocked-error R2.

Pt4Kline5E.indd 345 3/22/2023 2:05:55 PM


346 Advanced Techniques

TOPIC BOX 19.2

Other Corrected Proportions of Explained Variance


for Nonrecursive Models
The Bentler– ­Raykov R2 (Bentler & Raykov, 2000) is a based on a respecification that partitions the
variance for target endogenous variables controlling for disturbances and causal variables. Their method
adds a phantom (artificial) variable with no indicators to the model in a way that partitions the variance
as just described, and the squared correlation between the target and phantom endogenous variables is
the corrected R2. Hayduk (2006) noted that the disturbance of the focal endogenous variable is actually
a common cause of both that outcome and the phantom variable in this method, so the Bentler–­Raykov R2
does not completely block the error. Values of the Bentler–­Raykov R2 are automatically printed in EQS for
endogenous variables in nonrecursive models.
Versions of LISREL since version 8.3 automatically print values of the reduced-­form R2 for each
endogenous variable. It counts as explained variance all direct or indirect effects of the exogenous vari-
ables on each focal endogenous variable, just as beR2 for the same outcome. Unlike beR2, though, the
reduced-­form R2 ignores the causal effects of all nonfocal endogenous variables except for indirect effects
of the exogenous variables through them on the focal endogenous variable. It does so by blocking the
effects of disturbances for all endogenous variables on any of their outcomes in the model, not just the dis-
turbances for endogenous variables involved in nonrecursive relations (Hayduk, 2006). In this way, the full
explanatory power of exogenous variables, the causes of which are unknown, is given full sway compared
with the explanatory power of endogenous variables, the causes of which are explicitly represented in the
model. A drawback is that if all direct causes for an endogenous variable are also endogenous, then few,
if any, of these effects may contribute to the value of the reduced-­form R2.

EXERCISES

1. Prove that Figure 19.3(b) satisfies the order condi- 4. Demonstrate that Figure 19.4(b) with no distur-
tion. bance covariance fails both the order condition and
the rank condition.
2. Show that Figure 19.4(a) fails the rank condition.
5. Respecify Figure 19.4(b) to test the hypothesis that
3. Use Rigdon’s (1995) block classification method to the disturbance covariance is zero.
prove that Figure 19.4(b) is identified.
6. Verify that df M = 5 for Figure 19.5.

Pt4Kline5E.indd 346 3/22/2023 2:05:56 PM


Nonrecursive Models with Causal Loops 347

Appendix 19.A 1 0 (II)


0 1
1 1
Evaluation of
the Rank Condition The third row can be formed by adding the corre-
sponding elements of the first and second rows, so it
should be deleted. Therefore, the rank of this matrix
Begin by constructing a system matrix, where the is (II) is 2, not 3.
endogenous variables are represented in the rows and
3. Repeat Steps 1 and 2 for every endogenous variable.
all variables are represented in the columns. In each
row, a “0” or “1” appears in the column that corresponds If the rank condition is satisfied for every endog-
to that row. A “1” indicates that the variable represented enous variable, then the model is identified.
by the column has a direct effect on the endogenous
variable represented by that row. A “1” also appears in Steps 1 and 2 applied to the system matrix for Figure
the column that corresponds to the endogenous vari- 19.3(a) are outlined here (III). Note that we are begin-
able represented by that row. The remaining entries are ning with Y1:
0s, and they indicate excluded variables. The system
matrix for Figure 19.3(a) is (I): X1 X2 Y1 Y2 (III)
Y1 1 0 1 1
X1 X2 Y1 Y2 (I)
Y1 1 0 1 1 Y2 0 1 1 1 → 1 → Rank = 1

Y2 0 1 0 1
For Step 1, all entries in the first row for Y1 are crossed
“Reading” this matrix for Y1 indicates three 1s in its out. Also crossed out are the two columns with a “1” in
row, one in the column for Y1 itself, and the others in the this row (i.e., those with column headings X1 and Y2, the
columns of variables that affect it, X1 and Y2. Because included variables in the equation for Y1.) The result-
X2 is excluded from Y1’s equation, the entry in this col- ing reduced matrix has a single row, so this reduced
umn is “0.” Entries in the row for Y2 are specified in a matrix cannot be further simplified. Thus, the rank of
similar way. the equation for Y1 is 1, which equals the minimum
The rank condition is applied to the equation of value, or one less than the total number of endogenous
each endogenous variable by working with the system variables in Figure 19.3(a). The rank condition is satis-
matrix. The steps for models with all possible distur- fied for Y1.
bance correlations are: The steps for the other endogenous variable in Figure
19.3(a) are summarized next. Evaluation for Y2 (IV):
1. Begin with first row of the matrix. Cross out all
entries in that row. Also cross out any column in the X1 X2 Y1 Y2 (IV)
system matrix with a “1” in that row. Variable labels
are not needed in the reduced matrix. Y1 1 0 1 1
2. Simplify the reduced matrix further by deleting any Y2 0 1 1 1 → 1 → Rank = 1
rows with entries that are all zeros. Also, delete any
row that is an exact duplicate of another or can be
reproduced by adding other rows (i.e., it is linearly The rank of the equation for Y2 is 1, which exactly
dependent on other rows). The number of remaining equals the minimum required value. Because the rank
rows is the rank. For example, consider the follow- condition is satisfied for both endogenous variables in
ing reduced matrix (II): Figure 19.3(a), we conclude that it is identified.

Pt4Kline5E.indd 347 3/22/2023 2:05:57 PM


348 Advanced Techniques

For a block recursive model like Figure 19.3(c), the second block, Y3 and Y4, are not included in the system
rank condition is evaluated for each of the two blocks matrix for variables in the first block, Y1 and Y2. The
of endogenous variables in direct feedback loops. For system matrix for the second block lists only Y3 and Y4
example, the system matrix for the block that includes in its rows, but all variables in the whole model are rep-
Y1 and Y2 lists only those variables and the directly resented in its columns. Next, apply the rank condition
prior variables X1 and X2. Endogenous variables in the to each system matrix of each block.

Pt4Kline5E.indd 348 3/22/2023 2:05:57 PM


20

Enhanced Mediation Analysis

This chapter extends the discussion presented in Topic Boxes 6.1, 7.1, and 8.1 about, respectively, the defi-
nition of mediation, the role of research design—that is, conceptual time-ordering of cause, mediator, and
outcome in cross-sectional designs versus temporal precedence in measurement among these variables in
longitudinal designs—and analysis steps. We begin with the mediation myth, or the false belief that
mediation is estimated in the typical mediation study (Kline, 2015; Pek & Hoyle, 2016), where

1. the design is cross-sectional;


2. there is little, if any, mention of the assumptions on which mediation is based, such as whether the
purported mediator is theoretically amendable to influence by the cause (e.g., the mediator is a
state, not a trait, variable);
3. finding statistically significant indirect effects is treated as proof of mediation;
4. the existence of equivalent models—some of which do not involve mediation at all—that explain the
data equally well is not acknowledged.

Thus, the typical mediation study offers a weak case for actual mediation (Trafimow, 2015), and this is true
despite the popularity of mediation analysis and the publication of thousands of SEM studies purportedly
about mediation (Nguyen et al., 2021).

The situation just described is unfortunate because and other areas. In clinical trials or observational stud-
the promise of mediation analysis is both intrigu- ies, mediation analysis can inform researchers about
ing and scientifically relevant: It addresses the ques- mechanisms through which interventions or risk fac-
tion about how changes are transmitted from a causal tors affect health outcomes (Lee et al., 2021).
variable through one or more intervening variables, But mediation analysis involves more—a lot more,
or mediators, which in turn lead to changes in an out- actually—than simply drawing one-way arrows
come (Little, 2013). Pearl (2014) described mediation between symbols that represent variables in model
analysis as telling us how nature works in the form of diagrams, collecting data, generating product estima-
direct or indirect effects, given a causal model based tors for hypothesized indirect causal pathways, and
on theory. MacKinnon et al. (2007) noted that ques- testing those estimators for statistical significance
tions about chains of relations through causal path- with relatively little regard to proper design, analysis,
ways made up of three or more variables form the basis or interpretation. There are also increasing numbers of
of many research problems in psychology, education, researchers who caution against unprincipled media-
commerce, public health, medicine, pharmacology, tion analysis in disciplines that include perinatal epi-

349

Pt4Kline5E.indd 349 3/22/2023 2:05:57 PM


350 Advanced Techniques

demiology, environmental health, psychology, and psy- 2. Without a strong conceptual time-ordering among
chiatry (respectively, Ananth & Brandt, 2022; Blum et variables, it may be impossible to prefer any one
al., 2020; Bullock & Green, 2021; Stuart et al., 2021). among equivalent models.
The general state of practice in mediation analysis has 3. In observational studies, it is difficult to avoid biased
clearly lagged behind best practices, or even common estimates due to omitted common causes without
sense in some cases. including proxies for confounders or instrumental
The body of literature devoted to mediation analysis variables in the model.
is relatively large and includes book-length presenta-
tions by Hayes (2022), Jose (2013), MacKinnon (2008), Presented in the upper left part of Figure 20.1(a)
and VanderWeele (2015), among other works. It is also is a basic model of mediation among three observed
an area that has witnessed the development of new variables, X, M, and Y. This recursive model is just-
analysis strategies or ways of thinking about media- identified (df M = 0), so it would perfectly fit the data.
tion in specific contexts, such as in randomized clinical The other five models in Figure 20.1(a) are also recur-
trials and the analysis of models with multiple media- sive and arbitrarily switch the roles of cause, media-
tors when the causal structure among them is unknown tor, and outcome among the same three variables. They
(Bullock & Green, 2022; Gonzalez et al., 2023; Loh all would also perfectly explain the same data, so they
et al., 2022; Lynch et al., 2008). Other recent develop- are equivalent to the original version. The model in the
ments include the availability of standards for reporting upper left part of Figure 20.1(b) is a nonrecursive model
results from mediation analyses in randomized trials or of mediation with freely estimated reciprocal effects
observational studies (Lee et al., 2021). between M and Y. The disturbance covariance is fixed
Given the scope of the topic, it is impossible to to equal zero, and the cause, X, is the instrument for
describe all of enhanced mediation analysis in a single M, which identifies the model (Figure 19.2). The direct
chapter. Instead, the goal of this presentation is to warn effect of X enters the causal loop by first affecting M,
readers away from common pitfalls and suggest some which is then transmitted back and forth between M
better, more modern, alternatives in term of research and Y, which are also mediators with respect to each
design and analysis options. Note that diagrams and other. This nonrecursive model is just-identified (df M =
examples in this chapter cover path models with sin- 0), so it perfectly fits the data. The other five models
gle indicators, but the ideas described next generally in Figure 20.1(b) indiscriminately switch the roles of
extend to models with multiple indicators for causes, instrument and variables in the causal loop, and they
mediators, or outcomes. This is fortunate because perfectly explain the data, too. Thus, all six models in
results of computer simulation studies by Iacobucci et Figure 20.1(b) are equivalent.
al. (2007) suggested that multiple-indicator analyses of There are still more equivalent versions. The three
mediation that control for measurement error generally mediation models in Figure 20.1(c) are also nonre-
outperform single-indicator analyses that do not con- cursive but now with equality-constrained directs in
trol for such error (i.e., standard regression analysis). the causal loops. With no disturbance covariance, no
instrument is needed to identify these models. Due to
the equality constraint, df M = 0 for all three models,
MEDIATION ANALYSIS so they perfectly fit the data. The three models in Fig-
IN CROSS‑SECTIONAL DESIGNS ure 20.1(d) are recursive, but they are not mediation
models. Instead, they feature a disturbance covariance
There are three great obstacles to estimating media- in bow-free pattern with no direct effect between the
tion in cross-sectional designs (Antonakis et al., 2010; endogenous variables. Because df M = 0 for the models
MacKinnon et al., 2007; O’Laughlin et al., 2018; Tate, in Figure 20.1(d), they, too, perfectly the data. Thus, all
2015): 18 models in Figure 20.1 are equivalent versions that,
as a set, make opposing causal claims.
1. The absence of temporal precedence implies Let’s consider a more complex model. Figure 20.2(a)
that mediation cannot be estimated in a way that is a parallel mediation model, where X is hypothe-
respects its very definition as involving changes sized to affect Y both directly and indirectly through
among cause, mediator, and outcome variables. two different mediators, M1 and M2. The mediators are

Pt4Kline5E.indd 350 3/22/2023 2:05:57 PM


(a) Recursive mediation

X M X Y Y M

Y M X

Y X M X M Y

M Y X

(b) Nonrecursive mediation, free reciprocal effects

X M X M Y M

Y Y X

Y M M X M X

X Y Y

(c) Nonrecursive mediation, equal reciprocal effects

X M M X Y X

Y Y M

(d) Recursive, no mediation

X M Y M M X

Y X Y

FIGURE 20.1. Equivalent models for three variables that represent recursive mediation (a), nonrecursive mediation with freely
estimated reciprocal effects (b), nonrecursive mediation with equality-constrained reciprocal effects (c), and recursive models
with bow-free pattern disturbance covariances but no mediation (d).

351

Pt4Kline5E.indd 351 3/22/2023 2:05:57 PM


352 Advanced Techniques

not causally linked. Instead, their association is speci- Thoemmes (2015) demonstrated that it is illegitimate
fied as spurious due to their common cause, variable to reverse arrows (direct effects) in mediation models
X. The same model assumes there is no unobserved to check whether one equivalent version is superior to
confounding of the mediators (i.e., their disturbances another. The supposed evidence is the finding that the
are independent). You should verify that df M = 1 for coefficient for an indirect effect in one model, such as
Figure 20.2(a). There is also no causal link between
the mediators in Figure 20.2(b), but they are assumed X → M1 → M2 → Y
to share at least one unmeasured cause (i.e., their dis-
turbances covary). Each of Figures 20.2(c) and 20.2(d) in Figure 20.2(c) is statistically significant, but the coef-
is a sequential (serial) mediation model with causal ficient for the path
effects between the mediators but in opposite direc-
tions, respectively, M1 → M2 versus M2 → M1. Because X → M2 → M1 → Y
df M = 0 for Figures 20.2(b)–20.2(d), they all perfectly fit
the data. Thus, if the parallel mediation model in Fig- in the other model, such as Figure 20.2(d), is not sta-
ure 20.2(a) with a single degree of freedom is rejected, tistically significant, or vice versa. This outcome is not
there is no way to statistically distinguish among Fig- surprising because each indirect effect just listed has its
ures 20.2(b)–20.2(d) with alternative causal structures own standard error, and it could happen in a particular
for the mediators. sample that p values are < .05 for either, both, or nei-

No causal link
(a) Parallel mediation (b) Unmeasured confounder(s)

M1 M1

X Y X Y

M2 M2

Causally linked (sequential mediation)


(c) M1 → M2 (d) M2 → M1

M1 M1

X Y X Y

M2 M2

FIGURE 20.2. Parallel mediation model with no causal effects or disturbance covariances between mediators (a), model
with unobserved confounding of the mediators but no causal effect between them (b), and sequential mediation models where
M1 causes M 2 (c) and M 2 causes M1 (d).

Pt4Kline5E.indd 352 3/22/2023 2:05:57 PM


Enhanced Mediation Analysis 353

ther of these indirect effects. Remember that p values use of incentives, subordinate conscientiousness, and
for specific effects in structural equation models have company policies, and these instruments had no direct
little, if anything, to do with whether the corresponding effects on turnover or model-implied associations
specification is correct. with its disturbance. So specified, the LMX variable
In an analytical study of alternative mediation is hypothesized to mediate effects of the instruments
models all in the same equivalence class, Thoemmes on turnover and has its own direct effect on turnover.
(2015) demonstrated that (1) incorrect models yielded In ordinary least squares (OLS) regression analysis,
indirect effects that are generally comparable in size the coefficient for the effect of LMX on turnover was
with the true indirect. Also, (2) the true indirect effect incorrect compared with estimates for the same effect
is not necessarily the largest among all possible indi- in in two-stage least squares (2SLS) analysis with
rect effects. Specifically, true models yielded the largest instruments for a mediation model. Other suggestions
indirect effect only about 20% of the time over incor- for reducing bias in observational studies include the
rect-but-equivalent versions. These results suggest that use of measures with strong psychometric properties
an empirical specification search that selects the model and avoidance of common method effects by measur-
with the largest indirect effect would be wrong in ing hypothesized mediators and outcomes with differ-
most cases. Thoemmes (2015) argued that researchers ent methods, such as observational versus archival, or
should abandon the attempt to prefer one model over different informants, such as teenagers versus their par-
equivalent versions based on statistical evidence alone. ents—see Antonakis et al. (2010) for more examples.
In cross-sectional designs, only a solid rationale about
conceptual time-ordering should be based on substan-
tive theory or refer to the nature of the variables that EFFECT SIZES FOR INDIRECT EFFECTS
would make one directionality specification more plau-
sible than its opposite (Tate, 2015). To summarize, it is Product estimators for indirect effects of continuous
not generally possible to “discover” mediation through variables in linear models, such as ab for the indirect
significance testing without strong a priori hypotheses effect of X on Y through M in Figure 8.2(b), describe
about effect directionality. the expected change in outcome Y while holding cause
It is often difficult in cross-sectional studies to plau- X constant and changing mediator M to whatever value
sibly assume there are no unmeasured confounders it would reach under a change in X. In the unstandard-
for every pair of variables in a basic mediation model. ized solution, change in Y as just described is expressed
Covariate selection is an option to identify causal in its raw score (original) metric, given a change of 1
effects in the presence of endogeneity, but it requires point in the original metric of X in its effect on M. In the
knowledge of both the nature and number of possible completely standardized solution, change is expressed
confounders. Instruments are another option, if con- in standard deviation units for Y, given a change in X
founders are unknown or if proxies for known con- of 1 standard deviation in how it affects M (Chapter
founders are unavailable. Ideally, there would be at 8). A drawback is that values of standardized indirect
least one instrument for each causal variable including effects—or standardized effect sizes in general—are
the mediator(s). Recall that model-implied instruments not directly comparable over samples with different
can be either exogenous or endogenous variables, and variances on the original (unstandardized) metrics of
potentially all exogenous variables could be “recruited” the variables. Also, if cause X is a dichotomy coded
as instruments (Antonakis et al., 2010). “1” for treatment and “0” for control, then the unstan-
Antonakis et al. (2014) described the use of instru- dardized product estimator would be preferred. This
ments to estimate causal effects of leadership–mem- is because a 1-point change in X directly specifies the
ber exchange (LMX)—the idea that effective leaders contrast between treatment and control (or any two dif-
develop different levels of relationships with their sub- ferent groups or conditions).
ordinates characterized by trust, mutual respect, and Preacher and Kelley (2011) described over a dozen
positive affect—on employee turnover, or the inten- other statistics for describing the magnitude of indirect
tion to stay with or leave a company, where LMX is effects. The smaller number of effect sizes described
an endogenous regressor that covaries with the distur- next are among the most widely reported in mediation
bance for turnover. Instruments for LMX were exog- studies or have statistical properties that seem promis-
enous variables such as leader extraversion, leader ing. I should caution you, though, that this is an active

Pt4Kline5E.indd 353 3/22/2023 2:05:57 PM


354 Advanced Techniques

research topic for which critiques of extant statistics or ances) are unaffected by sample size, the problems just
descriptions of new effect sizes are regularly published, mentioned are serious.
so this summary could already be dated. Also, there is Preacher and Kelley (2011) described kappa-
no universal, comprehensive mediation effect size that squared (κˆ 2), which is defined as the ratio of the
works in all situations; that is, basically all existing sta- observed (sample) indirect effect, ab, over the maxi-
tistics have limitations. Finally, it is not always clear mum possible value of the indirect effect with the same
how to extend effect sizes developed mainly for models sign in a particular data set, or max (ab). That is, the
with single mediators to models with multiple media- value of κˆ 2 describes the empirical indirect effect as the
tors. With these caveats in mind, let’s begin our review. proportion of the maximum attainable value of the indi-
Perhaps the most widely reported mediation effect rect effect, given the data. Values of κˆ 2 are unaffected
size is the mediation ratio, also called the relative by sample size and are interpreted in a standardized
indirect effect. Referring to Figure 8.2(b), the media- metric from 0 to 1.0. Wen and Fan (2015) criticized κˆ 2
tion ratio is calculated as for its lack of the property of rank preservation; that is,
the value of κˆ 2 can decrease while the indirect effect
ab
P̂M = (20.1) on which it is based increases. They also demonstrated
ab + c′ that values of κˆ 2 can be contradictory for models with
which is the ratio of indirect effect of X on Y through multiple mediators and that its definition of the maxi-
M, or ab, to the total effect of X on Y, or ab + c′. It is mum possible indirect effect is mathematically prob-
intended to describe the proportion of the total effect lematic. These drawbacks limit the theoretical useful-
that is mediated, but there are problems with this inter- ness of κˆ 2 (Lachowicz et al., 2018).
pretation: The value of the mediation ratio can exceed Lachowicz et al. (2018) introduced upsilon (u), a
1.0 or can be negative, so it is not a simple proportion. standardized effect size that in the population equals
Its absolute value is unbounded in that it can approach the proportion of variance in the outcome explained
infinity when the total effect is close to zero, which jointly by the cause and mediator and adjusted for spu-
can happen in inconsistent mediation (Topic Box 8.1). rious associations induced by the causal ordering of the
Another drawback is that the ratio of a very small indi- variables (i.e., variables M and Y have a common cause,
rect effect over a total effect that is just a tiny bit larger X). For example, u = .10 indicates that 10% of the popu-
can be relatively high, such as .95, could be misinter- lation variance in Y is explained jointly by X and M
preted as indicating a relatively important effect. Like- after correcting for spurious associations. When sup-
wise, the ratio can appear to trivialize larger effects, pression effects are not evident (e.g., mediation is not
such as .50 for a meaningful indirect effect that cor- inconsistent; Topic Box 8.1), values of u are bounded by
responds to half of an even bigger total effect (Preacher 0 and 1.0; otherwise, it is a monotonic function of the
& Kelley, 2011) indirect effect in that u increases in value beyond 1.0 as
A second mediation effect size is the ratio of the the size of the indirect effect increases.
indirect effect of X on Y through M to the direct effect The sample estimator, or û, is just the squared stan-
of X on Y, or dardized indirect effect, or
ab 2
R̂ M = (20.2) =uˆ (a=
s bs ) as2bs2 (20.3)
c′
which is intended to compare the relative magnitudes where as and bs are the coefficients in the completely
of the indirect versus direct effects of X. Like the standardized solution for, respectively, the direct effects
mediation ratio, the ratio of the indirect effect to the of X on M and of Y on M. Recall that standardized effect
total effect is not a proportion because its value can sizes depend on values of the variances in a particular
also exceed 1.0 or be negative. It can also exagger- sample, and û is no exception. Because û closely resem-
ate smaller effects or trivialize larger effects. Both P̂M bles an R2-type size, it is a positively biased estimator of
and R̂ M have relatively large variances over random the parameter u. Lachowicz et al. (2018) described an
samples, so they are not efficient estimators. Values of adjusted estimator, u , based on the asymptotic covari-
P̂M can be unstable if N < 500, and R̂ M may require ance matrices (sampling variances) for the unstandard-
N > 5,000 before its values are stable (Preacher & ized regressions of M on X and of Y on both X and M; see
Kelley, 2011). Although both P̂M and R̂ M are relatively Lachowicz et al. (2018, p. 251) for the equation. In large
easy to calculate and their values (but not their vari- samples, bias in û is relatively slight, but there is a bigger

Pt4Kline5E.indd 354 3/22/2023 2:05:58 PM


Enhanced Mediation Analysis 355

difference in smaller samples. The MBESS package for Variable X is achievement values, where a higher score
R (Kelley, 2022) calculates all the effect sizes described indicates greater personal value for strong academic
to this point and can optionally use nonparametric boot- achievement; the mediator M is deviance tolerance,
strapping from a raw data file to generate confidence where a higher score means greater tolerance of behav-
intervals for the parameters estimated by them. ior that violates the rights of others; and the outcome
Presented next is a covariance matrix in lower diago- Y is level of self-reported deviant behavior. The model
nal form for three variables, respectively, X, M, and Y, has zero degrees of freedom because the cause has both
based on data from Jessor and Jessor (1991) (cited in direct and indirect effects on the outcome.
Presented in Table 20.1 for analysis 1 are the syntax
Preacher & Kelley, 2011) in a sample of N = 432 high
and output files for fitting in MBESS a basic single-
school students:
mediator model to the covariance matrix just listed (I)
with the OLS estimator. The same model and data were
2.268 (I) analyzed in lavaan to obtain standardized estimates
.662 2.276 and standard errors for the indirect and total effects.
Reported in the top part of Table 20.2 are the estimates
–.087 –.226 .092
for all effects, and listed in the bottom part of the table

TABLE 20.1. Script and Output Files for Estimation of Effect Sizes
for the Indirect Effect in a Basic Mediation Model
Analysis Script file R packages
1. Coefficients and effect sizes for the indirect jessor-effect-size.r MBESS
effect in a basic mediation model lavaan

2. Interventional in(direct) effects in a van-ryzin-intervention.r lavaan


randomized study of a family-based program

Note. Output files have the same names except the extension is “.out.”

TABLE 20.2. Ordinary Least Squares Estimates for a Basic


Mediation Model and Effect Sizes for the Indirect Effect
Effect Coefficient(s) Estimate
Direct and indirect effects
X → M (AV → DT) a .292 (.046) .291a
M → Y (DT → DB) b –.096 (.009) –.479
X → Y (AV → DB) c′ –.010 (.009) –.051
X → M → Y (AV → DT → DB) ab –.028 (.005) –.140
Total effect of X on Y ab + c′ –.038 (.010) –.190

Effect sizes
P̂M ab/(ab + c′) .733
R̂ M ab/c′ 2.744
û (asbs)2 .019

Note. Effect sizes were calculated from values in computer output for this analysis (see
Table 20.1); otherwise, there is noticeable rounding error in hand calculation at 3-deci-
mal accuracy. AV, achievement values; DT, deviance tolerance; DB; deviant behavior.
aUnstandardized (standard error) standardized.

Pt4Kline5E.indd 355 3/22/2023 2:05:58 PM


356 Advanced Techniques

are values of effect sizes for the indirect effect. The mediation is complete or partial; that is, when the
effect size κˆ 2 is not reported in the table due to the prob- mediator wholly explains the association between
lems described earlier, but it is printed in the output file cause and outcome versus the mediator explains
for this analysis. Exercise 1 asks you to interpret the some, but not all, of that association (Topic Box 8.1).
coefficients for the unstandardized and standardized
indirect effect of achievement values on deviance toler- In a mediation model for a strictly longitudinal medi-
ance through deviance tolerance, and Exercise 2 asks ation design, there are no direct effects between vari-
you to reproduce and interpret the values of the effect ables measured at the same occasion. Instead, covari-
sizes for the indirect effect in the table. ances are specified based on theory either between
exogenous variables at the first measurement occasion
or “downstream” in the model between disturbances for
CROSS‑LAG PANEL DESIGNS concurrently measured endogenous variables. Direct
FOR MEDIATION effects are either autoregressive for the same variable
or cross-lagged for different variables measured at dif-
Effects from cause to mediator to outcome take time ferent times (Cole & Maxwell, 2003).
to occur, but use of a cross-sectional design basically Two examples of cross-lagged panel models for
assumes that such effects are instantaneous, which is estimating mediation are presented in Figure 20.3.
logically flawed (Selig & Preacher, 2009). Estimation of The model in Figure 20.3(a) corresponds to a half
indirect effects in cross-sectional designs also assumes longitudinal design (Cole & Maxwell, 2003). In this
that (1) dynamic relations among cause, mediator, and model, the subscripts indicate the measurement occa-
outcome—respectively, X, M, and Y—have stabilized sion for each of the cause, mediator, and outcome. In
so that the coefficients for paths connecting any pair of such designs, the cause is measured only at time 1, but
variables at adjacent time points remain the same over the mediator and outcome are each measured at times
time, or stationarity; and (2) the system has reached 1 and 2 (see the figure). At time 2, the disturbances
equilibrium such that cross-sectional relations among for the mediator and outcome are allowed to covary.
X, M, and Y are constant over time (Maxwell & Cole, Coefficient a in the figure estimates the cross-lagged
2007). But even when these ideal conditions hold, there direct effect of the cause at time 1 on the mediator at
are reasons to suspect that estimates of indirect effects time 2, controlling for the mediator at time 1. Simi-
in cross-sectional studies could be substantially biased larly, coefficient b in the figure estimates the cross-
(Maxwell & Cole, 2007; Maxwell, Cole, & Mitchell, lagged direct effect of the mediator at time 1 on the
2011; Selig & Preacher, 2009): outcome at time 2, now controlling for the outcome at
time 1. Although there is no contiguous pathway from
1. Coefficients for direct effects in such designs fail X to M to Y in the figure, the indirect effect is never-
to control for autoregressive effects of the mediator theless estimated for continuous variables assuming
and outcome variables. For example, the mediator no interaction as the product ab. This product estima-
is regressed on the cause but not also on the media- tor controls for the autoregressive effects of both the
tor from an earlier point in time, or prior to the sole mediator and outcome. The figure has no direct effect
measurement occasion in a cross-sectional design. from X to Y, but adding one is an option, if justified
Thus, this coefficient may not correctly estimate the in theory. Exercise 3 asks you to verify that Figure
impact of the cause on the mediator. For a similar 20.3(a) is identified and to tally df M, and Exercise 4
reason, the coefficient for the mediator–outcome asks you to describe the relation between X and Y
direct effect could be wrong, but now due to failure specified in the figure.
to control for the prior state of the outcome. In the analysis of a half longitudinal model where
the mediator, executive cognitive function, was repre-
2. Estimates of indirect effects in cross-sectional sented as a common factor with three indicators, Brøn-
designs can be substantially biased when the true nick et al. (2017) found little evidence that executive
mediational process follows longitudinal random function mediated the association between physical
effects models such as latent growth curve models, activity and scholastic performance among children
which are described in the next chapter. followed for 7 months. Physical activity at time 1 also
3. Both patterns of bias just described can occur when failed to appreciably predict achievement at time 2, a

Pt4Kline5E.indd 356 3/22/2023 2:05:58 PM


Enhanced Mediation Analysis 357

(a) Half longitudinal mediation

X1
a

M1 M2
b

Y1 Y2

(b) Full longitudinal mediation

e e′
X1 X2 X3
c c′
h h′

g g′
M1 M2 M3

d i d′ i′

Y1 Y2 Y3
f f′

FIGURE 20.3. A half longitudinal model of mediation (a). A full longitudinal model of mediation (b). Subscripts indicate
measurement occasions for the same variables.

finding that Brønnick et al. (2017) described as gener- and d′ correspond to the sole contiguous pathway in the
ally contradicting results from analyses of similar vari- model through which changes in the initial cause can
ables in cross-sectional studies. indirectly affect outcome, or
In a full longitudinal design (Cole & Maxwell,
2003), the cause, mediator, and outcome are each mea- X1 → M2 → Y3
sured on at least three different occasions. This design
is represented in Figure 20.3(b). Disturbances within The model for the full longitudinal design in Figure
each of the second and third measurement occasions 20.3(b) contains within it two replications of the half
are all pairwise correlated. Assuming continuous vari- longitudinal design. Specifically, (1) the product cd for
ables and no interactions, the indirect effect is estimated the paths
as the product of coefficients c and d′, or cd′. Coefficient
c estimates the direct effect of the cause at time 1 on the X1 → M2 and M1 → Y2
mediator at time 2, controlling for the previous level
of the mediator, and coefficient d′ estimates the direct is a proxy for the “real” estimator of mediation, or cd′.
effect of the mediator at time 2 on the outcome at time Likewise, (2) the product c′d ′ for the paths
3, controlling for the autoregressive effect of the out-
come at time 2. Little (2013) notes that coefficients c X2 → M3 and M2 → Y3

Pt4Kline5E.indd 357 3/22/2023 2:05:58 PM


358 Advanced Techniques

estimates a different proxy. The values of the two proxy The example just described is a special case of con-
estimators just mentioned (cd, c′d ′) and that of cd′ for ditional process analysis (CPA), where moderators
the sole contiguous indirect pathway should all be simi- can be categorical or continuous variables. It works
lar under the assumption of stationarity, which also pre- by estimating both mediation and moderation in the
dicts all the equivalences (within the limits of sampling same analysis to test hypotheses about how causal
error) listed next for Figure 20.3(b): mechanisms depend on individual differences, con-
text, situation, or stimuli (Hayes & Rockwood, 2020).
c = c′ d = d′ e = e′ f = f′ Although the idea of integrating mediation and mod-
g = g′ h = h′ i = i′ eration analysis is not new (Edwards & Lambert, 2007;
Fairchild & MacKinnon, 2009), the method of CPA as
Within a sample of young adults surveyed annu- developed by Hayes (2022) includes a suite of freely
ally over a 3-year period, Garn and Simonton (2022) available macros (scripts) for SPSS, SAS/STAT, and R
analyzed a full longitudinal model that represents the called PROCESS that automates much of the analysis
hypothesis that emotions about physical activity— of conditional process models compared with specify-
whether it is experienced as enjoyable or boring—medi- ing the model in the syntax of an SEM computer tool.1
ated the relation between motivation and value beliefs It applies OLS or logistic regression methods to ana-
about physical activity and the degree of active versus lyze observed-variable path models that include inter-
sedentary leisure time activities. Motivation beliefs action or indirect effects. Nonparametric bootstrapping
concern whether being physically active is part of from raw data files is used to generate standard errors
one’s self-concept, and value beliefs reflect the amount and confidence intervals for specific effects. Causal
of interest in physical activity. Three behavioral out- variables can optionally be analyzed in mean-deviated
come measures were analyzed, including the frequency form (i.e., centering), and there is no option for comput-
and duration of self-reported moderate-intensity and ing standardized coefficients (Hayes, 2022).
vigorous-intensity leisure time physical activity versus Models with single or multiple mediators or mod-
sedentary activities like socializing or studying. There erators can be analyzed in CPA using the PROCESS
was little evidence that either enjoyment or boredom macros. Most examples of CPA in the literature involve
mediated relations between motivation or value beliefs cross-sectional designs, but both the method and the
and leisure time physical activities that are vigorous, PROCESS macros can also be applied in longitudinal
moderate, or sedentary. Instead, reciprocal effects designs when analyzing cross-lag panel models. Unlike
over time between motivation and value beliefs about causal mediation analysis (discussed in the next sec-
physical activity and engaging in at least moderately tion), where direct and indirect effects are defined in
intense leisure time physical activity were stronger. ways that allow for interactions between causes and
That is, young adults who saw being physically active mediators, the technique of CPA generally assumes
as important spent more of their leisure time engaged no cause–mediator interactions in their effects on out-
in physical activity, which lead to even higher levels of comes (Hayes & Rockwood, 2020). Large samples
motivation to remain active.
are generally required when estimating conditional
causal effects. As in standard regression analysis, the
PROCESS macros assume zero measurement error in
CONDITIONAL PROCESS ANALYSIS
causal variables, so very precise scores are needed, too.
Another software option for CPA is the R package pro-
Estimation of conditional indirect effects by simulta-
cessR by Moon (2019).2 There is also a Shiny App web
neously fitting the same structural model to data from
page for processR through which the researcher can
multiple samples, each drawn from a different popula-
upload a data file and specify the analysis in a graphi-
tion, was described in Chapter 12. In this case, group
cal user interface.3
membership is treated as a categorical moderator vari-
Figure 20.4(a) is a skeletal diagram for a first-stage
able, and if magnitudes of direct or indirect effects vary
appreciably over groups, there is evidence for contin- 1 http://www.processmacro.org/index.html
gent causation. If indirect effects are conditional and
2 https://github.com/cardiomoon/processR
the use of the term “mediation” is warranted by design
or rationale, then moderated mediation is indicated. 3 https://cardiomoon.shinyapps.io/processR/

Pt4Kline5E.indd 358 3/22/2023 2:05:58 PM


Enhanced Mediation Analysis 359

conditional process model, where variable W is makes apparent that W indirectly affects Y through M,
assumed to moderate the first stage of the indirect effect too, although there is no direct effect of W on Y.
The regression equations shown next without inter-
X→M→Y cepts for the first-stage conditional process model in
Figure 20.4(b) are
That is, causal variables W and X interact in their
effects on the mediator M, which implies that the direct M̂ =a1 X + a2W + a3 XW (20.4)
effect of X on M varies over the levels of W. Because
interaction is symmetrical, it is also true that the direct Ŷ bM + c′X
= (20.5)
effect of W on M depends on X. The perspective just
stated about symmetry is more obvious in Figure For continuous variables, coefficient a1 in Equation
20.4(b), which represents in regression style all effects 20.4 estimates the linear direct effect of X on M when
and variables, including the product term XW. (Recall at the point W = 0 when the scores on both X and W are
that product terms have no causal agency by themselves not mean-centered or at the mean of W when the scores
because they represent joint causal effects of their con- are mean-centered—see Edwards (2009). Coefficient
stituent variables; Appendix 7.A.) Figure 20.4(b) also a2 has a similar interpretation except now for causal

W moderates X → M
(a) Skeletal (b) Regression style

a3
XW M

W M a2

W b
a1
X Y
X Y
c′

W moderates X → M, X → Y
(c) Skeletal (d) Regression style

a3
XW M
W M a2
e b
W
d
X Y a1

X Y
c′

FIGURE 20.4. First-stage conditional process models, where variable W moderates just the first stage of the indirect effect
of X on Y through M represented in skeletal (a) and full regression style (b), and where W also moderates the direct effect of
X on Y represented in skeletal (c) and full regression style (d).

Pt4Kline5E.indd 359 3/22/2023 2:05:58 PM


360 Advanced Techniques

variable W. Coefficient a3 estimates the linear × linear process remains the same.4 See Topic Box 20.1 for a
interaction, or the amount by which the linear effect of numerical example.
X on M varies in a progressive (linear) way over the Definitions of conditional indirect effects can be
levels of W, and vice versa. If a3 ≠ 0, then the first stage expanded in many ways. For example, in a second-
of indirect effect of X on Y through M is moderated by stage conditional process model, at least one variable
W. Because interaction is symmetrical, it is just as true moderates the direct effect of the mediator on the out-
that variable X moderates the first stage of the indirect come, or M → Y, but the direct effect of the cause on
effect of W on Y through M—see Figure 20.4(b). the mediator, or X → M, is constant; that is, it is not
Given Equations 20.4 and 20.5 for Figure 20.4(b), moderated. In a first- and-second-stage conditional
the product estimator for the conditional indirect effect process model, both direct effects that make up the
of X on Y through M is indirect effect of X on Y through M are moderated by
at least one other variable. In the simplest model of
(a1 + a3W ) b =
a1b + a3bW (20.6) this kind, a single variable W moderates both stages
of the indirect effect just mentioned, but larger mod-
where the expression in parentheses is the conditional els can include moderators that are unique to either the
direct effect of X on M that varies over the levels of first stage or to the second stage. Models with multiple
the moderator W, and coefficient b estimates the direct mediators can also be analyzed. Hayes and Rockwood
effect of M on Y, which is not moderated by W. The (2020) described extensions to multilevel conditional
coefficient for moderator W in the equation, or a3b, is process analysis—see Hayes (2022) for more informa-
the index of moderated mediation for a first-stage tion and examples.
conditional process model. If the direct effect of X on
M does not depend on W (i.e., there is no moderation),
then a3b = 0 and the indirect effect of X on Y is esti- CAUSAL MEDIATION ANALYSIS BASED
mated by the conventional product estimator a1b; oth- ON NONPARAMETRIC MODELS
erwise, larger values of the index suggest a greater con- AND COUNTERFACTUALS
ditional indirect effect. If moderator W is dichotomous
and coded such that a 1-point difference represents Pearl (2014) defined mediation based on nonparametric
group membership (e.g., coded 0 or 1), then the index causal models that make no assumptions about distri-
of moderated mediation is the group difference for the butional characteristics for any variable (Chapter 6).
indirect effect (Hayes & Rockwood, 2020). Such models generally assume that causes and media-
Hayes and Rockwood (2020) noted that Equation tors interact in their effects on outcomes; that is, the
20.5 for outcome Y in a first-stage conditional process effect of the cause on outcome changes over the levels
model can optionally include W and XW as additional of the mediator, and vice versa. Thus, cause–media-
predictors. For example, tor interaction effects are routinely estimated in this
approach. Mediation is also defined from a counter-
Ŷ = bM + c′X + dW + eXW (20.7) factual perspective that combines the potential out-
comes model with more traditional SEM (Chapter 1).
specifies that the direct effect of X on Y is moderated by The two features just mentioned extend the definition
W. But given Equation 20.4 for the mediator and Equa- of mediation to include nonlinear models for categori-
tion 20.7 for the outcome, variable W does not moderate cal outcomes or mediators while still including linear
the direct effect of M on Y, or the second stage of the models for continuous variables where no interaction
indirect effect from X to Y through M. Figure 20.4(c) is assumed as a special case (i.e., the Baron–Kenny
is the skeletal form of the model just described, where method; Topic Box 8.1).
the dashed line represents the hypothesis that W mod-
erates a direct effect (i.e., of X on Y) that is not part of
Natural Direct and Indirect Effects
an indirect causal pathway. The full regression version
of the model is represented in Figure 20.4(d) with coef- Next, we assume a dichotomous cause X coded as (0,
ficients from Equations 20.4 and 20.7. The product esti- 1), where the levels correspond to, respectively, control
mator for conditional indirect effect of X on Y through
M in Figure 20.4(d) is still (a1 + a3W)b (i.e., Equation 4 Hayes (2022) referred to Figures 20.4(a) and 20.4(c) as, respec-
20.6), so the basic definition of a first-stage conditional tively, Model 7 and Model 8.

Pt4Kline5E.indd 360 3/22/2023 2:05:58 PM


Enhanced Mediation Analysis 361

TOPIC BOX 20.1

Numerical Example of a First‑Stage Conditional Process Analysis


In a sample of N = 245 youth soccer players, Curran et al. (2013) administered measures of player behav-
ioral engagement (effort, persistence; Yenga), basic need satisfaction (willing participation, sense of com-
petence; M need), and perceived coaching support for structure (what is expected is made clear; Xstru) and
autonomy (players have choices; Wauto). In a basic mediation model that excludes autonomy, Curran et
al. (2013, p. 34) reported the unstandardized path coefficients listed next in equations without intercepts:

ˆ
M need = .330 X stru
Yˆenga .665Mneed + .063X stru
=

The estimator for the unconditional indirect effect of structure on engagement through need satisfaction
is .330(.665), or .219. Thus, for every 1-point increase in coaching structure, the expected increase in
engagement is .219 points through the mediator of need satisfaction when ignoring autonomy.
Curran et al. (2013) tested the hypothesis that coaching autonomy moderates the first stage of the
direct effect just described. Specifically, they predicted that the indirect effect of structure on engagement
through need satisfaction becomes increasingly positive as perceived autonomy increases. Their condi-
tional process model corresponds to Figure 20.4(d), where autonomy is also specified to moderate the
direct effect of structure on engagement, but not the direct effect of need satisfaction on engagement. The
equations with values of unstandardized coefficients reported by Curran et al. (2013, p. 35) are listed next:

ˆ
M −.677 X stru − .612 Wauto + .181X struWauto
need =
Yˆenga = .599 Mneed − .274X stru − .084Wauto + .054X struWauto

Given the results just listed and referring to the symbols in Figure 20.4(d), the coefficients for comput-
ing the conditional indirect effect of structure on engagement through needs satisfaction are

a1 = –.677   a3 = .181   b = .599

Thus, the product estimator that depends on autonomy is

(a1 + a3Wauto)b = a1b + a3bWauto = –.677(.599) + .181(.599) Wauto = –.406 + .108 Wauto

where the index of moderated mediation, or a3b = .108, does not equal zero (i.e., the indirect effect is
conditional). Given M = 4.84 and SD = 1.09 for autonomy (Curran et al., 2013, p. 34), we can calculate
the indirect effect at levels of the moderator that correspond to 3.75, 4.84, and 5.95, or values that are,
respectively, 1 SD below on the mean, the mean, and 1 SD above the mean:

Wauto = 3.75 Wauto = 4.84 Wauto = 5.95


Coefficient for
Xstru → M need → Yenga .001 .117 .237
(continued)

Pt4Kline5E.indd 361 3/22/2023 2:05:59 PM


362 Advanced Techniques

In other words, the indirect effect of coaching structure on player engagement through player need satis-
faction becomes increasing positive as the level of coaching autonomy increases, which is consistent with
Curran et al.’s (2013) prediction.
Curran et al. (2013) applied the Johnson– ­Neyman technique for probing interactions to identify
regions of significance, or boundary values of the moderator (autonomy) for which the indirect effect
of structure on engagement through need satisfaction is statistically significant at the .05 level (Preacher et
al., 2007). They reported that the indirect effect was positive and significant for values of Wauto > 4.71. At
this boundary point for the moderator, the coefficient for the indirect effect is .103 (you should verify this
result), so coefficients that exceed .103 are significant at the .05 level. But for Wauto < 2.43, the indirect
effect is statistically significant but is now negative: At this lower boundary point, the coefficient for the
indirect effect is –.144, so any coefficient less than this value is significant. Given both results, the indirect
effect of structure on engagement through need satisfaction is not significant at the .05 level when the level
of the autonomy moderator lies within the interval [2.43, 4.71].
Presented in Figure 20.5 is a plot of the indirect effect of coaching structure on soccer player engage-
ment through player need satisfaction as a function of coaching autonomy structures. The solid line repre-
sents how the indirect effect changes from negative at lower levels of autonomy to positive at higher levels.
The dashed lines represent the lower and upper limits of bootstrapped confidence intervals reported by
Curran et al. (2013). The shaded areas in the figure designate the regions of significance for the indirect
effect as a function of the level for the moderator. Curran et al. (2013) described the conditional indirect
effect just estimated as antagonist because it is positive when coaching autonomy is higher (right side of
Figure 20.5) but is negative when autonomy is lower (left side of the figure).

.50
.40
.30
.20
Indirect effect

.10
0
−.10
−.20
−.30
−.40
−.50

2.00 3.00 4.00 5.00 6.00 7.00


Autonomy

FIGURE 20.5. Plot of the indirect effect of coaching structure on soccer player behavioral engagement through play-
ers’ satisfaction as a function of coaching autonomy. Dashed lines represent 95% bootstrapped confidence intervals for
the indirect effect. Shaded areas correspond to Johnson–Neyman regions of significance for the .05 level. Adapted,
with permission, from “A Conditional Process Model of Children’s Behavioral Engagement and Behavioral Disaffection
in Sport Based on Self-Determination Theory,” by T. Curran, A. P. Hill, and C. P. Niemiec (2013), Journal of Sport &
Exercise Psychology, 35 (1), p. 36. doi: 10.1123/jsep.35.1.30. Copyright © 2013 by Human Kinetics Publishers.

Pt4Kline5E.indd 362 3/22/2023 2:05:59 PM


Enhanced Mediation Analysis 363

versus treatment in a randomized trial, but the same basic ideas apply to a binary cause like exposure to a risk factor (i.e., no vs. yes) in an observational study. Definitions stated next involve counterfactuals. For a mediator M and outcome Y, the controlled direct effect (CDE) is defined as the difference in outcome between the treated and untreated cases, if the mediator were controlled (fixed) at the same level for all cases in the population. Expressed as an expected (typical) value in a basic mediation model with no covariates,

CDE = E(Y1,m − Y0,m)   (20.8)

which is the magnitude of the treatment effect, if the mediator were fixed to a particular value, or M = m, for all cases.5 Controlled direct effects may be of interest in situations where it is possible to intervene at a population level by fixing a mediator to the same value for everyone, such as interest rates or levels of air pollution, while also implementing an intervention. In such prescriptive contexts, there is no natural variation in the mediator because it is set to equal a constant for all cases (Pearl, 2009).

5 The meaning of a "controlled indirect effect" is not straightforward. For example, Pearl (2009) argued that such effects are not defined, while Kaufman et al. (2004) noted that a controlled indirect effect may be defined in cases where there is no interaction between cause and mediator.

The natural direct effect (NDE) is the difference in outcome due to varying the cause from control to treatment (i.e., from X = 0 to X = 1) while keeping the mediator for each case fixed at the value it would have reached under no treatment (i.e., X = 0, the control condition). Unlike for the CDE, the level of the mediator is not fixed to the same constant for all cases. Defined as an expected value,

NDE = E(Y1,M0 − Y0,M0)   (20.9)

which is the size of the treatment effect with the mediator assuming whatever value it would be in the control condition, or M0, for each case. Thus, the context for the NDE is descriptive, not prescriptive, because the mediator varies over cases but only to the extent that it does in the control condition (Pearl, 2009).

If the cause and mediator do not interact, then the controlled and natural direct effects are equal, or CDE = NDE. This is because when the direct effect of treatment is constant across all levels of the mediator (no interaction), then setting the mediator to a fixed value (CDE) or considering the value of the mediator that would have been observed in the control condition (NDE) generates the same result (Richiardi et al., 2013). But when cause and mediator interact, there is no single value for the CDE. Indeed, if the mediator is continuous, there are infinite controlled direct effects. In this case, the NDE corresponds to a weighted average of the CDE over all levels of the mediator.

The natural indirect effect (NIE) is the amount of change in outcome among treated cases as the mediator changes from values that would be observed in the control group to the levels it would attain in the treatment group for each case. Thus, the NIE addresses the question, "If everyone received the treatment but with the mediator changed from the level it would be with no treatment, or M0, to the level it would be with treatment, or M1, how much would the outcome change?" Defined as an expected value,

NIE = E(Y1,M1 − Y1,M0)   (20.10)

The sum of NDE and NIE is the total effect (TE) of the cause on the outcome, and this is true whether the cause and mediator interact or do not interact in their effects on outcome. Expressed as an expectation,

TE = NDE + NIE = E(Y1 − Y0)   (20.11)

which for a continuous outcome is the average difference between the treatment and control conditions. Nguyen et al. (2021) described TE as the effect of shifting the cause from control to treatment (i.e., from X = 0 to X = 1) without intervening on the mediator. The numerical example in Topic Box 20.2 may help to clarify the definitions just considered.

Identification of controlled direct effects requires the assumptions that (1) the treatment and outcome are not confounded and (2) the mediator and the outcome are not confounded, given covariates that are all unaffected by the cause. There are also positivity assumptions that values of the mediator and all covariates exist with positive (> 0) probabilities under both the treatment and control conditions (Torres, 2020). Natural (in)direct effects require additional assumptions. A strong version is the requirement for sequential ignorability that there is no confounding between any pair of variables among cause, mediator, and outcome under covariate adjustment (Pearl, 2014). The term "sequential" means that the mediator is considered to be randomly assigned


TOPIC BOX 20.2

Numerical Example of Natural Direct and Indirect Effects

For a dichotomous cause X and a continuous mediator M and outcome Y, the linear mediation model that allows for interaction between X and M is

M̂ = a0 + a1X   (20.12)
Ŷ = b0 + b1X + b2M + b3XM

where a0 and b0 are intercepts in their respective equations and XM is a product term that represents the interactive effect between the cause and mediator. The CDE, NDE, and NIE are defined as follows (VanderWeele, 2015):

CDE = b1 + b3m   (20.13)
NDE = b1 + b3a0
NIE = (b2 + b3)a1

In other words, the CDE is how much outcome Y would change on average, if the mediator is controlled at the same level for all cases, or M = m, but the cause changes from X = 0 to X = 1. The NDE is how much Y would change on average if the cause were changed from X = 0 to X = 1, but the mediator M is kept to the level that it would have taken when X = 0, which equals the intercept a0 when predicting M from X. The NIE is the mean change in Y when X = 1 as M changes from the level that would be observed for X = 0 to the level it would attain under X = 1. If there is no interaction, then b3 = 0 in Equation 20.12. In this case, (1) both the CDE and NDE equal the unconditional direct effect of X, or b1; and (2) the NIE equals the classical product estimator of the indirect effect, or a1b2.

Suppose that X = 1 is an antiretroviral therapy for human immunodeficiency virus (HIV) and X = 0 is control; the mediator M is viral load, or the blood level of HIV; and the outcome Y is the level of CD4 T-cells, or helper white blood cells (Petersen et al., 2006). Values for regression coefficients and intercepts estimated in a hypothetical sample are

M̂ = 1.70 − .20X
Ŷ = 450.00 + 50.00X − 20.00M − 10.00XM

That is, the predicted viral load in the control group is 1.70, but treatment reduces this count by .20. For control patients with no viral load, the predicted level of CD4 T-cells is 450.00. Treatment increases this amount by 50.00 for patients with no viral load, and for every 1-point increase in viral load for control patients, the level of CD4 T-cells drops by 20.00. For treated patients, the slope of the regression line for predicting the level of CD4 T-cells decreases by 10.00 compared with control cases.

These results just summarized imply that

a0 = 1.70 and a1 = –.20

b0 = 450.00, b1 = 50.00, b2 = –20.00, and b3 = –10.00

Thus, the direct effect of treatment versus control at a given level of viral load M = m is

CDE = 50.00 – 10.00m

The researcher can select a particular value of m and then estimate the CDE by substituting this value in the formula just listed. Another option is to estimate the direct effect at the weighted average of M for the whole sample. The direct effect of treatment estimated at the level of viral load that would have been observed in the control condition is

NDE = 50.00 – 10.00(1.70) = 33.00

where 1.70 is the predicted value of viral load in the control condition (X = 0). The natural indirect effect of treatment allowing viral load to change as it would from the control to treatment condition is estimated as

NIE = (–20.00 – 10.00)(–.20) = 6.00

where –.20 is the difference in viral load between the treatment and control conditions. The total effect of treatment is the sum of the natural direct and indirect effects, or

TE = 33.00 + 6.00 = 39.00

Thus, antiretroviral therapy increases the level of CD4 T-cells by 39.00 through both its natural direct effect (33.00) and its natural indirect effect through viral load (6.00). Cheng et al. (2021) described estimators of natural indirect effects for models with binary or continuous mediators or outcomes.
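The arithmetic in Topic Box 20.2 is simple enough to script directly. Below is a minimal R sketch that reproduces the CDE, NDE, NIE, and TE from the hypothetical coefficients in the box; the function and object names are my own, not part of any package.

```r
# Hypothetical coefficients from Topic Box 20.2:
#   mediator model:  M-hat = a0 + a1*X
#   outcome model:   Y-hat = b0 + b1*X + b2*M + b3*X*M
a0 <- 1.70;   a1 <- -0.20
b0 <- 450.00; b1 <- 50.00; b2 <- -20.00; b3 <- -10.00

# Controlled direct effect at a chosen mediator value m (Equation 20.13)
cde <- function(m) b1 + b3 * m
cde(1.70)                 # 33.00 when m equals the predicted control viral load

# Natural direct effect: mediator held at its expected control value (a0)
nde <- b1 + b3 * a0       # 50.00 - 10.00 * 1.70 = 33.00

# Natural indirect effect: mediator shifts by a1 under treatment
nie <- (b2 + b3) * a1     # (-20.00 - 10.00) * (-.20) = 6.00

# Total effect decomposes as NDE + NIE
te <- nde + nie           # 39.00
```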

(i.e., an experimentally manipulated variable), given a randomized cause and the covariates, although mediators in most studies are measured, not manipulated, variables.

Pearl (2014, pp. 465–467) described somewhat less demanding requirements for identifying natural (in)direct effects that allow for methods other than covariate adjustment to control confounding or methods that apply mainly in observational studies, but they all generally assume no posttreatment (treatment-dependent) confounders, or variables caused by treatment that are, in turn, common causes of the mediator and outcome, when the cause and mediator are assumed to interact.6 This is a cross-world independence assumption that knowing the value of the mediator when there is no treatment (X = 0), or M0, provides no information about the effects on the outcome that would be observed by setting both the mediator and cause to other values (e.g., X = 1 for the cause), given the covariates. This requirement is "cross-world" because the mediator and outcome values are assumed to be unrelated even in counterfactual causal worlds where assignment to treatment or control differs for the mediator and outcome.

6 In contrast, baseline confounders are common causes of the mediator and outcome, but they are not affected by treatment. Analysis of baseline confounders as covariates in covariate adjustment is relatively straightforward.

It can be challenging, if not impossible, to empirically evaluate the cross-world independence assumption or even to justify it based on substantive knowledge (Andrews & Didelez, 2021). It is also true that some posttreatment confounders, such as treatment adherence or compliance, are measured among treated cases only (i.e., they are unobserved in the control group). These challenges emphasize the potential value of estimating controlled direct effects, which do not assume cross-world independence (Naimi et al., 2014). Pearl (2014), Richiardi et al. (2013), and Gonzalez et al. (2023) described options to cope with posttreatment


confounders when estimating natural direct effects for parametric or nonparametric models, and Acharya et al. (2016) described sensitivity analysis when estimating controlled direct effects.

Interventional Direct and Indirect Effects

Interventional direct and indirect effects are not defined in terms of cross-world counterfactuals, so they are identified under weaker assumptions that allow for posttreatment confounders (Vansteelandt & Daniel, 2017). In a basic mediation model with a dichotomous cause X (0 = control, 1 = treatment) and a single mediator M, the interventional direct effect (IDE) is the difference between treatment and control conditions while keeping the mediator for each case equal to the value of a random draw from a distribution for the mediator under the control condition. Defined as an expected value,

IDE = E(Y1,G0 − Y0,G0)   (20.14)

where G0 represents a random draw from the distribution of the mediator in the absence of treatment.

The interventional indirect effect (IIE) is the change in outcome in the treatment condition as the mediator changes for each case from the value of a random draw from the distribution of the mediator in the control condition to the value of a random draw from the distribution for the treatment condition. As an expected value,

IIE = E(Y1,G1 − Y1,G0)   (20.15)

where G1 and G0 are distributions for the mediator under, respectively, treatment and control. For IIE ≠ 0, the cause must shift the distribution of the mediator from what it would be with no treatment to what it otherwise would be with treatment. In a basic mediation model with no posttreatment confounding, natural (in)direct effects can also be interpreted as interventional (in)direct effects, and their sum equals the total effect defined in Equation 20.11 (Vansteelandt & Daniel, 2017). Natural (in)direct effects are not identified when the assumption of no posttreatment confounders is violated, but interventional (in)direct effects are still defined in this situation.

In models with multiple mediators, interventional (in)direct effects can be estimated without knowledge of the causal structure for the mediators. This includes situations when the direction of causal effects between mediators is unknown or when it is unknown whether mediators have unmeasured common causes (Loh et al., 2022). In contrast, path-specific indirect effects require specification of a particular causal order among multiple mediators, such as

X → M1 → M2 → Y

in Figure 20.2(c). As more mediators are added to the model, the challenge of specifying a correct causal structure among them also increases. This is especially true when analyzing high-dimension data, that is, the kind collected in disciplines such as genomics or environmental epigenetics, where the number of potential mediators can exceed the sample size in models of complex biological systems (Blum et al., 2020). Specification of the precise causal structure among, say, hundreds of intervening variables in a model of high-dimension mediation requires strong a priori knowledge that is not always available, and empirical searches based on significance testing for path-specific indirect effects do not solve the problem (Thoemmes, 2015).

Interventional (in)direct effects are identified under the assumption of no unmeasured confounding between (1) the cause and the outcome, (2) the cause and the mediators, and (3) all mediators and the outcome after controlling for the cause. If covariates are included in the model, it is assumed they are sufficient to control for confounding between the sets of variables just mentioned (Vansteelandt & Daniel, 2017). Unlike natural (in)direct effects, which are generally not identified when there are posttreatment confounders, all such confounders are considered as possible competing mediators when estimating interventional indirect effects (Loh et al., 2022). Suppose that variable X in Figure 20.2 represents the contrast of treatment and control. In Figure 20.2(c) with the direct effect M1 → M2, mediator M1 is a posttreatment confounder of the relation between M2 and Y. In Figure 20.2(d) with the opposite direct effect, or M2 → M1, it is mediator M2 that is a posttreatment confounder for the relation between M1 and Y. Both patterns just mentioned are treated simply as competing possibilities in an unspecified causal structure for the mediators when estimating interventional indirect effects (Loh et al., 2022).

Suppose that X is a dichotomous cause (0 = control, 1 = treatment), there are three mediators, M1–M3, and


Y is the outcome. The model is linear without interaction terms. Interventional indirect effects are estimated over two different models. In the marginal model, each mediator is separately regressed on the cause.7 Expressed as expected values with no intercept terms, the marginal model for this example is

E(M1 | X) = a1X   (20.16)
E(M2 | X) = a2X
E(M3 | X) = a3X

where the coefficient for each mediator, ai, is interpreted as the total effect of X. For example, coefficient a1 in the equation is interpreted as the total effect of X on M1 that includes the direct effect and all four of the possible indirect causal pathways between X and M1 that involve the other mediators listed next:

X → M2 → M1      X → M3 → M1
X → M2 → M3 → M1   X → M3 → M2 → M1

Thus, although the researcher specifies no causal dependence among the mediators, all possible indirect effects between X and M1 that involve any combination or causal order between the other two mediators plus the direct effect are estimated by coefficient a1 in Equation 20.16. The drawback is that path-specific indirect effects are not estimated. Coefficients a2 and a3 for, respectively, mediators M2 and M3 have analogous interpretations.

7 Loh et al. (2022) used the term "marginal mean model" for equations with intercepts.

In the outcome model, variable Y is regressed on the cause and all mediators, which expressed as an expected value with no intercepts for this example is

E(Y | X, M1–M3) = b1M1 + b2M2 + b3M3 + c′X   (20.17)

Coefficients for the mediators, b1–b3, represent the direct, not total, effects of the mediators M1–M3 on the outcome Y. For example, coefficient b1 estimates the direct effect of M1 on Y while controlling for the other two mediators and X; coefficients b2 and b3 for, respectively, M2 and M3 have similar meanings. Coefficient c′ in the equation estimates the interventional direct effect of X on Y that avoids all three mediators. Specifically, the treatment effect is estimated by controlling the mediators for each case at random draws from the joint distribution for M1–M3 under the control condition (Vansteelandt & Daniel, 2017).

Interventional indirect effects are computed as products of the coefficients ai for the total effect of X on each mediator in the marginal model (Equation 20.16) with the corresponding coefficients bi for the direct effect of each mediator on Y in the outcome model (Equation 20.17). These product estimators in this example are listed next for, respectively, mediators M1–M3:

IIE1 = a1b1   (20.18)
IIE2 = a2b2
IIE3 = a3b3

For example, the product a1b1 estimates for treated cases the effect of shifting the distribution for mediator M1 from its counterfactual distribution under control to its distribution under treatment while fixing the distributions for M2 and M3 to those under the control condition. It corresponds to the combination of the five path-specific indirect effects in this example listed next:

X → M1 → Y
X → M2 → M1 → Y
X → M3 → M1 → Y
X → M2 → M3 → M1 → Y
X → M3 → M2 → M1 → Y

Thus, the product a1b1 in Equation 20.18 represents all effects of X on Y that are mediated by M1, but not by any causal descendent of M1, and this interpretation holds regardless of the true causal structure for the mediators (Vansteelandt & Daniel, 2017). Exercise 5 asks you to list all the path-specific indirect effects represented by the product a2b2 for mediator M2.

The joint interventional indirect effect (IIEjo) is the sum of all interventional indirect effects. For this example,

IIEjo = IIE1 + IIE2 + IIE3   (20.19)

The total interventional effect (TIE) is the sum of the direct effect of the cause that avoids all mediators and the joint interventional indirect effect. For this example,

TIE = c′ + IIEjo   (20.20)

If no mediators are posttreatment confounders, the value


of TIE will equal that of the total effect (TE) defined in Equation 20.11; otherwise, the TIE is described by Nguyen et al. (2021) and others as the overall effect of shifting the cause from control to treatment (from X = 0 to X = 1) while also shifting the mediator distributions from their state under control to their state under treatment. See Topic Box 20.3 for a numerical example of computing interventional (in)direct effects.

Loh et al. (2022) described estimation of interventional (in)direct effects for more complex models with cause–mediator, mediator–mediator, or cause–mediator–mediator interactions. The decomposition of total effects is more complicated for such models compared with the example just considered for which no interaction was assumed, but knowledge of the causal structure for the mediators is still not required. Results of simulation studies by Loh et al. (2022) indicated that estimates of interventional (in)direct effects can be very biased when the true model has mediator–mediator interactions but such effects are omitted from the researcher's model. In general, allowing for all interactions between mediators in the outcome model yielded unbiased estimates, but larger samples would be needed. The R package medoutcon estimates natural and interventional (in)direct effects (Hejazi, Diaz, et al., 2022).8

8 https://code.nimahejazi.org/medoutcon/

There are still other relatively new approaches to mediation analysis, but it is beyond the scope of this chapter to describe them in detail. For example, Bullock and Green (2021) explained the application of methods from experimental designs and the use of instruments to estimate features of treatments that are added or subtracted in ways that indicate the involvement of some mediators but not others, or implicit mediation analysis. Hejazi, Rudolph, et al. (2022) described mediation analysis for stochastic interventions, which are neither binary (full treatment versus none) nor static (level remains unchanged once assigned) variables. Instead, intervention is viewed as a random variable with multiple levels, such as changes in drug doses or levels of financial assistance, over ranges of many possible values that can change over time. Effects of such treatments are estimated under possible hypothetical interventions on mediators where only a portion of their values are altered. These and other recent developments outlined in this chapter give hope to the idea that the overly long adolescence of mediation analysis is finally ending—see Gonzalez et al. (2023).

REPORTING STANDARDS FOR MEDIATION STUDIES

Vo et al. (2020) described characteristics and reporting practices for mediation analysis in a total of 98 randomized clinical trials published in MEDLINE for 2017–2018. Although randomization guarantees time precedence in measurement for the cause (i.e., intervention), mediators and outcomes were measured simultaneously in just over half of the studies (53%). Relatively few mediation analyses (4%) were based on a counterfactual approach that includes estimation of interaction effects. Controls for possible confounders between multiple mediators or between mediators and outcomes were analyzed in about 60% of the studies, although the number of covariates was generally small (e.g., 2–5). Assumptions of mediation analysis or goodness of model fit were explicitly addressed in only about 30% of studies. Overall, the conduct and reporting of mediation analyses were very heterogeneous, and Vo et al. (2020) called for consensus-based guidelines for mediation studies.

Lee et al. (2021) developed such guidelines for reporting results from mediation analyses in randomized trials or observational studies based on a systematic review of the literature and a Delphi survey of statisticians, researchers, journal editors, and representatives of the EQUATOR (Enhancing the Quality and Transparency of Health Research) Network.9 Consensus meetings of team members were followed by an external review that included potential users and resulted in the 25-item AGReMA (A Guideline for Reporting Mediation Analyses) Long-Form Checklist for studies in which mediation analysis is a primary focus, and the nine-item AGReMA Short-Form Checklist for studies where mediation analysis is a secondary aim. The points from both checklists that specifically address reporting about mediation analyses are summarized here:

1. Describe how the sample size was determined for mediation analysis (e.g., power analysis, accuracy in parameter estimation).

2. Include a graphical representation of the entire model, such as a directed acyclic graph (DAG), including causes, mediators, outcomes, and confounders (covariates). State the assumptions of the model.

9 https://www.equator-network.org/


TOPIC BOX 20.3

Numerical Example of Interventional Direct and Indirect Effects

Summarized in Table 20.3 are data from Van Ryzin et al. (2013), who analyzed a larger model of direct and indirect effects for a family-based randomized intervention for grade 6 students developed to address communication, parenting skills, and management of adolescent behavioral problems. The two mediators in the table, including eating attitudes, where higher scores indicate higher levels of unhealthy eating (e.g., binge eating or purging), and levels of depression, were measured 5 years later when study participants were 17 years old.

The outcome variable in Table 20.3 is the body-mass index (BMI) at age 22 years, or when the participants were young adults. Van Ryzin et al. (2013) classified participants as obese for BMI ≥ 30 versus not obese for BMI < 30. A total of about 22% of the participants were classified as obese under the threshold just mentioned. They reported point-biserial correlations (rpb) between obesity as a dichotomy and the other variables in Table 20.3. For this example, I converted these rpb values to biserial correlations, which estimate the Pearson correlation between a true continuous variable that was dichotomized and another continuous variable (MacCallum et al., 2002, p. 24). Thus, the original continuous outcome, or BMI, is approximately recovered by these conversions, and the mean and standard deviation for the BMI reported in the table are based on norms for 19-year-old men and women in the United States (Ogden et al., 2004).

TABLE 20.3. Input Data (Correlations, Means, Standard Deviations) for Estimation of Interventional Direct and Indirect Effects of a Family-Based Intervention on Body Mass Index with Eating Attitude and Depression as Mediators

Variable                     Age (years)     1       2       3       4
Cause
 1. Intervention condition       12          —
Mediators
 2. Eating attitudes             17          .04     —
 3. Depression                   17         –.03     .30     —
Outcome
 4. BMI                          22          0       .20     .03     —

 M                               —           .50     .30   58.20   25.17
 SD                              —           .50     .25    4.63    6.31

Note. Input data are from Van Ryzin et al. (2013), N = 792. BMI, body mass index. Results for the BMI are biserial correlations.

The marginal and outcome models for this example are represented next by regression equations without intercepts. We assume no cause–mediator or mediator–mediator interactions. The marginal model where the mediators (eating attitudes, depression) are regressed on the cause (intervention) is

M̂1(eat) = a1X   and   M̂2(dep) = a2X

and the outcome model where BMI is regressed on the mediators and cause is

Ŷbmi = b1M1(eat) + b2M2(dep) + c′X

Given the marginal and outcome models just defined, the analytical model corresponds to Figure 20.2(a), which is a parallel mediation model with two mediators where (1) each mediator is regressed on the cause, and (2) the outcome is regressed on all the other variables. Figure 20.2(a) represents the combined marginal and outcome models for this example. But Figure 20.2(a) is not the causal model for this analysis because it assumes the mediators are independent except for their spurious association due to a common cause. In contrast, interventional indirect effects allow for any pattern of causation between the mediators (e.g., Figures 20.2(c) or 20.2(d)). Thus, product estimators for interventional indirect effects have a different interpretation compared with path-specific product estimators in parallel mediation models (Loh et al., 2022).

Listed in Table 20.1 for analysis 2 are the syntax and output files for fitting a parallel mediation model as just described to the data in Table 20.3 in lavaan. Note that neither the standard fit statistics nor the residuals normally printed by lavaan are relevant in this analysis where the causal structure for the mediators is both unknown and saturated (it has all possible paths). Thus, this output was suppressed in the analysis. Values of key unstandardized path coefficients from lavaan output are listed next:

a1 = .020   a2 = –.278
b1 = 5.310   b2 = –.046   c′ = –.119

Given the coefficients just listed, values of interventional (in)direct effects for this example are listed next:

IDE = –.119
IIE1(eat) = .020(5.310) = .106   IIE2(dep) = –.278(–.046) = .013
IIEjo = .106 + .013 = .119
TIE = –.119 + .119 = .000

We can say that the overall effect of intervention (TIE) is essentially zero, which is consistent with the near-zero correlation between intervention and BMI in Table 20.3, but the pattern of effects is complex: Intervention decreases BMI by .119 when both mediators are avoided (IDE), or when the mediators are controlled for each case by random draws from distributions assuming no treatment. But mediated effects of intervention increase BMI by .119 altogether (IIEjo). Specifically, intervention (1) increases BMI by .106 through its effect on eating attitudes (IIE1(eat)) and (2) also increases BMI by .013 by reducing depression, which elevates BMI (IIE2(dep)). These effects are relatively small, given the standard deviation for BMI, or 6.31 (Table 20.3), but a decade separates cause and outcome in this example.
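The actual syntax for this analysis is in Table 20.1 (analysis 2) and is not reproduced here, but a minimal lavaan sketch along the same lines, reconstructed from the summary data in Table 20.3, might look as follows; the variable names and parameter labels are my own choices.

```r
library(lavaan)

# Summary data from Table 20.3 (Van Ryzin et al., 2013), N = 792
S <- getCov('
   1
   .04  1
  -.03  .30  1
   0    .20  .03  1',
  sds   = c(.50, .25, 4.63, 6.31),
  names = c("X", "Meat", "Mdep", "BMI"))
means <- c(X = .50, Meat = .30, Mdep = 58.20, BMI = 25.17)

model <- '
  # marginal model: each mediator regressed on the cause
  Meat ~ a1*X
  Mdep ~ a2*X
  Meat ~~ Mdep        # mediator residuals covary (causal structure unspecified)
  # outcome model: BMI regressed on the mediators and the cause
  BMI ~ b1*Meat + b2*Mdep + cp*X
  # interventional (in)direct effects, Equations 20.18-20.20
  IIE_eat := a1*b1
  IIE_dep := a2*b2
  IIE_jo  := a1*b1 + a2*b2
  TIE     := cp + a1*b1 + a2*b2
'
fit <- sem(model, sample.cov = S, sample.mean = means,
           sample.nobs = 792, meanstructure = TRUE)
parameterEstimates(fit)
# Expect a1 ~ .020, a2 ~ -.278, b1 ~ 5.310, b2 ~ -.046, cp ~ -.119,
# matching the coefficients reported in Topic Box 20.3
```

Because the model is saturated, the fit statistics carry no information here; only the coefficient and defined-parameter estimates are of interest.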

3. Describe how causes, mediators, outcomes, and other variables were measured, their psychometrics, and how blinding was used to reduce potential bias.

4. Specify the statistical methods used to estimate mediation or other special model building (or trimming) procedures to modify the initial model.

5. Describe results for any sensitivity analyses on the effects of violated assumptions that concern missing data, unmeasured confounders, or distributional properties of the data.

6. Report effect sizes for effects of interest and, if possible, estimates of their uncertainty, such as confidence intervals.


SUMMARY

Iacobucci et al. (2007) offered some great advice about testing for mediation:

	First, step away from the computer. A mediation analysis is not always necessary. Many processes should be inferable from their resultant outcomes. If you must conduct a mediation analysis, be sure it has a strong theoretical basis, clearly integrated and implied by the focal conceptualization, not an afterthought. Further, be prepared to argue against, and empirically test, alternative models of explanation. (p. 151, emphasis added)

Part of planning for mediation analysis includes the use of an appropriate research design. Cross-sectional designs, which have no temporal precedence in measurement, are generally inadequate without ironclad rationales about causal order; otherwise, the presence of equivalent models, some of which may involve no indirect effects at all, is a serious validity threat. There are relatively new methods for estimating mediation in randomized trials that allow for interaction between causes and putative mediators or that do not require the specification of path-specific indirect effects, which needs knowledge of the causal structure for multiple mediators. The point just mentioned refers to the focus on interventional indirect effects, which Vansteelandt and Daniel (2017) described as "less ambitious" compared with estimating path-specific indirect effects, but it is probably more realistic because the method is agnostic to what might be an unknown causal structure for multiple mediators. A latent growth modeling approach to mediation analysis is described in the next chapter.

LEARN MORE

Bullock and Green (2021) demonstrate the application of principles and techniques from experimental designs to the estimation of mediated effects and articulate their assumptions, Loh et al. (2022) introduce interventional (in)direct effects to psychology researchers, and Nguyen et al. (2021) offer an accessible explanation of causal mediation analysis to the same audience.

Bullock, J. G., & Green, D. P. (2021). The failings of conventional mediation analysis and a design-based alternative. Advances in Methods and Practices in Psychological Science, 4(4), 1–18.

Loh, W. W., Moerkerke, B., Loeys, T., & Vansteelandt, S. (2022). Disentangling indirect effects through multiple mediators without assuming any causal structure among the mediators. Psychological Methods, 27(6), 982–999.

Nguyen, T. Q., Schmid, I., & Stuart, E. A. (2021). Clarifying causal mediation analysis for the applied researcher: Defining effects based on what we want to learn. Psychological Methods, 26(2), 255–271.
pared with estimating path-specific indirect effects,

EXERCISES

1. Interpret the coefficients for the indirect effect in the top part of Table 20.2.

2. Calculate and interpret the effect sizes in the bottom part of Table 20.2 from the information in the top part of the table.

3. Calculate dfM for the model in Figure 20.3(a) and show that it is identified.

4. Describe the relation between X and Y represented in Figure 20.3(a).

5. Describe the path-specific indirect effects of X on Y through M2 represented by the coefficient a2b2 for the interventional indirect effect in Equation 20.18.



21

Latent Growth Curve Models

Analysis of latent growth models—also called latent growth curve models—in SEM and other sta-
tistical techniques, such as multilevel modeling, focuses on the estimation of growth trajectories, or changes
over time on ≥ 1 repeated measures variables, based on statistical models of the hypothesized underlying
growth process at the group and individual levels. Antecedents of change, or covariates that predict indi-
vidual differences in growth trajectories, can optionally be included in the model. Growth curve analysis has
wide application in many research areas, and such models are seen so often in the literature that conflating
the terms longitudinal model and growth curve model is understandable, albeit partly incorrect (Little, 2013).
The description of models from the SEM perspective where repeated measures are specified as indicators of
latent variables that represent starting points or rates of change owes much to Meredith and Tisak (1990),
McArdle and Epstein (1987), Muthén (1991), and Willett and Sayer (1994), among others. There is also a
large body of work about the analysis of growth models in other statistical methods, especially multilevel
modeling (Bryk & Raudenbush, 1987).
It is not feasible to give a comprehensive account of latent growth modeling in a single chapter, so the
more practical goal is to extend what you have learned so far about measurement and structural models in
traditional SEM to the analysis of longitudinal data. Considered first are basic change models with latent
growth factors that represent initial levels or slopes for subsequent change. Next, models with variables
expected to predict growth trajectories are considered, and there are analysis examples with real data
for both types of models just mentioned. Finally, more advanced types of latent growth curve analyses are
described. If you have no prior background in analyzing longitudinal data from a growth curve perspec-
tive, I think this presentation should get you off to a good start. For more advanced works in this area, see
Duncan et al. (2006), Grimm and McArdle (2023), Grimm et al. (2017), Little (2013, chap. 8), or Newsom
(2015, chaps. 6–7).

BASIC LATENT GROWTH MODELS

Basic latent growth models have the features listed next (Byrne & Crombie, 2003; Curran et al., 2010; Duncan & Duncan, 2009; Stoel et al., 2004):

1. A single continuous dependent variable is measured on at least three different occasions in the same sample, and scores have the same units over time that can be said to measure the same construct at each assessment. The scores are not standardized, so correlations, standard deviations, and means are analyzed for continuous variables in their original metrics. There are no missing data, that is, each case is measured on every occasion. Distributions for all outcome variables are normal in shape. Under all the conditions just mentioned, the input data can be either a raw data file or a matrix of summary statistics, and the estimator can be default ML.

2. The data are time structured, which means that all cases are tested at the same intervals that need not



be equal. For example, a sample of children may be observed at 3, 6, 12, and 24 months of age. If other children are tested at, say, 4, 10, 15, and 30 months, their data cannot be analyzed together with those tested at other intervals.

3. The model is specified as a type of structural regression (SR) model with a mean structure for latent growth factors that represent fixed and random effects in individual trajectories over time. Fixed effects correspond to single values in the population, such as the average starting point (initial level) or mean slope (rate of change), that are presumed to underlie the empirical growth records, or the temporally sequenced outcomes for individual cases observed in samples.

4. Random effects in latent growth models represent probability distributions around fixed effects, and they are generally estimated as variances at the case level around mean initial levels or slopes. If these variances are zero, then a single growth trajectory is shared by all cases; otherwise, increasingly larger individual differences at the first observation or in slopes for change over time are indicated as random effects increase in magnitude.

5. Indicators of latent growth factors are the repeated measures on the dependent variable. Each indicator has an error term, so perfectly reliable scores are not assumed. In the analysis, error variances can be constrained to equality over all measurement occasions or freely estimated with no constraints, among other possibilities, but the choice should be based on substantive considerations. Within the limits of model identification, such as available degrees of freedom (dfM), error covariances can also be specified and estimated. For example, the hypothesis of correlated errors may be less plausible for assessments that are conducted annually compared with measurement over shorter intervals, such as weekly or daily. The possibility to test hypotheses about the error covariance structure for longitudinal data affords much flexibility in the analysis.

6. A basic latent growth model can optionally include time-invariant covariates, or observed variables measured for all cases just once and at the same time, such as the first measurement occasion. Time-invariant covariates in a latent growth model are often individual difference variables, such as family background or medical history, that (a) do not change in value over time and (b) are expected to predict growth trajectories. The goal is to determine what variables explain differences in initial levels or rates of change among individuals.

7. A growth predictor model is probably the most common type of latent growth model with time-invariant covariates: The latent growth factors are regressed on the covariates, and the corresponding path coefficients estimate the potential of the covariates to predict initial status or change. It is also possible to estimate the total proportion of variance in the latent growth factors explained by all covariates (i.e., R2). For reasons outlined later, a growth predictor model assumes full (complete) mediation in that covariates are specified to affect the repeated measures only indirectly through the latent growth factors.

8. Basic latent growth models in SEM as just described can also be analyzed as multilevel (two-level) models in the technique of hierarchical linear modeling (HLM)—also referred to as linear mixed modeling, multilevel modeling, or mixed effects modeling—where scores are treated as nested or clustered within individuals (level 1), and random coefficients for initial level or slope are regressed on covariates (level 2). The parameterization of a basic growth model in HLM is different. For example, repeated measures are specified as a single variable indexed by time, and time is treated as an explicit predictor in the model in HLM. But in SEM, repeated measures are treated as separate variables, and there is no requirement to explicitly include time as a variable in the model (McNeish, 2020). But HLM and SEM often generate the same parameter estimates for the same model and data. This is a point of convergence between the two techniques—see Curran (2003) for examples.

The requirements or assumptions of basic latent growth models as just described are restrictive, but there are options in SEM for relaxing or setting many aside. For example, multiple imputation or full information maximum likelihood (FIML) can be applied to incomplete longitudinal data assuming a missing at random (MAR) data loss pattern, which may be unlikely in studies where, for example, substance abusers have the highest risk for attrition even after controlling for auxiliary variables. Enders (2011) described options for handling data missing not at random (MNAR) in latent growth modeling; see also Enders (2023). Special estimators in SEM are available for continuous outcomes with nonnormal distributions or for categorical outcomes (Chapters 9, 18); see also


Zheng et al. (2022) for recommendations specific to latent growth modeling.

Time-varying covariates, which can change in value over time and are thus treated as repeated measures variables, can also be analyzed with no special problem. A relatively unique feature of SEM is that predictors of growth trajectories can also be analyzed as conceptual variables modeled as common factors with multiple indicators, and structural models for predictors can include indirect causal effects among them (e.g., mediation). It is also possible to analyze in SEM longitudinal data that are not time structured, or when measurement intervals vary over cases. These and other options for analyzing more complex latent growth models are outlined later in this chapter.

DATA SET FOR ANALYZING BASIC GROWTH MODELS WITH NO COVARIATES

Kim-Spoon et al. (2021) analyzed longitudinal functional magnetic resonance imaging (fMRI) data in a sample of adolescents measured annually over 4 years (ages 14–17 years). The sample size is small, N = 150, but the data are quite novel. At each assessment, participants engaged in a simulated lottery choice task while their blood-oxygen-dependent responses were measured in brain structures, such as the insula and dorsal anterior cingulate cortex (dACC), associated with risk processing in adults. A neural risk processing composite was computed for each adolescent, where higher scores indicate greater activation in the brain regions just mentioned at choice points in the lottery task with the potential for monetary loss (i.e., risk).

Presented in Table 21.1 are the summary statistics for neural risk processing, and the means and standard deviations at each yearly assessment are depicted in Figure 21.1. In general, both the levels and ranges of individual differences increase with age, although those changes are not strictly linear. For example, although there is slight decline over ages 15–16 years, neural risk processing generally increases on average over ages 14, 15–16, and 17 years. At age 14 years, the standard deviation is relatively small compared with that at age 17 years with more intermediate values observed for ages 15–16 years, but adolescents generally become more variable in neural risk processing with maturation. These patterns are consistent with the expectation that adolescence is characterized by increases in both risk-taking behavior and neural processing of decision-making about risk (Kim-Spoon et al., 2021).

TABLE 21.1. Input Data (Correlations, Standard Deviations, Means) for Analyses of Latent Growth Models for Change in Neural Risk Processing in Adolescents

Variable     1      2      3      4
1. R1        —
2. R2        .35    —
3. R3        .28    .35    —
4. R4        .29    .40    .45    —

M            .04    .61    .57    .83
SD           .05    .77    .76   1.15

Note. These data are from Kim-Spoon et al. (2021), N = 150.

[Figure 21.1 appears here: line plot of the neural risk processing composite (vertical axis, −.50 to 2.00) by age in years (14–17).]

FIGURE 21.1. Neural risk processing composite means and standard deviations for adolescents measured annually over a 4-year period.

Alternative latent growth models with no covariates for the data just described are presented in Figure 21.2. Full RAM graphical symbolism is used to represent the latent growth factors, including the special delta-1 symbol for the constant in a mean structure. Remember that representing mean structures in model diagrams is optional, but I do so here for pedagogical completeness.


More compact graphical symbolism is used to represent the error terms for annual neural risk processing designated as R1–R4 in the figure. Note that intercepts for the repeated measures are fixed to zero, which is indicated by the absence of direct tracings from delta-1 to these variables. But there are indirect tracings from delta-1 through the latent growth factors, and the coefficient for the total of all such tracings for each indicator represents a predicted mean (Chapter 9). Whether a latent growth model can closely predict the observed means is often of keen interest in the analysis (i.e., mean residuals should be closely inspected).

[Figure 21.2 appears here: path diagrams for four latent growth models of R1–R4: (a) random intercept-only (no growth); (b) latent basis growth for change from R1 to R4, basis coefficients 0, λ2, λ3, 1; (c) latent basis growth for change from R1 to R2, basis coefficients 0, 1, λ3, λ4; (d) linear growth, coefficients 0, 1, 2, 3.]

FIGURE 21.2. Basic latent growth models for change in neural risk processing (R) measured annually over 4 years (ages 14–17 years).

Random Intercept-Only Models

Figure 21.2(a) is a random intercept-only model that predicts no growth, or where there is a stable level over the repeated measures. Loadings for all indicators on the intercept factor are fixed to equal 1.0. The unstandardized coefficient for the direct tracing of delta-1 to Intercept represents the factor mean, μI, which is also the predicted average neural risk processing over all ages, or 14–17 years. Expressed as an equation,

μRi = μI   (21.1)


where the subscript i = 1–4 indexes the repeated measures over ages 14, 15, 16, and 17 years, respectively. The factor variance, σ²I, is the amount of random variation around the overall level (i.e., μI). It may seem counterintuitive to describe Figure 21.2(a) as a "growth" model, but it is consistent with the expectation that the growth trajectory is flat (Curran et al., 2010). For example, in their analysis of a latent growth model, Costa et al. (2013) found that empathy for patients among medical students did not appreciably decrease over 3 years; that is, a random intercept-only model was consistent with the data.

Basis Growth Models

Figure 21.2(b) is a latent basis growth model—also called a latent basis curve model or a level and shape model—with two growth factors, Intercept (i.e., level) and Shape. Loadings of all repeated measures on the intercept factor are fixed to equal 1.0, just as in the random intercept-only model (Figure 21.2(a)). A basis model predicts growth trajectories that are not flat, but growth is not specified as strictly linear. Instead, a basis model allows for linear or curvilinear change in any pattern that will be "captured" in the data and represented by the shape factor. This approach to growth modeling is called nonlinear curve fitting, and it represents a simpler, more flexible alternative to the specification of models with a separate latent growth factor for each change trend or polynomial, such as linear, quadratic, cubic, and so on. In the data for this example, average neural risk processing is higher at age 17 years compared with age 14 years, but growth over intermediate ages is not strictly linear—see Figure 21.1. The method for explaining how a basis growth model is specified without prior hypotheses about the specific functional form of change is covered next.

Unstandardized loadings for the shape factor in Figure 21.2(b) are called basis coefficients. Two such coefficients are fixed to equal constants: The specification λ1 = 0 for R1 defines the intercept factor mean, μI, as the average level of neural risk processing at the first measurement occasion, or at 14 years of age, but as a latent, not observed, variable. It also defines σ²I as the random variation around the initial level. The specification λ4 = 1.0 for R4 in the figure scales the shape factor as defined momentarily. The other two basis coefficients—λ2 and λ3 for, respectively, ages 15 and 16 years—are free parameters. The whole set of basis coefficients, or

(0, λ2, λ3, 1.0)   (21.2)

defines μS, the shape factor mean, as the average change in neural risk processing between 14 and 17 years of age. The same weights in the equation also define σ²S, the shape factor variance, as the random variation around the mean overall change.

Equation 21.2 scales the growth factor Shape in Figure 21.2(b) so that a 1-unit change in time refers to the whole period of observation, or ages 14–17 years inclusive. So defined, loadings λ2 and λ3 are interpreted as proportions of the total overall change that has occurred up to and including the corresponding measurement (McNeish, 2020). For example, λ3 = .70 for R3 in the figure would mean that .70, or 70%, of the total increase in neural risk processing over ages 14–17 years has occurred by age 16 years. Given the observed means for R1–R4 in Table 21.1 or, respectively,

.04, .61, .57, .83

it is expected that λ3 < λ2 because the mean declines from ages 15 to 16 years, or from .61 to .57, before it increases again to its final level, or .83. In this way, the "dip" in mean neural risk processing from ages 15–16 years will be reflected in the relative values of basis coefficients.

There are two indirect tracings from delta-1 in the mean structure for Figure 21.2(b) to each repeated measure, one through Intercept and the other through Shape. The coefficient for the sum of both tracings just mentioned is the predicted mean at each age, or

μRi = μI + λiμS   (21.3)

In other words, the predicted average neural risk processing is the sum of the initial mean at age 14 years plus the proportion of total change over ages 14–17 years that has occurred up to and including the point of measurement indicated by the subscript i = 1–4. Given λ3 = .70, for instance, the predicted mean at age 16 years is

μR3 = μI + .70μS

or the average at age 14 years plus 70% of the overall amount of change that has occurred by age 17 years.

The covariance in Figure 21.2(b) represents the association between Intercept and Shape. A positive


covariance indicates that higher initial levels of neural risk processing at age 14 years predict a higher rate of subsequent change between 14–17 years. That is, cases who start at higher levels change the most over time. A negative covariance indicates the opposite: Higher initial standing predicts less change over 4 years. Finally, a factor covariance of zero indicates that initial level has nothing to do with the rate of subsequent change. The question of whether change is predictable by initial level is often of substantive interest in growth modeling.

You should know that there are alternative ways to scale the shape factor through specification of its basis coefficients. Although how Shape is scaled does not affect model fit, the interpretation of certain parameters depends on the scaling method. For example, the shape factor in Figure 21.2(c) is scaled by the alternative set of basis coefficients

(0, 1.0, λ3, λ4)   (21.4)

where the loadings for neural risk processing at ages 14 and 15 years (R1, R2) are fixed to equal, respectively, 0 and 1.0, but the loadings for ages 16 and 17 years (R3, R4) are free parameters. So defined, the factor mean μS in Figure 21.2(c) is the average change between ages 14–15 years, and the factor variance, σ²S, is random variation in change over this 1-year period. In contrast, the mean and variance for Shape in Figure 21.2(b) are both defined based on change or variation in change over the entire age range, or 14–17 years (i.e., a 4-year period; compare Equations 21.2 and 21.4).

Although indicator means for Figure 21.2(c) are also generated using Equation 21.3, the two freely estimated basis coefficients, or λ3 and λ4 in Equation 21.4, have different interpretations. For example, λ3 for R3 at age 16 years is the proportion of the average change in neural risk processing from ages 14 to 15 years (R1, R2) that must be added to the initial mean at age 14 years (i.e., μI) in order to generate the mean at age 16 years. If λ3 = 1.25, for example, then the predicted mean at age 16 is

μR3 = μI + 1.25μS

which is the initial mean at age 14 years plus 1.25 times the amount of change expected to occur on average between ages 14 and 15 years. The coefficient λ4 has a similar interpretation except that it is the proportion of change over ages 14–15 years that must be added to the initial level to generate the final mean at age 17 years. Because the observed means decline from ages 15 to 16 years, we expect that λ3 < 1.0 and λ3 < λ4 when Figure 21.2(c) is fitted to the data in Table 21.1.

How the shape factor is scaled also affects its covariance with the intercept factor. This is because the interpretation of σ²S is determined by the set of basis coefficients specified by the researcher (e.g., Equation 21.2 vs. 21.4), and the variance of Shape contributes to its covariance with Intercept. However, the correlation between Shape and Intercept in the standardized solution is not affected by scaling. It is also true that indicator error variances and covariances are unaffected by the scaling method (e.g., Little, 2013, pp. 249–256). In written reports, always describe how latent growth factors are scaled.

Proportionality Assumption

Latent basis growth models afford much flexibility in that the researcher is not required to specify any detail about the nature of curvilinear change, or even if curvilinear change is expected. This feature supports more exploratory analyses of change, including situations where departure from linear change is not of theoretical interest. Also, basis growth models keep things relatively simple because they always have just two growth factors, Intercept and Shape, regardless of the pattern of linear or curvilinear growth in the data. But there is a potential cost, as explained next.

Basis growth models require the proportionality assumption that the percentage of growth observed at each measurement occasion is equal for all cases. For example, if λ3 = .70 in Figure 21.2(b), then it is assumed that 70% of the total growth has occurred for all cases by the third measurement occasion, or age 16 years in the ongoing example. Random variation in initial standing or total change over ages 14–17 years, or σ²I > 0 and σ²S > 0, is allowed, but the requirement that all cases reach the same proportions of their final status at each follow-up is restrictive. If true growth trajectories are mainly linear, the proportionality assumption may present little difficulty. But if true change is quadratic, the proportionality assumption will be violated unless (1) the linear slope is basically zero and (2) there is almost no between-person variation in linear or quadratic slopes (McNeish, 2020).

Based on computer simulations by Wu and Lang (2016), violation of the proportionality assumption can


Based on computer simulations by Wu and Lang (2016), violation of the proportionality assumption can result in substantially biased parameter estimates when analyzing basis change models. However, misspecification due to violation of the proportionality assumption was not generally detected by global fit statistics, including the model chi-square, RMSEA, and SRMR. Based on these results, Wu and Lang (2016) warned against uncritical enthusiasm for basis growth models, especially if there is nontrivial variation in curvilinear change over time. They also suggested that researchers consider alternative models, including those described in the next two sections, that do not assume proportionality. McNeish (2020) described the analysis of basis growth models respecified to relax the proportionality assumption using Bayesian methods. In this approach, each basis coefficient is modeled as a latent variable that itself varies over cases; that is, loadings on the shape factor are not constants for the whole sample. The method requires working knowledge of Bayesian estimation and the analysis of raw data files, not summary statistics.

Linear Growth Models

Linear growth models assume that change is strictly linear; thus, they are constrained versions of basis growth models for the same variables. They differ in that (1) there is no proportionality assumption for linear growth models, and (2) the coefficients in linear models are all fixed to equal constants that correspond to times of measurement (i.e., no coefficients are free parameters). For example, the coefficients for the linear factor in Figure 21.2(d) are

(0, 1.0, 2.0, 3.0)    (21.5)

which defines the intercept as corresponding to the first measurement at age 14 years (i.e., λ1 = 0). Equation 21.5 also specifies equally spaced intervals for the next three annual follow-ups over ages 15–17 years. The coefficients in Equation 21.5 increase in value, which reflects the hypothesis that neural risk processing uniformly increases with age.

Because the linear model in Figure 21.2(d) is just a constrained version of the basis growth model in Figure 21.2(c), the relative fit to the same data can be directly compared with the chi-square difference test. A model of strict linear change is not consistent with the observed means for this example (Table 21.1, Figure 21.1), so its fit to the data may be problematic. The random intercept-only model in Figure 21.2(a) is nested under all other models in the figure, so the chi-square difference test applies to the no-growth model, too. Exercise 1 asks you to calculate dfM for Figures 21.2(a), 21.2(b), and 21.2(d).
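To make these specifications concrete, a minimal lavaan sketch for the linear and basis growth models follows. The data frame riskdata and its column names R1–R4 are assumptions for illustration, not part of the published analyses.

```r
library(lavaan)

linear.model <- '
  Intercept =~ 1*R1 + 1*R2 + 1*R3 + 1*R4
  Shape     =~ 0*R1 + 1*R2 + 2*R3 + 3*R4   # Equation 21.5: all loadings fixed
'
basis.model <- '
  Intercept =~ 1*R1 + 1*R2 + 1*R3 + 1*R4
  Shape     =~ 0*R1 + 1*R2 + R3 + R4       # loadings for R3 and R4 are free
'
# growth() supplies the mean structure: indicator intercepts are fixed
# to zero and the factor means are free parameters
fit.linear <- growth(linear.model, data = riskdata)
fit.basis  <- growth(basis.model,  data = riskdata)

# the linear model is nested under the basis model, so anova() returns
# the chi-square difference test just described
anova(fit.linear, fit.basis)
```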
Linear and Quadratic (Polynomial) Growth Models

Presented in Figure 21.3 is a polynomial model with both linear and quadratic growth factors. The coefficients for the quadratic factor are fixed to equal the constants

(0, 1.0, 4.0, 9.0)    (21.6)

which for each measurement occasion is just the square of the corresponding coefficient for the linear factor (see the figure). The model allows for both linear and quadratic trends in growth, but any higher-order polynomial or trend, such as cubic growth, is assumed to be zero. The mean for the intercept factor is the average neural risk processing at age 14 years, and its variance is the random variation around this mean initial level. The linear factor mean, μL, is the average amount of yearly increase in neural risk processing, which is assumed to be the same between all pairs of adjacent years, and σ²L is the random variation around the annual change. The quadratic factor mean, μQ, is the average curvature or departure from linear growth, and its variance σ²Q is random variation in quadratic growth curves over cases.

In Figure 21.3, delta-1 in the mean structure has indirect tracings for each indicator through each of the three growth factors, Intercept, Linear, and Quadratic. Thus, the equation that generates indicator means is

μRi = μI + λLi μL + λQi μQ    (21.7)

where λLi and λQi are, respectively, coefficients for the linear and quadratic factors. For example, at age 16 years (R3), λL3 = 2.0, λQ3 = 4.0, and

μR3 = μI + 2.0 μL + 4.0 μQ

Thus, the mean neural risk processing at age 16 years is the average at age 14 years plus 2 times the average annual change plus 4 times the average quadratic curvature.
FIGURE 21.3. Polynomial growth model for linear and quadratic change in neural risk processing (R) over 4 years (ages 14–17 years).

All three latent growth factors are assumed to covary in Figure 21.3, so each of linear change and quadratic change can be related to the initial level, and the two functional forms of change just mentioned can covary. A potential advantage is that a polynomial change model can be tested hierarchically, such as first with just Intercept (i.e., a no-growth model), then next with just Intercept and Linear (i.e., strictly linear change), and finally with all the growth factors (linear and quadratic trends are estimated)—see Llabre et al. (2004), who compared linear-only, linear-and-quadratic, and other growth models of cardiovascular recovery from stress for an example. In a computer simulation study, Kim et al. (2018) reported that model trimming, or starting with the most complex growth model, was more likely to detect the true growth trajectory than model building, which begins with the simplest model (e.g., random intercept-only).

It is possible, especially in smaller samples, that extreme collinearity can cause the analysis of polynomial growth models to fail. Polynomial growth factors, such as Linear and Quadratic in Figure 21.3, can be very highly correlated, and data in small samples may not be sufficiently precise to distinguish between them. Another challenge is that the variance of a polynomial factor can be relatively small, which can restrict the statistical power of tests for linear or curvilinear trajectories. Exercise 2 asks you to compute dfM for Figure 21.3.
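A minimal lavaan sketch of the polynomial model in Figure 21.3 follows; the data frame riskdata with columns R1–R4 is again an assumed name.

```r
library(lavaan)

poly.model <- '
  Intercept =~ 1*R1 + 1*R2 + 1*R3 + 1*R4
  Linear    =~ 0*R1 + 1*R2 + 2*R3 + 3*R4   # Equation 21.5
  Quadratic =~ 0*R1 + 1*R2 + 4*R3 + 9*R4   # Equation 21.6: squared linear weights
'
fit.poly <- growth(poly.model, data = riskdata)

# hierarchical testing compares the no-growth, linear-only, and
# linear-plus-quadratic versions with chi-square difference tests
```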
All examples for specifying coefficients for linear or quadratic factors up to this point have assumed (1) equal measurement intervals for (2) repeated measures that generally increase over time where (3) the intercept factor is defined as the level at the first measurement occasion. How to deal with exceptions to these conditions is considered in Appendix 21.A.

EXAMPLE ANALYSES OF BASIC GROWTH MODELS

Listed in Table 21.2 for analysis 1 are the syntax and output files for fitting in lavaan a random intercept-only model, basis growth model, and linear growth model—Figures 21.2(a), 21.2(b), and 21.2(d), respectively—to the data in Table 21.1. All files can be downloaded from this book's website. In their analyses of the same data, Kim-Spoon et al. (2021) fixed the error variance for neural risk processing at age 14 years, or R1 in Figure 21.2, to zero; otherwise, the estimate is slightly negative (i.e., a Heywood case), which is consistent with a real error variance of zero. Given the narrow range of individual differences at this age—see Figure 21.1—this result for the error variance is not surprising. Therefore, the same zero constraint is imposed in the analyses described next.
TABLE 21.2. Script and Output Files for Analyses of Basic Latent Growth Models for Change in Neural Risk Processing Among Adolescents

Analysis                                              Script file                R packages
1. Basic growth models for change in neural risk     kim-spoon-growth.r         lavaan, semTools
   processing over ages 14–17 years
2. Growth model for predicting change in alcohol     duncan-predict-growth.r    lavaan, semTools
   use over ages 13–16 years

Note. Output files have the same names except the extension is ".out."

Reported in Table 21.3 are the values of global fit statistics and the results of the chi-square difference test for the three models tested. The fit of the random intercept-only model is very poor, so the hypothesis of no growth is not supported. In contrast, the basis growth model passes the chi-square test, but the power of this test at the .05 level, or .14 in the MacCallum–RMSEA method, is low. A much larger sample size of N = 1,542 would be needed for power ≥ .90. Values of approximate fit indexes are not problematic except for the upper bound of the 90% CI based on the RMSEA, or .105, but this result is not surprising in a small sample. The residuals, described later, do not suggest a gross problem with local fit. The linear growth model fails the chi-square test and has significantly worse fit to the data than the basis growth model. Values for the RMSEA and SRMR for the linear growth model are also unfavorable. Inspection of the residuals for the linear model indicates poor local fit—see the output file for analysis 1 (Table 21.2). Given all the results summarized to this point, the basis growth model is retained.
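Power analyses like those just cited can be computed in the semTools package. The sketch below is hedged: the null and alternative RMSEA values shown are conventional choices in the MacCallum–RMSEA framework and are assumptions for illustration, not necessarily the exact inputs behind the .14 and N = 1,542 figures reported above.

```r
library(semTools)

# power to reject close fit (rmsea0 = .05) when fit is actually not close
# (rmseaA = .08), for the basis model's dfM = 4 and N = 150
findRMSEApower(rmsea0 = .05, rmseaA = .08, df = 4, n = 150, alpha = .05)

# approximate sample size needed for power of .90 under the same assumptions
findRMSEAsamplesize(rmsea0 = .05, rmseaA = .08, df = 4, power = .90)
```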
Reported in Table 21.4 are the default ML estimates for the factor means, variances, and covariance in the basis growth model. The intercept factor mean is .040, which equals the observed mean neural risk processing at age 14 years at 3-decimal accuracy (Table 21.1). The shape factor mean is .817, which is close in value to the observed mean increase of .83 – .04, or .79, over ages 14–17 years, but Shape is a latent, not observed, variable. The factor covariance is positive, or .014, so higher levels of neural risk processing at age 14 years predicted greater overall increases by age 17 years. The factor correlation in the standardized solution is .410, which gives a better sense of effect size than the factor covariance.

Unstandardized parameter estimates for the neural risk processing indicators are presented in Table 21.5. Listed in the upper left part of the table are the basis coefficients for the shape factor, which for ages 14–17 years are, respectively,

(0, .678, .646, 1.000)

As expected, the coefficient for age 16 years, or .646, is somewhat lower than the coefficient for age 15 years, or .678. This pattern mirrors the dip in observed neural risk processing means between these two ages (Figure 21.1). The coefficient of .646 indicates that 64.6% of the total increase in mean neural risk processing over ages 14–17 years has occurred up to and including age 16 years.

TABLE 21.3. Values of Selected Global Fit Statistics for Latent Growth Models of Change in Neural Risk Processing in Adolescents

Model                 chiML     dfM   p        chiD      dfD   p        RMSEA [90% CI]      CFI     SRMR
No growth             267.629   9     < .001   —         —     —        .438 [.393, .484]   0       .452
Latent basis growth   2.811     4     .590     264.812   5     < .001   0 [0, .105]         1.000   .028
Linear growth         34.812    6     < .001   32.001    2     < .001   .179 [.124, .239]   .670    .122

Note. All results were computed in lavaan.


TABLE 21.4. Maximum Likelihood Parameter Estimates for Latent Growth Factors in a Basis Model of Change in Neural Risk Processing in Adolescents

                        Means                 Variances and covariance
Parameter            Estimate   SE            Estimate   SE
Intercept            .040       .004          .002       < .001 a
Shape                .817       .086          .487       .108
Intercept with Shape —          —             .014 b     .004

a .00029. b Factor correlation is .410.

Error variances are reported in the upper right part of the table, including the fixed-to-zero variance for the measurement at age 14 years, for which R2 must be 1.0. Values of R2 for the remaining three measurements over ages 15–17 years are all < .50, which is not ideal. This is because the model fails to explain most of the observed variation in continuous neural risk processing at these ages.

Presented in the bottom part of Table 21.5 for each indicator are the observed and predicted (model-implied) means, mean residuals (differences between observed and predicted means), and standardized mean residuals, which are significance tests expressed as normal deviates (z) for the corresponding mean residuals. In general, the model closely predicts the observed means, and none of the standardized mean residuals is significant at the .05 level, but power for these significance tests is probably low. For practice, you should be able to generate the predicted means in Table 21.5 from the parameter estimates in Tables 21.4 and 21.5 using Equation 21.3. Correlation residuals for the covariance structure are all less than .10 in absolute value—see the output file for analysis 1 in Table 21.2—so the basis growth model in Figure 21.2(b) explains their observed correlations relatively well. For Exercise 3, you are asked to rerun the analysis by fitting to the data the basis growth model in Figure 21.2(c), where the coefficients for Shape are defined by Equation 21.4 (i.e., 0, 1.0, λ3, λ4). Exercise 4 asks you to fit the linear and quadratic growth model in Figure 21.3 to the same data. This particular analysis fails, and you are asked to describe the problem.

TABLE 21.5. Maximum Likelihood Parameter Estimates and Predicted Means for Indicators in a Basis Model of Change in Neural Risk Processing in Adolescents

            Loadings (Shape)       Error variances a
Indicator   Estimate   SE          Estimate   SE
R1          0          —           0          —
R2          .678       .073        .359       .055
R3          .646       .071        .354       .053
R4          1.000      —           .753       .117

            Observed   Predicted   Mean       Standardized
Indicator   mean       mean        residual   mean residual
R1          .040       .040        0          0
R2          .610       .594        .016       .794
R3          .570       .568        .002       .122
R4          .830       .857        –.027      –.979

a R2 values for R1–R4 are, respectively, 1.000, .407, .388, and .408.


EXAMPLE FOR A GROWTH PREDICTOR MODEL WITH TIME-INVARIANT COVARIATES

In unconditional growth models with no covariates, latent growth factors are exogenous variables that are free to vary and covary (e.g., Figure 21.2). In contrast, growth predictor models with time-invariant covariates are conditional growth models, where growth factors are regressed on variables expected to predict growth trajectories. Thus, growth factors in predictor models are endogenous, not exogenous, variables with disturbances that represent variation unexplained by the covariates. Likewise, disturbance covariances represent associations between growth factors after controlling for the covariates. An example follows.

Duncan and Duncan (1996) surveyed a total of 321 adolescents about their levels of alcohol use over a 4-year period, or annually over ages 13–16 years. The data are summarized in Table 21.6, where the means for gender and family status at the first assessment are proportions of adolescents who are female (.573) or who live with just one parent (.446). The year-to-year increases in mean levels of alcohol use are generally consistent, which suggests linear growth.

The data just described are fitted to the conditional linear growth model presented in Figure 21.4 with the optional delta-1 symbol for the mean structure. Because the intercept and linear factors are endogenous, coefficients for the direct tracings from delta-1 are not means; instead, they are intercepts for the regression of the latent growth factors on the covariates. For example, the term αL in the figure designates the predicted value on the linear factor when both gender and family status equal zero (i.e., young adolescent males from two-parent families—see Table 21.6). The predicted mean for the linear factor is the sum of the coefficients for the direct tracing from delta-1 just mentioned and both indirect tracings through the covariates, or

μL = αL + βLG μG + βLF μF    (21.8)

That is, the mean of the linear factor is determined by the intercept and coefficients for its regression on gender and family status and the means of those predictors (Rule 9.5). Exercise 5 asks you to derive the expression for the predicted mean of the intercept factor in Figure 21.4.1

Predicted indicator means in Figure 21.4 are generated as the sum of the coefficients for all indirect tracings from the constant delta-1. You should verify that there is a total of six such tracings for each indicator. Altogether they represent indicator means as functions of their regressions on the latent growth factors and the regressions of the growth factors on the covariates.

1 Yes, the intercept factor in a predictor growth model itself has an intercept for its regression on the covariates.

TABLE 21.6. Input Data (Correlations, Standard Deviations, Means) for Analysis of a Growth Prediction Model for Alcohol Use by Adolescents

Variable            1       2       3       4       5       6

Alcohol use
1. A1               —
2. A2               .640    —
3. A3               .586    .670    —
4. A4               .454    .566    .621    —

Covariates
5. Gender           .001    .038    .118    .091    —
6. Family Status    .214    .149    .135    .163    .025    —

M                   2.271   2.560   2.694   2.965   .573    .446
SD                  1.002   .960    .912    .920    .504    .498

Note. These data are from Duncan and Duncan (1996); N = 321. Gender is coded 0 = male, 1 = female; family status is coded 0 = two parents, 1 = single parent.


Exercise 6 asks you to write the equation for the model-implied mean of A4, the alcohol use measurement at age 16 years. This exercise may help you to appreciate that many SEM computer tools can optionally print model-implied means for latent or observed variables when growth models are analyzed. All relations between the covariates and the repeated measures in the figure are represented as indirect; that is, the latent growth factors are specified to completely mediate effects of the covariates on the variables included in the growth model, or A1–A4 in Figure 21.4.

FIGURE 21.4. Conditional linear growth model with time-invariant covariates for predicting change in alcohol use (A) measured annually over 4 years (ages 13–16 years).

With a total of 6 observed variables, there are 6(9)/2 = 27 observations available to estimate the 18 parameters of Figure 21.4. These include

1. 8 variances (of 2 covariates, 2 factor disturbances, and 4 indicator errors);
2. 2 covariances (1 between the covariates and 1 between the factor disturbances);
3. 4 direct effects on the factors (2 from each covariate);
4. 2 means of the covariates; and
5. 2 factor intercepts.

Thus, dfM = 27 – 18 = 9. Listed for Analysis 2 in Table 21.2 are the syntax and output files for fitting the conditional growth model in Figure 21.4 to the data in Table 21.6. The analysis in lavaan with default ML converged to an admissible solution. Values of selected model fit statistics are listed next:

chiML(9) = 13.823, p = .129
RMSEA = .041, 90% CI [0, .081]
CFI = .992; SRMR = .027

The model passes the chi-square test, the power of which at the .05 level for N = 321 is only .408. For power ≥ .90, a sample size of N = 883 would be needed. Values of approximate fit indexes do not suggest gross problems in global fit, and the residuals (described later) seem satisfactory, too. Thus, the growth predictor model in Figure 21.4 is retained.
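Because Table 21.6 reports the complete input matrix, the model in Figure 21.4 can be fitted from summary statistics alone. One way to do so in lavaan is sketched below; the covariate names gender and famstat are assumed labels for the variables in Table 21.6.

```r
library(lavaan)

# lower-triangular correlations from Table 21.6
lower <- '
  1
   .640 1
   .586  .670 1
   .454  .566  .621 1
   .001  .038  .118  .091 1
   .214  .149  .135  .163  .025 1 '
covmat <- getCov(lower, sds = c(1.002, .960, .912, .920, .504, .498),
                 names = c("A1", "A2", "A3", "A4", "gender", "famstat"))
means <- c(2.271, 2.560, 2.694, 2.965, .573, .446)
names(means) <- colnames(covmat)

model <- '
  Intercept =~ 1*A1 + 1*A2 + 1*A3 + 1*A4
  Linear    =~ 0*A1 + 1*A2 + 2*A3 + 3*A4
  Intercept ~ gender + famstat
  Linear    ~ gender + famstat
'
# fixed.x = FALSE makes the covariate means, variances, and covariance
# free parameters, matching the count of 18 parameters (dfM = 9) above
fit <- growth(model, sample.cov = covmat, sample.mean = means,
              sample.nobs = 321, fixed.x = FALSE)
summary(fit, fit.measures = TRUE)
```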

TABLE 21.7. Maximum Likelihood Parameter Estimates for Latent Growth Factors in a Prediction Model for Alcohol Use by Adolescents

                   Intercepts          Disturbance variances and covariance
Parameter       Estimate   SE          Estimate   SE
Intercept       2.116 a    .092        .666 c     .074
Linear          .203 b     .031        .037 d     .101
DI with DL      —          —           –.077 e    .022

a Predicted mean is 2.291. b Predicted mean is .221. c R2 = .050. d R2 = .039. e Disturbance correlation is –.491.


Reported in the left side of Table 21.7 are the estimates for the intercepts of the latent growth factors in their regressions on the covariates, including 2.116 for Intercept and .203 for Linear. These regression intercepts contribute to their respective factor means, which were computed by lavaan and are reported in the table footnotes. Thus, the predicted average level of alcohol use at age 13 years is 2.291, and the mean annual increase is .221. The unstandardized disturbance variances and covariance for Intercept and Linear are reported on the right side of the table. The covariates explain .050, or 5.0%, of the total variation in the level of alcohol use at age 13 years (Intercept); they also explain .039, or 3.9%, of the total variation in the annual increase (Linear) over ages 13–16 years. The disturbance covariance is negative, or –.077. The standardized estimate (i.e., the factor correlation) is –.491. Thus, adolescents with higher levels of alcohol use at age 13 years had lower annual increases as they matured, and vice versa, while controlling for gender and family status.

Unstandardized regression coefficients for the covariates are reported in Table 21.8. Because both predictors are binary variables coded as 0/1, these coefficients are directly group mean differences after controlling for the other predictor. The only coefficient that is statistically significant at the .05 level is for family status as a predictor of initial level of alcohol use, or Intercept. The result .377 says that adolescents from single-parent families (coded as 1) have on average higher levels of alcohol use at age 13 years than their age-peers from two-parent families (coded as 0) by .377, while controlling for gender. The coefficient for predicting Linear from gender is nearly significant at the .05 level. This coefficient, or .065, indicates that young women (coded as 1) have a higher mean annual increase in alcohol use over ages 13–16 years by .065 than young men (coded as 0) over the same age range, controlling for family status. In their analysis of additional covariates, Duncan and Duncan (1996) reported that parent–child conflict, parent substance abuse, and peer encouragement of substance use also predicted not only higher levels of alcohol use, but also higher levels of cigarette and marijuana use over time.

Results for the alcohol use measures over ages 13–16 years are presented in Table 21.9. Their error variances are listed in the left part of the table, and all their R2 values (see the table note) exceed .50, so the model explains most of the variation in alcohol use at each age.

TABLE 21.8. Maximum Likelihood Parameter Estimates for Covariates (Path Coefficients) in a Growth Prediction Model for Alcohol Use by Adolescents

               Gender              Family status
Outcome     Estimate   SE         Estimate   SE
Intercept   .011       .105       .377       .106
Linear      .065       .035       –.044      .036
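As a quick arithmetic check, Equation 21.8 can be verified with the estimates in Tables 21.7 and 21.8 and the covariate means in Table 21.6:

```r
# checking Equation 21.8 for the Linear factor
alpha_L <- .203                      # intercept for Linear (Table 21.7)
beta_LG <- .065; beta_LF <- -.044    # paths from gender, family status (Table 21.8)
mean_G  <- .573; mean_F  <- .446     # covariate means (Table 21.6)

alpha_L + beta_LG * mean_G + beta_LF * mean_F   # about .221, the predicted mean
```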

TABLE 21.9. Maximum Likelihood Parameter Estimates and Predicted Means for Indicators in a Growth Predictor Model for Alcohol Use by Adolescents

            Error variances           Means
Indicator   Estimate a   SE     Observed   Predicted   Residual   Standardized residual
A1          .333         .050   2.271      2.291       –.020      –1.418
A2          .310         .033   2.560      2.511       .049       1.737
A3          .273         .029   2.694      2.732       –.038      –1.490
A4          .312         .046   2.965      2.952       .013       .877

a R2 values for A1–A4 are, respectively, .678, .651, .661, and .643.


Observed, predicted, residual, and standardized residual means are listed in the right part of the table. To summarize, observed and predicted means are quite similar, and no mean residuals are statistically significant at the .05 level. Absolute correlation residuals for the alcohol use variables (see the output file for analysis 2 in Table 21.2) are all < .10, so the model seems to adequately explain their intercorrelations, too.

PRACTICAL SUGGESTIONS FOR LATENT GROWTH MODELING

Suggestions for analyzing latent growth models are listed next:

1. Carefully screen and inspect the data (means, variances, correlation, distributions) and pick an appropriate method to deal with missing data (Chapter 4).

2. Inspect the empirical growth records, such as for sets of randomly selected cases in a large data set, to get a sense of trajectories in the data. The freely available JASP program has capabilities for plotting individual growth records (Chapter 5).

3. Also consult theory or results from prior empirical studies when specifying the type of growth curve model, such as a basis model versus a polynomial model when growth is curvilinear.

4. Fit an unconditional growth model without covariates, especially if the structural model for covariates has common factors each with multiple indicators or the prediction model has positive degrees of freedom (it is not saturated). If change over time is consistent with the unconditional growth model, then next add the covariates and analyze a conditional growth model. This strategy is similar to two-step modeling for SR models, where the whole model is broken down into its two components, in this case growth versus prediction. It is easier to detect misspecification in a conditional growth model after it is known that the corresponding unconditional growth model has satisfactory fit to the data.

5. If analyzing a parallel growth process model (defined in the next section), estimate separate models for each process before combining the two models. This tactic makes it easier to detect misspecification in the growth model for a particular domain.

6. Do not neglect to describe the residuals—especially mean residuals for indicators of latent growth factors—when reporting on model fit. A written summary for the analysis of a latent growth model—or any other type of model in SEM—is deficient without information about local fit.

EXTENSIONS OF LATENT GROWTH MODELS

There are many ways to extend the basic concepts of latent growth modeling considered to this point. Basically, any kind of growth model can be simultaneously fitted to data from multiple groups. The motivation for doing so is the same, too: to determine whether model parameters vary appreciably over samples from different populations (Chapter 12), among other possibilities. For example, Comeau and Boyle (2018) analyzed growth trajectories for internalizing and externalizing behavior from birth to age 14 years for children exposed to three different patterns of family poverty: always poor, never poor, and intermittently poor, or ≥ 1 transition(s) in and out of poverty. Children in always poor families had the highest levels of internalizing and externalizing behavior over all times, followed by children who experienced change in poverty, and then next by children from never poor families. Low maternal education compounded adverse effects of poverty for both internalizing and externalizing.

It is possible to relax the requirement for time-structured data that all cases should be measured at the same measurement intervals. One motivation is time binning, or the practice in longitudinal studies of treating cases with varying measurement schedules as if they were assessed at the exact same time. For example, data from tested children who ranged in age from 3 years and 10 months through 4 years and 2 months may be treated as though all cases were 4.0 years old at a particular follow-up. Here, a time bin of 4 months classifies the children as belonging to the same age group. A problem is that the wider the time bin, the greater the variation in sampling time and, thus, the greater the loss of temporal information about measurement. In other kinds of studies, individually varying time points may be inevitable, such as when data are only recorded at medical appointments that are scheduled as needed or when not all measures, such as fMRI scans, are collected at every visit. Treating individually varying time scores as though all cases were measured at the same occasions can result in appreciably biased estimates of parameters in latent growth models (Blozis & Cho, 2008).


In the statistical technique of HLM, the analysis of longitudinal data collected at individually varying time points is not a special problem because time is explicitly represented as a variable in the analysis. An option in SEM is the definition variable approach, where loadings for linear or nonlinear growth factors are treated as fixed parameters that define the occasions of measurement (i.e., time) on a case-by-case basis (Mehta & Neale, 2005). Specifically, the parameter matrix for coefficients varies according to measurement schedules for individual cases. For example, the definition variable could be specified to equal the exact ages of children tested around the age of 4.0 years, such as 3 years and 8 months for one child versus 4 years and 1 month for a different child, and so on. In this method, there is no single model-implied mean vector or covariance matrix for the whole sample; instead, there are as many of each kind of matrix just mentioned as there are different measurement schedules over all cases. A drawback is that the usual global fit statistics, including the model chi-square, may be unavailable in this method—see Sterba (2014) for examples.

The predictor growth model in Figure 21.4 features two time-invariant predictors of growth, which are assumed to have no direct effects on the repeated measures. A different role for a time-invariant covariate X is depicted in Figure 21.5(a)—shown with no mean structure or disturbances to save space—where the repeated measures are each directly regressed on the covariate. Stoel et al. (2004) referred to the figure as a direct effect growth curve model with ≥ 1 time-invariant covariates with direct effects on the indicators. Variable X is not part of the growth model per se; instead, it serves as a control variable such that parameters of growth factors are estimated after adjustment for direct effects of the covariate on the indicators. Under conditions described by Stoel et al. (2004) that involve model identification, it may be possible to test hypotheses that time-invariant covariates also affect the latent growth factors. Indeed, growth predictor models, where time-invariant covariates have no direct effects on the indicators, are constrained versions of direct effect models—see Stoel et al. (2004) for more information.

Represented in Figure 21.5(b) is a growth model with a time-varying covariate, X, where the subscripts indicate the measurement occasion, or X1–X3, that correspond to the measurement schedule for the repeated measures Y1–Y3. Although time-varying covariates are repeated measures, they are not included in the latent growth structure. Instead, their role is to control for effects on the indicators above and beyond those of the latent growth factors. Thus, a repeated measure on any occasion is jointly determined by the latent growth factors and by the direct effect of the corresponding time-varying covariate (Curran et al., 2010). For example, Panwar et al. (2022) analyzed latent growth models for monthly rates of COVID-19 cases per million (CPM) over the period August 2020 to July 2021 in a total of 126 countries. The models were more accurate when average monthly outdoor temperature was included as a time-varying covariate with direct effects on CPM. As expected, temperature was inversely related to growth in CPM; that is, higher environmental temperatures predicted slower growth rates, and countries with lower monthly temperatures had the highest rates of growth in CPM. At lower temperatures, people spend more time indoors, which increases the transmission rate of COVID-19.

All models considered so far are univariate in that they concern growth in a single domain. A multivariate growth model represents change across two or more domains. Other terms for this include parallel latent growth curve model and parallel growth process model, if the domains are measured at the same points in time (Kaplan, 2009). The model in Figure 21.5(c) has two sets of latent growth factors for different repeated measures, indicators X1–X3 and Y1–Y3, each measured at the same three occasions. The latent growth factors are specified as correlated across the domains. For example, the intercept factor for X1–X3 covaries with the intercept factor for Y1–Y3, and so on. These covariances represent cross-domain change, or where the starting point or slope of change in one domain is related to that in another domain. For example, Brailean et al. (2017) administered measures of depression and memory function on five occasions in a longitudinal study of older adults aged 65–89 years at the first measurement. They analyzed a cross-domain latent growth model for both sets of repeated measures, and they reported that (1) poorer initial recall performance predicted steeper increase in depressed mood over time, and (2) steeper decline in processing speed predicted a faster increase in somatic symptoms of depression over time. The authors hypothesized that older adults may experience greater levels of depression in reaction to poor memory performance.
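A minimal lavaan sketch of a parallel growth process model like Figure 21.5(c) follows; the data frame dat and the variable names X1–X3 and Y1–Y3 are assumptions for illustration.

```r
library(lavaan)

pp.model <- '
  iX =~ 1*X1 + 1*X2 + 1*X3
  sX =~ 0*X1 + 1*X2 + 2*X3
  iY =~ 1*Y1 + 1*Y2 + 1*Y3
  sY =~ 0*Y1 + 1*Y2 + 2*Y3
'
# all six factor covariances, including the cross-domain ones,
# are free parameters by default
fit.pp <- growth(pp.model, data = dat)
```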


FIGURE 21.5. Latent growth models with a time-invariant covariate as a control variable (a); covariates measured at each measurement occasion that are not included in the growth model (b); a multivariate model with correlated sets of latent growth factors (c); a curve-of-factors growth model (d); a piecewise growth model for a linear–linear process (e). Models shown with no mean structure or disturbances to save space. I, intercept; S, slope. All loadings on I are fixed to 1.0. The slope factor could refer to a shape factor in a basis growth model or to a polynomial trend, such as linear, in a polynomial growth model.

A curve-of-factors latent growth model—also called a second-order growth curve model or a latent variable longitudinal curve model (Newsom, 2015; Preacher et al., 2008)—is represented in Figure 21.5(d). The indicators for the latent growth factors in the figure are not manifest variables; instead, they are concepts modeled as common factors A1–A3, each with multiple indicators that are measured at three different occasions. We refer to A1–A3 as latent trait variables to clearly distinguish them from their indicators, which are observed variables such as Y11–Y13 for factor A1 at the first measurement occasion, and so on. Specification of coefficients for growth factors in second-order models follows the same logic as for first-order growth models with observed, not latent, variables as indicators (i.e., all the models considered to this point). For example, all loadings on the intercept factor in the figure are fixed to equal 1.0, and coefficients for the slope factor are specified based on measurement time, such as (0, 1.0, 2.0), which defines growth as strictly linear and defines the mean for the intercept as corresponding to the first measurement occasion.


A key feature of a second-order growth model is that estimates for latent growth factors are adjusted for measurement error in repeated measures. For the model in Figure 21.5(d), this property implies that departure from the true growth pattern (temporal instability), which corresponds to the disturbance variances for the latent trait variables, is estimated apart from unique variation due to measurement error (unreliability), which corresponds to the error terms for the observed variables (Preacher et al., 2008). Another advantage is the opportunity to test various hypotheses about longitudinal measurement invariance. For example, Figure 21.5(d) assumes occasion-specific congenerity, or that a single-factor model is the correct measurement model for indicators of the latent trait factors at all three measurement occasions. That is, the basic correspondence between the observed variables and the latent trait factors does not change over time. Rejection of the congenerity assumption means that what the observed variables measure depends on the particular measurement occasion, which presents a challenge in a longitudinal study. Additional, even more stringent hypotheses about measurement invariance that can be tested in longitudinal or other types of designs are covered in the next chapter. See Bishop et al. (2015) for more information about latent growth models with multiple indicators for latent trait factors.
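A structural sketch of a curve-of-factors model like Figure 21.5(d) in lavaan follows. The nine indicator names Y11–Y33 and the data frame dat are assumed, and the longitudinal invariance constraints needed for a defensible analysis of the mean structure (see the next chapter) are omitted here for brevity.

```r
library(lavaan)

cof.model <- '
  # first-order latent trait factors, one per occasion
  A1 =~ Y11 + Y12 + Y13
  A2 =~ Y21 + Y22 + Y23
  A3 =~ Y31 + Y32 + Y33
  # second-order growth factors defined by the trait factors
  i  =~ 1*A1 + 1*A2 + 1*A3
  s  =~ 0*A1 + 1*A2 + 2*A3
'
fit.cof <- growth(cof.model, data = dat)
```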
In a piecewise latent growth model—also called a multiphase model—there are at least two slope factors that represent growth over different phases or processes. Each phase is characterized by different functional forms of change that can represent effects of interventions, such as the beginning and endpoints of a treatment or public policy, or shifts in development that occur at different times throughout the lifespan. A multiphase model has at least one knot or change point that represents the transition from one growth phase to the next. A knot can be known a priori or estimated in the data, and it can be modeled as a fixed parameter, random coefficient, or free model parameter (Preacher et al., 2008).

When a change point is known, it loads on two different slope factors, one that models growth before the knot and the other that specifies a different pattern of growth after the knot. Consider Figure 21.5(e), which represents a linear–linear growth process. The first slope factor, S1, specifies a linear growth process over Y1–Y3 that levels out starting at Y3, the change point. Thus, the coefficients for S1 are

(0, 1.0, 2.0, 2.0, 2.0)

where linear growth is expected to taper off or reach an asymptote at Y3. Accordingly, the coefficients for Y3–Y5 are all fixed to equal 2.0 (i.e., no change). The mean for S1 represents the rate of linear growth from Y1 through Y3, or the first growth phase. The second slope factor in Figure 21.5(e), S2, represents additional linear change, if any, that occurs after Y3, the change point, due to a second growth process. The coefficients for S2 are

(0, 0, 0, 1.0, 2.0)

which specifies the end of the first stage of growth as the origin for the second stage of growth. The mean for S2 is the rate of linear change over Y3–Y5. A linear–linear process consistent with the model in Figure 21.5(e) is presented in Figure 21.6, which depicts linear growth that decelerates beginning at Y3, but change both before and after the knot is strictly linear. A rationale for specifying a particular change point is needed, such as a reduction in the level of funding after Y3 in the figure, among other justifications for multiphase growth. Harring et al. (2021) described piecewise models for three-stage or inherently nonlinear growth processes. Grimm and McArdle (2023) described additional types of latent growth models for estimating curvilinearity in development and for application in accelerated longitudinal designs, where multiple cohorts are intentionally measured at points in the developmental process and then followed for a period of time.

FIGURE 21.6. A two-phase (linear–linear) growth process. [Plot of mean level (y-axis) by measurement occasion Y1–Y5 (x-axis): a steeper linear rise through Y3, then a shallower linear rise through Y5.]
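The coefficient sets just listed translate directly into lavaan syntax. A minimal sketch, assuming variables Y1–Y5 in a data frame dat:

```r
library(lavaan)

pw.model <- '
  i  =~ 1*Y1 + 1*Y2 + 1*Y3 + 1*Y4 + 1*Y5
  s1 =~ 0*Y1 + 1*Y2 + 2*Y3 + 2*Y4 + 2*Y5   # linear growth up to the knot at Y3
  s2 =~ 0*Y1 + 0*Y2 + 0*Y3 + 1*Y4 + 2*Y5   # additional linear change after Y3
'
fit.pw <- growth(pw.model, data = dat)
```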


Such a design allows the study of a target age range in a shorter period of time than would be possible in single-cohort longitudinal designs.

Bollen and Curran (2004) described a kind of hybrid between a latent growth model and a panel model for repeated measures of the kind analyzed in path analysis. It is called an autoregressive latent trajectory (ALT) model, and such models feature the specification of autoregressive (direct and indirect) effects between repeated indicators of latent growth factors. It is also possible to specify cross-lagged causal effects when data from multiple sets of repeated measures are analyzed (i.e., multivariate change is studied). In contrast, indicators in the first- or second-order latent growth models described in this chapter have no causal effects on one another (e.g., Figure 21.5). A standard univariate latent growth model can be described as a restricted case of an ALT model where all autoregressive parameters for direct effects between indicators are constrained to zero, but simple panel models for observed variables with autoregressive effects are not nested under ALT models. A potential drawback is that integration of the autoregressive and latent growth curve parts of an ALT model rests on the strong assumption that neither part is misspecified, but curvilinearity in the latent growth component can violate this requirement (Voelkle, 2008). Ou et al. (2017) reviewed other criticisms of ALT models, including how flexibility in their specification can lead to contradictory predicted growth trajectories, and these authors suggested ways to test targeted hypotheses about the functional forms of change based on computer simulation studies with ALT models.

There are also many kinds of latent variable models for longitudinal data besides latent growth models. To mention just a few, latent difference score models are extensions of simple difference score models. They offer a flexible approach to the study of change between specific points in time, and they can be extended to the analysis of multiple outcomes each measured at multiple occasions—see McArdle (2009) for examples. There are latent state–trait models for analyzing stable aspects of constructs over time. In such models, the stable variance that does not change over time, or trait variance, is estimated apart from systematic variance that changes over time, or state variance. Any nonsystematic variance is treated as error variance. The idea is to distinguish fundamental trends or enduring patterns from more temporary fluctuations (Newsom, 2015). There are also ways to model person-specific effects by using a particular measurement method (i.e., common method variance) to assess a target concept—see Steyer et al. (2023) for more information.

There are even more kinds of statistical models for longitudinal data than mentioned to this point, and this superabundance of choices presents a challenge to the researcher: Complex statistical models for longitudinal data may fit a wide range of data just as a complex longitudinal data set can be consistent with a wide range of models. Analysis is not the way to make a choice; that is, attempting to find the "best" model by indiscriminately fitting alternative models is folly. Wisdom is the only meaningful alternative, a point made many times in this book: Substantive knowledge of theory, design, measures, and prior empirical results is the only way forward. We have come as far in this book as I can take readers in the analysis of longitudinal data. After the chapter summary, I leave you in the capable hands of the authors of the works listed in the "Learn More" section.

SUMMARY

A latent growth model for longitudinal data is basically an SR model with a mean structure. In a typical first-order growth model, each repeated measures variable is specified as an indicator of at least two latent growth factors. One factor, Intercept, represents the origin or initial level, and coefficients are fixed to equal 1.0 for all repeated measures. In nonlinear curve fitting, the other factor, Shape, estimates the shape of the growth trajectory, whether linear or curvilinear. In a polynomial model, there is a separate growth factor for each individual trend in growth, beginning with a linear change factor, followed next by a quadratic factor, and so on. The intercept is defined by the location of the "0" (zero) among the coefficients for a shape or slope factor. Latent growth factors are often assumed to covary, which allows for the possibility that origin or starting level is related to the rate of change. There are many variations on latent growth models, and such models can be analyzed in techniques other than SEM. But the SEM approach to growth modeling offers advantages, including the possibility of representing latent trait variables as predictors of growth trajectories. The next chapter deals with the topic of measurement invariance in the SEM technique of CFA.


LEARN MORE

Grimm et al. (2017) cover a wide range of latent growth models from both SEM and multilevel modeling perspectives. Little (2013) and Newsom (2015) describe latent growth models and other kinds of structural equation models for longitudinal data.

Grimm, K. J., Ram, N., & Estabrook, R. (2017). Growth modeling: Structural equation and multilevel modeling approaches. Guilford Press.

Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.

Newsom, J. (2015). Longitudinal structural equation modeling: A comprehensive introduction. Routledge.

EXERCISES

1. Compute dfM for Figures 21.2(a), 21.2(b), and 21.2(c).

2. Compute dfM for Figure 21.3.

3. Fit Figure 21.2(c) to the data in Table 21.1. Compare the results with those in Tables 21.3 and 21.4 for Figure 21.2(b) fitted to the same data.

4. Fit Figure 21.3 to the data in Table 21.1. Explain why the solution is inadmissible.

5. Write an expression for the predicted mean of Intercept in Figure 21.4.

6. Write an expression for the predicted mean of A4 in Figure 21.4.

7. In Table 21.8, interpret the unstandardized coefficient for gender as a predictor of Intercept.


Appendix 21.A

Unequal Measurement Intervals and Options for Defining the Intercept

It is not particularly difficult to adjust the coefficients for a linear growth factor to reflect unequal measurement intervals. Suppose that young children are assessed at 3, 6, 12, and 24 months of age. Specifying the coefficients of a linear factor as

(0, 1.0, 3.0, 7.0)

defines 1 unit of time, or the interval between the first and second measurements, as 3 months. The two coefficients for the third and fourth measurements, 3.0 and 7.0, respectively, preserve the relative differences in ages over time while specifying that any change is strictly linear. For example, the third assessment at age 12 months is taken 9 months after the initial measurement, and 9/3 = 3.0, which is 3 time units; thus, the coefficient for the third assessment is 3.0.

Sometimes means are expected to decline, not increase, over time. One option, while still defining the intercept as corresponding to the first measurement occasion, is to scale time in a negative direction. For example, Murphy et al. (2002) measured levels of distress among parents at 4, 12, 24, and 60 months after the violent death of a child. Because distress is expected to decline over time, the growth trajectory should be negative. In latent growth models analyzed by Murphy et al. (2002), the coefficient for the linear factor at 4 months was fixed to equal zero, and the loading for the 12-month bereavement assessment (conducted 8 months later) was fixed to –1.0. Because the period of 8 months equals –1 in time units, the loading for the 24-month bereavement assessment—which took place 20 months after the initial assessment—was fixed to –20/8, or –2.5. By the same logic, the coefficient for the 60-month assessment was fixed to –7.0 because it took place 56 months after the initial measurement, and –56/8 = –7.0. The set of coefficients for the linear factor analyzed by Murphy et al. (2002, p. 431) is thus

(0, –1.0, –2.5, –7.0)    (21.9)

Given Equation 21.9, the linear factor mean will be positive if the observed means actually decline. Its value indicates the average rate of linear decline over the assessments, which corresponds to the negative signs in Equation 21.9. The actual linear factor mean estimated by Murphy et al. (2002) is .041, which is the average rate of decrease in distress. The estimated correlation between the intercept and linear factors is also positive, or .28, which says that higher initial distress predicts greater decrease in distress over time.

You should know there is no requirement to negatively scale time. For example, defining the linear factor in Murphy et al. (2002) with the coefficients listed next

(0, 1.0, 2.5, 7.0)    (21.10)

results in an equivalent model, but the signs of both the linear factor mean and its correlation with the intercept factor are reversed. That is, the linear factor mean is –.041, and its correlation with the intercept factor is –.28, when Equation 21.10 scales the linear factor. Interpretation of the "new" linear factor mean is pretty straightforward: The average rate of increase in distress is –.041, which also says that the average rate of decline in distress is .041.
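The two scalings can be written in lavaan as follows; the distress variable names D1–D4 are assumed for illustration.

```r
# two equivalent time scalings for the Murphy et al. (2002) schedule
neg.time <- '
  i =~ 1*D1 + 1*D2 + 1*D3 + 1*D4
  s =~ 0*D1 + (-1)*D2 + (-2.5)*D3 + (-7)*D4   # Equation 21.9
'
pos.time <- '
  i =~ 1*D1 + 1*D2 + 1*D3 + 1*D4
  s =~ 0*D1 + 1*D2 + 2.5*D3 + 7*D4            # Equation 21.10
'
# fitted with growth(), both models have identical fit; only the signs of
# the linear factor mean and the intercept-linear correlation differ
```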
Interpreting a negative correlation between intercept and linear factors is trickier when the linear mean is negative, given Equation 21.10 as scaling the linear factor. Here, the negative correlation of –.28 is interpreted as the magnitude of a decreasing trend (Little, 2013). This means that parents higher in initial distress will decrease more rapidly over time, which can also be described as increasing less rapidly in distress with time (Newsom, 2015). See Little (2013, pp. 259–260) and Newsom (2015, pp. 175–176) for tutorials about interpreting positive versus negative correlations between intercept and linear factors, given positive or negative overall growth trajectories.

Little (2013) described how the use of items with closed-ended Likert response scales as indicators in latent growth models can generate negative correlations between intercept and slope factors. This can occur because selecting an extreme response category on a particular occasion, such as "poor" on the 3-point Likert scale "good," "neutral," or "poor" for a question about spousal relationship quality, gives no possibility to change to an even more extreme level at the next assessment.


Also, the response format just described allows no direct comparison with the level of relationship quality at the previous assessments. Little (2013) offered suggestions for developing time-sensitive measures with questions about how much change has occurred since the previous follow-up.

The intercept is defined as corresponding to the first measurement in many, if not most, latent growth models, but there is no requirement to do so. Indeed, there are occasions when it makes sense to scale the intercept in a different way. Here are some examples: Take a quick glance back at Figure 21.2(b), which was fitted to the data in Table 21.1 for a sample of 150 adolescents tested by Kim-Spoon et al. (2021). Within a subset of 126 of these adolescents, the same authors administered at age 18 years (i.e., time 5) a retrospective measure of childhood neglect or abuse by caregivers, including biological or adoptive parents. They analyzed latent growth models where abuse and neglect were specified as time-invariant predictors of change in neural risk processing over ages 14–17 years. Because the childhood abuse or neglect covariates were measured just after the last assessment of risk processing at age 17 years, the shape factor was defined with the basis coefficients

(–1.0, –λ2, –λ3, 0)

where λ4 = 0 for neural risk processing at age 17 years defines the intercept as the origin at this point. The rationale is that the abuse and neglect covariates were administered at the end of the study, not at the beginning, or at 14 years of age.

There are also situations where it makes sense to define the intercept factor as corresponding to a point in time outside the range of actual measurement occasions. For example, if data are collected at ages 6, 12, 18, and 24 months, then specifying the coefficients for the linear factor as

(6.0, 12.0, 18.0, 24.0)

implies that birth (i.e., 0 months) is the initial level, which could be a plausible specification in a longitudinal study of child development (Little, 2013).

Another example is when coefficients for linear, quadratic, or higher-order polynomial growth factors are centered, which means that their coefficients sum to zero for each trend. For example, when using the ANOVA technique for repeated measures, polynomial trends are specified as contrast codes that are centered and where relative values of codes specify the basic shape of a growth trajectory as deviations from the grand mean, or the average score over all measurements. Suppose for the polynomial model in Figure 21.3 that the coefficients for the linear and quadratic factors were specified as, respectively,

(–3.0, –1.0, 1.0, 3.0)
(1.0, –1.0, –1.0, 1.0)

which are the standard ANOVA orthogonal polynomial contrasts for four repeated measures. The coefficients just listed define the intercept (origin) as falling exactly halfway between the second and third occasions, or at a point in time that corresponds to no actual measurement. Thus, the linear and quadratic trends are scaled as deviations from the grand mean. Model fit is unaffected by the particular scaling for polynomial growth factors; only parameter interpretation changes—see also Newsom (2015, pp. 223–228). Little (2013) reminded us there are infinite equivalent growth models because there are an infinite number of sets of coefficients for scaling the growth factors, but the choice in a particular study should reflect substantive considerations.



22

Measurement Invariance

Measurement invariance (MI) involves determination of whether scores from indicators of theoretical variables have the same meaning or interpretation—that is, whether concepts are comparably measured—under different conditions (Meade & Lautenschlager, 2004). Another description is that probabilities of observed scores after controlling for factors presumed to underlie those scores do not vary over conditions (Mellenbergh, 1989). A third description is that MI exists when relations between the observed variables and proxies for theoretical variables are the same over conditions. The absence of this property means that real differences between persons cannot be unambiguously isolated from differences owing to conditions under which their scores were obtained.

The major conditions of interest in MI studies are listed next:

1. Time of measurement, which involves longitudinal measurement invariance, or whether repeatedly measured variables represent the same concept in the same metric over time. The absence of longitudinal invariance means there is little, if any, meaningful basis to compare the same group over different times (Liu et al., 2017).

2. Group membership or other nominal variables, such as ethnicity, race, gender, or geographic region, which concerns whether concepts are measured the same way over different populations. Measurement invariance over groups is related to the concept of test bias, which refers to overestimation or underestimation of the target domain owing to systematic-but-concept-irrelevant sources of variation as a function of group membership. That is, people who are equal on the target construct obtain unequal scores due to their genders, ethnicities, cultures, and so on, which should be irrelevant, given the concept's definition. Construct bias as a form of test bias is of particular interest in MI studies. For example, if the same indicators measure different things for men and women when gender differences should be immaterial, then construct bias is indicated (Reynolds & Suzuki, 2013).

Note that MI does not imply that groups are equal in the amount (level) of a target concept; it means only that persons who are equal on the concept should have the same observed scores regardless of group membership. Different groups may truly differ in their average standing on a theoretical variable, which Drasgow and Hulin (1990) referred to as impact. Measurement invariance and impact are different phenomena, and one does not imply the presence or absence of the other; that is, there can be any combination of impact and MI. So you can think of MI in this context as the potential to estimate true group differences on underlying variables of interest without the confounding effects of measurement artifacts (i.e., concept-irrelevant differences).

3. Cross-national or cross-cultural comparisons, where groups are, respectively, defined more narrowly by membership in a nation–state (i.e., country) versus more broadly by shared values, norms, or expectations that can transcend national boundaries, also require that
This requirement explains the increasing number of studies published over the last 20 years or so about the cross-national or cross-cultural comparability of scales presumed to measure human values, generalized trust, or attitudes toward democracy, among other concepts (Cieciuch et al., 2019; Malhotra et al., 2018).

4. Translations of a test to a different language, which can also be an aspect of cross-national or cross-cultural research, if two countries or cultures do not share a common language. Ideally, the original and translated versions of a test should measure the same things. A challenge is that a single translation of a test is unlikely to have the same measurement properties as the original version. One reason is that there can be wide differences among translations from the same source material over different translators. Another is that translation focused solely on preserving linguistic meaning over original and translated versions may overlook cultural differences in the expressions of a particular concept, including the relevance of a concept in a particular culture (van Widenfelt et al., 2005).

5. Test administration methods, which concern whether a measurement instrument functions the same way over different modes of giving the test, such as over the Internet with computers versus in person with paper-and-pencil administration. There is evidence that test interpretation can be affected by administration mode, which is a kind of method effect (Chapter 14; see also Whitaker & McKinney, 2007). Increasing numbers of researchers use online respondent pools, such as Clickworker and Prolific, among others, to recruit participants, a trend only accelerated by the COVID-19 pandemic. In addition to concerns about the honesty of online survey respondents (Teitcher et al., 2015), establishing MI over administration methods is crucial.

The conditions just listed can be combined in a single study, such as in investigations about whether repeated measures assess the same thing over time for both men and women (Tan et al., 2021). Although MI is usually described across nominal (unordered) groups, such as men and women or measurement occasions, in theory it is also possible to evaluate invariance across ordinal or continuous variables, such as age, that do not lend themselves to easy categorization (Putnick & Bornstein, 2016). There are special methods, such as moderated nonlinear factor analysis, that can be applied in this context, but they are not yet widely used—see Molenaar et al. (2010) for more information.

Lack of equivalence over conditions for a set of measures is called differential functioning. If the indicators are items instead of scales (i.e., total scores over sets of items), lack of equivalence is called differential item functioning. The focus of this chapter is on the evaluation of MI over groups from different populations tested on a single occasion with the technique of multiple-group CFA. Invariance models tested in multiple-group CFA have both mean and covariance structures and, thus, are referred to as MACS (mean and covariance structure) models. The basic logic of analyzing a MACS model over groups, such as women and men, generalizes to invariance testing over test administration modes, nations, cultures, languages, or other comparisons over independent samples. But there are special considerations when testing MACS models for longitudinal MI. For example, if a single group is measured at two or more occasions, the analysis is not a multiple-group CFA because there is just a single group. Instead, parameter estimates are compared over time for the same group—see Little (2013, chap. 5) and Newsom (2015, chap. 2) for more information and examples.

You should know that the technique of exploratory factor analysis (EFA) can also be used to address questions of MI. Unlike in multiple-group CFA, where the same model is simultaneously fitted to data from ≥ 2 groups, the approach in EFA is single-group. This means that data from the same indicators are analyzed separately in each sample. Next, statistical indices of factor solution similarity over groups are computed. One is the Tucker coefficient of congruence, which measures the amount of shared variation between factors defined by the same indicators in different groups. Values that exceed .90 or so indicate similarity between the corresponding factors over groups (Nimon & Reio, 2011). Another similarity measure is the Pearson correlation of the factor loadings over ≥ 2 groups. There are no formal tests of MI when EFA is applied in separate samples, but EFA is more flexible than CFA when the theory about measurement is not well developed (Chapter 14). Results of computer simulations by Finch and French (2008) suggested that invariant loadings were generally detected by comparing EFA results over samples from two different populations. Henseler et al. (2016) described methods to evaluate MI in composite SEM.
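To make the congruence index just mentioned concrete, here is a small R sketch; the function and the two loading vectors are hypothetical illustrations, not values from any study cited above:

# Tucker's coefficient of congruence for one factor measured by the
# same indicators in two groups (loadings from separate EFA solutions)
congruence <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

loadings.g1 <- c(.72, .68, .75, .61)  # hypothetical group 1 loadings
loadings.g2 <- c(.70, .66, .79, .55)  # hypothetical group 2 loadings
congruence(loadings.g1, loadings.g2)  # about .999, suggesting similarity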


LEVELS OF INVARIANCE

Measurement invariance is not a binary property of scores from a set of indicators. Instead, there are degrees of MI that could be supported by the data, ranging from none to successively higher levels. It is important to estimate the level of MI because interpretation of the scores should be adjusted accordingly. In this discussion, we assume continuous indicators, but special considerations for invariance testing with categorical indicators, such as items with Likert response scales, are addressed later in this chapter.

There are four essential levels of MI with additional or optional levels that can also be evaluated, given the researcher's hypotheses. Unfortunately, different authors do not always use the same labels to refer to the same level or type of MI. To avoid confusion, next I emphasize the simplest versions of these labels, but alternative names are given, too. The basic levels include (1) configural invariance, (2) weak invariance, (3) strong invariance, and (4) strict invariance. These types represent increasingly restrictive hypotheses about MI, and each successive hypothesis requires more supporting evidence than the preceding hypothesis.

Invariance testing also requires complete and transparent reporting of the evidence used to justify retention of a particular invariance hypothesis, if any. Unfortunately, there are serious problems in the MI literature with reporting of the results. Some of these deficiencies are common in the broader SEM literature, including the failure to describe local fit (i.e., the residuals) or deciding whether to retain or reject hypotheses based solely on fixed thresholds for approximate fit indexes while simultaneously ignoring the results of the model chi-square test (Chapter 10; see also Hayduk, 2016).

Configural Invariance

Configural invariance is the least restrictive level; thus, it is the minimum hypothesis that must be retained to establish MI.1 It is tested by specifying the same CFA model in each group, such as men and women. In this model, (1) both the number of common factors and the correspondence between factors and indicators are identical over groups, but (2) all model parameters are freely estimated in each group. That is, no cross-group equality constraints are imposed on any model parameter, including loadings, intercepts, factor variances or covariances, factor means, and indicator error variances or covariances.

If the configural invariance hypothesis is rejected, then MI does not hold at any basic level.2 Another interpretation is that the common factor model specified by the researcher is untenable in at least one of the groups (Millsap, 2011). Retaining the configural invariance model says only that the correspondence between indicators and factors—that is, the organization of concepts as defined by the pattern of free versus fixed loadings—is the same over groups (Putnick & Bornstein, 2016). Because loadings, intercepts, and factor variances and means are all free parameters, however, there is little basis for directly comparing the groups at this stage. Whether groups are similar on these parameters will be determined at later analysis steps, but only if the configural invariance model is retained.

Because retention of the configural invariance model gives the green light to testing even more restrictive hypotheses about MI, its fit to the data should be thoroughly described, including both global fit and local fit. As mentioned, though, many—and perhaps most—reports of MI analyses in the literature fail to provide this level of detail. In my experience, it is relatively rare to find any comment about the residuals in published MI studies. Almost as infrequent is making the data available to readers either in summary form—such as correlations, standard deviations, and means—for continuous variables or a raw data file for categorical variables. Thus, too many authors of MI studies fail to reassure readers that (1) retained invariance models actually fit the data at the level of the residuals and (2) the results can be reproduced by others when analyzing the same data.

1 Millsap (2011, pp. 102–103) described a prior step as the formal comparison of the covariance and mean matrices over groups. This test is not often seen in practice, and rejection of the equality null hypothesis is not informative about any particular source of invariance.

2 Dimensional invariance requires only that indicators depend on the same number of factors across groups, but does not require the same pattern of factor–indicator correspondence. The latter aspect of dimensional invariance is incompatible with the concept of configural invariance; specifically, dimensional invariance does not provide evidence that quantitative group comparisons are tenable (Gregorich, 2006).
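In lavaan, analyses like those described in this chapter are specified with the cfa() function and its group argument. What follows is a minimal sketch of a configural model; the model syntax, the data frame dat, and the grouping variable country are hypothetical placeholders, not a real data set:

library(lavaan)

# hypothetical measurement model: two factors, three indicators each
model <- '
  F1 =~ X1 + X2 + X3
  F2 =~ X4 + X5 + X6
'
# configural invariance: same pattern in every group, all parameters free
fit.config <- cfa(model, data = dat, group = "country",
                  meanstructure = TRUE)
summary(fit.config, fit.measures = TRUE)
lavResiduals(fit.config)  # local fit (residuals) is reported per group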


Weak Invariance

The hypothesis of weak invariance—also called metric invariance or pattern invariance—assumes configural invariance. It also requires equality of the unstandardized factor loadings. This hypothesis is tested by (1) imposing an equality constraint over groups on the unstandardized coefficient of each indicator.3 In contrast, the intercepts and error variances of the indicators are free parameters (i.e., they are allowed to vary over groups). So specified, the weak invariance model is nested under the configural invariance model. Next, (2) compare the relative fit of the configural invariance and weak invariance models. If the fit of the weak invariance model is not appreciably worse than that of the configural invariance model, the more restrictive weak invariance hypothesis might be retained, if local fit is also adequate. This outcome means that the unstandardized slopes from regressing the indicators on their respective factors are similar across groups. But if the intercepts are quite different over groups, neither factor scores nor factor means will provide a meaningful basis for directly comparing the groups, given just equality in loadings.

Little et al. (2007) noted that unless variances for each factor are equal over groups, equally constrained loadings are only proportionally, not absolutely, equivalent; that is, the loadings are weighted by group differences in factor variances. Whether factor variances or covariances are equal over groups can be tested at a later—and optional—step in invariance testing, and no cross-group equality constraints on these parameters are required for weak invariance. Gregorich (2006) noted that retaining the weak invariance hypothesis justifies the formal comparison (i.e., with a significance test) of estimated factor variances or covariances over groups. This is because common variance from each indicator is allocated to the corresponding factor in the same way over groups. Any group differences in error variances for the indicators cannot confound group differences in common factor variation. But because the indicators are affected both by factors and sources of unique (residual) variation, weak invariance by itself does not support the formal comparison over groups of observed variances or covariances.

Gregorich (2006) offered two alternative explanations for rejecting the weak invariance hypothesis. One possibility is that the concepts—or at least the scores from a subset of indicators that correspond to those factors—have different meanings over groups. For example, Ryder et al. (2008) reported that the experience of depression is not constant over Chinese and North American samples. Specifically, physical symptoms (e.g., sleep disturbance) are emphasized relatively more by Chinese respondents while psychological symptoms (e.g., demoralization) are reported more often by North American respondents. Cultural factors, such as differences in perceived stigma for disclosing psychological distress, demand characteristics of patient roles in medical settings, or capabilities to identify and describe one's emotions might explain some of Ryder et al.'s (2008) findings. If so, then depression is not a universal phenomenon that exists apart from the influence of culture.

A second possibility for lack of metric invariance in self-report measures is response styles, or systematic individual differences in responding to questionnaire items that have little to do with the item's content and, by extension, the target concept. The two response styles described next can affect response variability, and thus the statistical association between indicators and their factors over groups that differ in these styles (Gregorich, 2006). One is extreme response style (ERS), where the most extreme options (e.g., never, always) are favored. The ERS style may be found in populations or cultures where decisiveness or firmness is encouraged. Another is midpoint response style (MRS), which is the tendency to avoid endorsing the most extreme response categories in favor of middling options (e.g., sometimes). It may be found in populations that emphasize modesty or humility instead of boldness.

Wetzel et al. (2016) described evidence that ERS, MRS, and acquiescence response style—defined in the next section—are substantially enduring characteristics over an 8-year period within a longitudinal sample tested every 2 years since high school. In computer simulations, Liu et al. (2017) found that ERS and MRS can have appreciably negative effects on both model fit and parameter estimation when using multiple-group CFA to evaluate MI. Specifically, lack of evidence for MI can be caused by group differences in these response styles, not true differences in measurement models for target concepts. Liu et al. (2017) concluded that response styles are a real threat in MI studies and offered recommendations for detecting their effects.

3 This statement assumes that each indicator depends on a single factor. Indicators with loadings on multiple factors are allowed, but only if the exact same patterns are part of the configural invariance model for all groups.
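Continuing the hypothetical lavaan sketch from the previous section, the weak invariance model adds cross-group equality constraints on the loadings, and its fit is compared with that of the configural model:

# weak (metric) invariance: unstandardized loadings equal over groups
fit.weak <- cfa(model, data = dat, group = "country",
                meanstructure = TRUE, group.equal = "loadings")
lavTestLRT(fit.config, fit.weak)  # chi-square difference test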


Strong Invariance

Strong invariance, also called scalar invariance, assumes weak invariance. It also requires equal unstandardized intercepts for each indicator over groups. The intercept estimates the score on an indicator, given a true score of zero on the corresponding factor. Equality of intercepts says that different groups use the response scale of that indicator in the same way; that is, a person from one group and a person from a different group with the same level on the factor should obtain the same score on the indicator. The strong invariance hypothesis is supported if the fit of the model with equality-constrained unstandardized loadings and intercepts is not appreciably worse than that of the model with equality-constrained loadings only (i.e., weak invariance).

Strong invariance guarantees that (1) group differences in estimated factor means will be unbiased. Also, (2) group differences in indicator means or estimated factor scores will be directly related to the factor means and will not be confounded by a systematic-but-concept-irrelevant source of individual differences. That is, the factors have a common meaning over groups, and any constant effects on the indicators are canceled out when observed means are compared over groups (Gregorich, 2006). Thus, strong invariance is the minimal level for meaningful interpretation of group mean contrasts. Some significance tests for mean contrasts, such as the standard t test, assume equal population variances. This assumption can be directly tested in CFA by imposing equality constraints on factor variances after retaining a strong invariance model. If the model with equality-constrained factor variances just described is retained, there is evidence that the homoscedasticity assumption is tenable; otherwise, rejection of that assumption would rule out the use of the standard t test, but there are other versions of this test that do not assume equal variances.

Rejection of strong invariance for self-report measures could suggest the presence of a differential additive response bias, which systematically raises or lowers observed scores in a particular group apart from respondents' true levels on the factor. Two examples are an acquiescence response style (ARS), or the disposition to agree with all items, and a disacquiescence response style (DRS), or the tendency to disagree with all items, in both cases regardless of item content. Unlike ERS or MRS patterns of responding, which affect variances, the ARS and DRS patterns shift observed group means higher or lower, but these changes are not due to true differences in target concepts. Possible sources of differential additive response bias include cultural values about agreement versus disagreement per se or in acknowledging health or adjustment problems.

Cohort effects in longitudinal studies or procedural differences in data collection are other possible sources of differential additive response bias. An example is when patients are weighed in their street clothes in one clinic but wearing examination gowns in a different clinic (Gregorich, 2006). In this case, the constant added to true body weight depends on where the patients were tested. Thus, observed mean differences between clinics do not accurately reflect true differences in patient weight, and this bias is due to procedural differences that result in differential additive bias.

Strict Invariance

The highest—and most demanding—level of MI is strict invariance, also called residual invariance, error variance homogeneity, or invariant uniqueness. It assumes strong invariance, or equality in loadings and intercepts, plus equality in error variances and covariances (if any) for the indicators over groups. The strict invariance hypothesis implies that the indicators measure the same factors in each group with the same degree of precision. It is tested by (1) imposing cross-group equality constraints on the error variances and covariances for all indicators, and (2) comparing the relative fit of the model so constrained to that of the strong invariance model, where residual variances and covariances are freely estimated in all groups.
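In the same hypothetical lavaan sketch, the strong and strict models just add entries to group.equal:

# strong (scalar) invariance: equal loadings and intercepts
fit.strong <- cfa(model, data = dat, group = "country",
                  meanstructure = TRUE,
                  group.equal = c("loadings", "intercepts"))
# strict invariance: equal indicator error variances, too
fit.strict <- cfa(model, data = dat, group = "country",
                  meanstructure = TRUE,
                  group.equal = c("loadings", "intercepts", "residuals"))
lavTestLRT(fit.strong, fit.strict)  # test of the added constraints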


DeShon (2004), Wu et al. (2007), and others have argued that strict invariance is required in order to claim that the theoretical variables are measured identically across groups. This is because unmodeled systematic effects on observed scores can be confounded with differences in loadings or intercepts, which can result in specification error. Little (2013) pointed out that because unique indicator variance reflects both random error and specific variance that is systematic, the hypothesis of strict invariance actually means that the sum of these two components is exactly equal for each indicator over groups. But if these two sources of unique indicator variance are only approximately equal, then imposing equality constraints on the residuals can introduce bias. Specifically, unless specific indicator variance is precisely equal across the groups, the equality constraints will propagate random error across all model parameters. Although it may be reasonable to assume that specific variation is invariant over groups, the expectation that the random error component would also be invariant may be less justifiable. Thus, forcing these two sources of unique variance to be equal may be implausible in perhaps most cases.

Recall that only strong invariance is required to meaningfully compare factor means over groups, so holding out for the even more demanding hypothesis of strict invariance brings little to the table regarding the comparability of the groups on proxies for theoretical concepts. An exception is when the researcher wishes to formally test group differences in observed variances, such as for indicators or composites based on the indicators. In this case, evidence for strict invariance should be obtained (Gregorich, 2006). Also, I suspect that strict invariance would be observed in relatively few applied studies. For these reasons, not all authors of MI studies report on tests of strict invariance.

ANALYSIS DECISIONS

Measurement invariance analyses are complex with multiple steps and decision points about model identification (e.g., how to scale the common factors), the sequence of model testing (e.g., trimming vs. building), statistical criteria for deciding whether to retain or reject models (e.g., the role of test statistics, approximate fit indexes, and residuals), and whether to include covariates in the model (Hayduk, 2016). In an ideal world, decisions in the areas just mentioned would have little, if any, bearing on the results—that is, the truth will win out regardless of analysis strategy—but that hope is pretty dim in the real world. This is because invariance testing often involves the estimation of a series of nested models, and decisions made about models tested earlier can affect results found for models tested later. Sometimes these decisions can have unintended consequences. For example, what seems like a small respecification in an earlier model can greatly affect the choice of the final model at the end of the analysis. This is the reason why Millsap and Olivera-Aguilar (2012) noted that the effective use of computer tools for invariance testing relies heavily on the experience and judgment of the researcher. Preregistering an analysis plan based on substantive considerations would address the concern that decisions were aimed mainly at finding a model that fits the data, or model hacking.

Free versus Constrained Baselines (Trimming vs. Building)

The hierarchy of MI hypotheses corresponds to a model trimming strategy in which an initial unconstrained model (configural invariance) is gradually restricted by adding cross-group equality constraints in a sequence that corresponds to weak invariance, strong invariance, and then strict invariance (if tested). Stark et al. (2006) referred to this strategy as the free baseline approach. Failure to retain the invariance hypothesis at a particular step means that even more restricted models are not considered.

It is also possible to test for MI through model building where constraints on an initially restricted model, such as one represented by the strict invariance hypothesis (equal loadings, intercepts, and errors), are gradually released (e.g., test strong invariance by allowing error variances to be freely estimated in each group). This method is the constrained baseline approach (Stark et al., 2006). The problem with this method is that it may not be clear which particular set of cross-group equality constraints—those for loadings, intercepts, or error variances—should be released, if the fully constrained model is rejected. If theory is not specific, the choice may be arbitrary. Ideally, model trimming versus model building for the same data would each select the same model, but this is not guaranteed.

Statistical Decision Criteria

Researchers in early MI studies in SEM tended to rely mainly on the chi-square difference test to determine whether to retain or reject the more constrained of two nested invariance models (Byrne et al., 1989). Specifically, a significant difference would signal rejection of the more restricted model in favor of the less constrained model, assuming the fit of the less constrained model to the data is satisfactory. Later developments in testing practices for MI featured the shift to using approximate fit indexes, such as the RMSEA and CFI, to supplement or even replace the chi-square test in model selection, just as in the wider SEM literature (Chapter 10). This includes the development of fixed thresholds for changes in values of approximate fit indexes that supposedly demarcate the limits between trivial versus more substantial differences in fit between nested invariance models. But it is doubtful whether such fixed thresholds actually work, and overreliance on them can lead to less than optimal decisions in the analysis.
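For example, changes in approximate fit indexes over the nested sequence can be tabulated from the hypothetical fits sketched earlier:

# CFI and RMSEA for the nested sequence of invariance models
fits <- sapply(list(configural = fit.config, weak = fit.weak,
                    strong = fit.strong, strict = fit.strict),
               fitMeasures, fit.measures = c("cfi", "rmsea"))
round(fits, 3)
round(diff(t(fits)), 3)  # change in each index at each step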


Cheung and Rensvold (2002) found in computer simulation studies assuming known population invariance models with continuous indicators that values of the CFI were relatively unaffected by model characteristics such as the number of indicators per factor. They suggested that changes in the CFI values less than or equal to .01, or ΔCFI ≤ .01, indicate that the more restricted invariance hypothesis should not be rejected. Meade et al. (2008) recommended a different threshold for ΔCFI in generated data with varying levels of MI including different factor structures (no invariance), different loadings, or different intercepts across two groups. In very large samples, such as 6,000 cases per group, the chi-square difference test indicated lack of invariance most of the time when there were slight group differences in model parameters. In contrast, the CFI was generally less affected by group size and by the number of indicators per factor. Meade et al. (2008) suggested that changes in CFI values less than or equal to .002, or ΔCFI ≤ .002, may indicate deviations from MI that are functionally trivial when group sizes are very large.

Other approximate fit indexes, such as the RMSEA and SRMR, have also been studied in computer simulations of MI analyses, but no single gold standard threshold that works across varying sample sizes, model configurations, or indicator level of measurement (i.e., continuous or categorical) has been described for any approximate fit index, including the CFI. Chen (2007) recommended conditional thresholds for ΔCFI and ΔRMSEA based on computer simulations where group sizes were equal versus unequal or the pattern of invariance was uniform versus mixed. A uniform invariance pattern affects equally the loadings, intercepts, and error variances, but a mixed pattern affects some, but not all, of the types of parameters just mentioned. For example, in cases where the group size is both small (n < 300) and unequal and the invariance pattern is uniform, Chen (2007) recommended the cutting points ΔCFI ≤ .005 and ΔRMSEA ≤ .010 for retaining the null hypothesis of invariance. But Chen (2007) suggested more stringent criteria were needed when the group size is larger (n > 300) and equal and the invariance pattern is mixed instead of uniform: ΔCFI ≤ .010 and ΔRMSEA ≤ .015.

Based on computer simulations with analyses of data from larger numbers of groups, such as 10–20, Rutkowski and Svetina (2014) reported that the thresholds ΔCFI ≤ .010 and ΔRMSEA ≤ .010 were generally adequate when testing for scalar invariance, but a different set of cutting points, or ΔCFI ≤ .020 and ΔRMSEA ≤ .030, worked better for tests of metric invariance. Sass et al. (2014) cautioned that thresholds based on changes in approximate fit indexes for continuous data do not perform well with misspecified models for ordinal data. Remember that Kenny et al. (2015) cautioned that the RMSEA is not very accurate for models with relatively few degrees of freedom, especially when the sample size is not large.

More recently, Yuan and Chan (2016) described the application of equivalence testing, which allows the researcher to specify the maximum degree of misspecification allowable before going on to test even more restricted models (Appendix 10.A), to MI testing. The degree of misspecification can be expressed either in the form of a model chi-square statistic or the RMSEA. In computer simulations, they compared conventional fixed thresholds for the RMSEA, such as .01, .05, .08, and .10 for, respectively, excellent, close, fair, and poor fit, to dynamic thresholds for the RMSEA for specifying maximum degrees of misspecification in equivalence testing. The dynamic thresholds outperformed the fixed thresholds, but whether these dynamic thresholds (Yuan & Chan, 2016, p. 420) generalize beyond computer simulation conditions (e.g., normal distributions) is unknown.

Given the results just summarized and others reviewed by Putnick and Bornstein (2016), I believe there is little hope of ever finding a set of decision rules based on changes in approximate fit indexes that would universally apply over all types of MI studies. Also, it is poor practice in any type of SEM analysis to ignore both the residuals and results of the chi-square test, especially if group sizes are not large. In these ways, I concur with Hayduk (2016) that there are entrenched and widespread deficiencies in reporting results from MI analyses. These problems also suggest that the MI research literature could have many examples of retained invariance models that do not actually fit data when evaluated from the perspectives of both global fit and local fit.

Scaling and Parameterization

Common factors in CFA models with mean structures (i.e., MACS models) require scales for both their variances and means; otherwise, the model is not identified. The same basic options for scaling factors in single-group CFA analyses—the reference (marker) variable method, variance standardization method, and effects coding method (Chapter 13)—are available in multiple-group CFA analyses.


Selection of one method versus another makes no difference in global model fit, but there are special considerations in multiple-group CFA that are outlined next (Little et al., 2006, 2007):

1. In the reference variable method, the unstandardized loading of one indicator per factor is fixed to 1.0, and its intercept is fixed to 0. The same indicator should be selected as the reference variable in each group. This method scales each factor in the metric of the explained (common) variance of the corresponding reference variable and identifies the mean structure. Loadings and intercepts for all other indicators are free parameters as are the variances, covariances, and means of the factors in all groups. An advantage of this method is that there is no need to explicitly free the factor means or variances in any group later in the analysis when cross-group equality constraints are imposed. Another advantage is that iterative estimation generally converges with no need to specify starting values for model parameters.

A drawback of the reference variable method is that estimates of factor variances, covariances, and means depend on the specific indicators selected as the reference variables. That choice is sometimes arbitrary, such as when scores on multiple indicators for the same factor are equally reliable or no indicator is deemed especially representative of the corresponding theoretical variable (Chapter 14).

A bigger concern is that the method requires an invariance assumption: Because the loading and intercept of the reference variable are fixed to 1.0 and 0, respectively, it must be assumed that these coefficients are invariant over groups. This is because fixed coefficients are excluded from tests of MI, so it must be known a priori that the unstandardized regression of the reference variable on its factor is identical over all groups. If the researcher inadvertently selects a reference variable that is not invariant over groups, the results may be distorted. There are empirical methods for selecting invariant reference variables described in the next section, but they are data-driven—and thus subject to capitalization on chance—and generally require large, representative samples; see Thompson et al. (2021) for more information.

2. The variance standardization method—also called the reference group method in multiple-group CFA—involves fixing the variances and means for all factors to, respectively, 1.0 and 0 in just one of the groups, the reference group. In the analysis of a strong invariance model, factor variances and means for all subsequent groups are estimated and scaled relative to the fixed mean (0) and variance (1.0) in the reference group. (In longitudinal studies where a single group is followed over time, the reference point would be the first measurement occasion.) For example, factor means in the subsequent groups are contrasts, or differences from the factor means in the reference group. Each contrast is the weighted average of the differences between the indicators in the reference group and each subsequent group (Little et al., 2007). Factor variances in subsequent groups are each estimated relative to the fixed variances of 1.0 in the reference group.

A complication of this method is that because the factors are standardized in the reference group (i.e., their variances are 1.0), factor associations for this group in the unstandardized solution are scaled in a correlation metric. But factor associations for all subsequent groups are scaled in a covariance metric, which means that direct comparisons of cross-group differences in factor associations are not possible in the unstandardized solution. If the standardized solution in a multiple-sample CFA is derived by standardizing each separate within-group covariance matrix, then estimated factor correlations are also not directly comparable over groups (Chapter 12). Little et al. (2006, 2007) described a method using phantom variables with no indicators to obtain common-metric standardized estimates of factor correlations that are more directly comparable. Another advantage of the phantom variable method is that differences in factor associations over groups can be tested with the standard chi-square difference test.

3. The effects coding method is for situations where all indicators of the same factor are measured on the same scale (Chapter 13). The method does not require (a possibly arbitrary) selection of a single indicator to scale its factor. Instead, all indicators of a factor contribute to its scale, and none of the groups is considered as a reference group. For each factor, the average loading and intercept are fixed to equal, respectively, 1.0 and 0 in every group. So specified, factor variances are estimated as the weighted average of the common variances of their indicators, and factor means are estimated as the weighted average of means on their indicators. Factor covariances and means are freely estimated in all groups. Just as in the reference variable method, no other parameters need to be freed when imposing equality constraints on loadings or intercepts in the effects coding method.


A drawback is that the effects coding method requires the imposition of two linear constraints per factor (loadings, intercepts) in each group. This added complexity can lead to failure of iterative estimation or improper solutions when analyzing larger models. Increasing the default number of iterations or relaxing convergence criteria might solve the problem. Another option is to first analyze the model using the reference variable method, which should converge if there are no syntax errors, and then use the parameter estimates as starting values for a second analysis with the effects coding method—see Little et al. (2007) for additional tips.
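In lavaan, the three scaling methods correspond roughly to the following options. This is a sketch under the same hypothetical setup as before; consult the package documentation for details:

fit.marker <- cfa(model, data = dat, group = "country")  # reference variable
fit.refgrp <- cfa(model, data = dat, group = "country",
                  std.lv = TRUE)           # variance standardization
fit.effect <- cfa(model, data = dat, group = "country",
                  effect.coding = TRUE)    # effects coding

With std.lv = TRUE and equality-constrained loadings, lavaan fixes the factor variances to 1.0 in the first group only and frees them in the other groups, consistent with the reference group logic described above.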
Estimates of factor variances, covariances, and means differ over the three scaling methods just described, and within the reference variable method these results will further vary depending on which indicator is specified as the marker variable for its factor (e.g., Little et al., 2006, p. 65). But all three methods generate identical (within rounding error) estimates of effect size defined as standardized mean differences, which are mean contrasts expressed as the proportion of a standard deviation in the metric used to estimate the group means. Expressed as the parameter δ (lowercase Greek letter delta) for a population standardized mean difference, the effect size in multiple-group CFA is

δ = (μ1 − μ2) / √[(σ1² + σ2²)/2]    (22.1)

where μ and σ² represent, respectively, the factor means and variances in populations 1 and 2. A sample estimator is

d = (μ̂1 − μ̂2) / √[(σ̂1² + σ̂2²)/2]    (22.2)

where factor means and variances are estimated from sample data. Equation 22.2 assumes equality of variances over populations and equal group sizes in samples.
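As a small R helper (the function name is arbitrary), Equation 22.2 can be computed directly from estimated factor means and variances:

# standardized mean contrast on a factor (Equation 22.2)
smd <- function(m1, m2, v1, v2) (m1 - m2) / sqrt((v1 + v2) / 2)
# values from the divergent thinking example later in this chapter:
smd(1.523, 1.106, 2.296, .986)  # about .33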


PARTIAL MEASUREMENT INVARIANCE

Byrne et al. (1989) described partial measurement invariance as an intermediate state of invariance. For example, weak invariance assumes cross-group equality of each unstandardized factor loading. If some, but not all, loadings are invariant, then only partial weak invariance holds. In this case, testing for intercept equality (strong invariance) could still be performed because noninvariant loadings are freely estimated in each group, which controls for these differences.

It is challenging that there are no clear-cut guidelines for determining the degree of partial invariance in all situations that would be acceptable for concluding that the indicators measure roughly the same things over groups. Suppose that a single loading out of a total of 20 is not invariant. There may be little harm in testing for invariance of other kinds of parameters, such as intercepts, given that 19/20, or 95%, of the loadings are equal across groups. But as more and more loadings are found to be unequal over groups, such as > 10 (i.e., the majority out of 20), there should be less confidence that indicators estimate the theoretical variables in the same ways over groups (Vandenberg & Lance, 2000; Steenkamp & Baumgartner, 1998).

In a Monte Carlo study, Steinmetz (2013) found negligible effects of a minority of indicators with unequal loadings—say, one or two out of six in total—on the accuracy of group differences in average factor scores (composites) as estimators of true differences in factor means. But inequality of even one intercept can have a substantial impact on bias; specifically, an unequal intercept can lead to spurious composite differences between groups with equal true factor means. The same inequality can also result in attenuated differences on composites for groups with unequal true factor means. These results suggest that full invariance of the intercepts may be required for correct interpretation of group differences on observed variables; otherwise, group mean differences on the indicators may be confounded with differences in intercepts and factor means.

Putnick and Bornstein (2016) reviewed a total of 126 articles in psychology journals published in 2013–2014 in which tests for at least one MI model were reported. Support for configural and full metric (weak) invariance was reported in about 60% of these analyses, but partial invariance was indicated in just under one-third of all tests, or 32%, so it was not a rare finding. Larger sample size predicted a higher level of strict (error variance) invariance, but level of invariance (i.e., full vs. partial) was generally unrelated to number of groups and model size. Putnick and Bornstein (2016) also noted haphazard reporting practices in reviewed studies, including the failure to report model degrees of freedom, group sizes, or the specific statistical criteria used to evaluate model fit. These kinds of shortcomings in reporting are not unique to MI studies in SEM (Chapter 3).

The best scenario for approaching the rejection of a full invariance model is when the theory or the researcher's knowledge of indicator psychometrics offers guidance about which cross-group equality constraints to release, if the analysis continues to a respecification phase. But without such guidance, an empirical specification search where the researcher looks to the data for hints about how to improve model fit is about the only alternative. You already know that caution is needed at the point where SEM analyses become more purely exploratory, so please consider the options described next while keeping in mind the spectre of capitalization on chance and the need to replicate data-driven analyses in a new sample. We assume that a weak invariance model with equal loadings is rejected but the configural invariance model is retained in a free baseline approach:

1. In the backward MI method, start with the rejected weak invariance model and sequentially release cross-group equality constraints on loadings based on the highest corresponding modification index (MI) until a partial metric invariance model (if any) is found that fits the data. If no such model is retained, then sequentially remove indicators from the weak invariance model in the same order and retest the configural and full metric invariance models (Putnick & Bornstein, 2016). A drawback of this approach is that it can greatly inflate the rate of Type I error (Jung & Yoon, 2016). Also examine the expected parameter change (EPC) for each equality constraint: If the estimated change in the coefficients in both groups is trivial after releasing an equality constraint, there may be little harm in retaining that constraint, even if the corresponding MI is significant.

2. A more formal option is the factor-ratio test (Cheung & Rensvold, 1999), where each indicator serves as the reference variable (its loading is 1.0 for all groups) as one of the remaining indicators is tested for invariance (i.e., its loading is constrained to equality over groups) with the chi-square difference test. Given p indicators for a factor, there are p(p – 1) tests for every combination of a reference variable and another indicator. Indicators flagged as noninvariant are dropped from the model. A drawback is that the test is laborious to use unless automated by the computer (Hammack-Brown et al., 2022).

3. Jung and Yoon (2016) described the forward CI method, which tests models with a single invariant parameter against the baseline configural invariance model (i.e., tests forward) and implements the chi-square difference test with confidence intervals for differences between estimated loadings over groups. In computer simulations, the method performed well against the backward MI method and the factor-ratio test.

Use the methods just described with extreme caution: With enough testing and releasing of constrained parameters, any model can be made to fit the data, but this outcome per se has little or no scientific merit. Also closely inspect the residuals at each step.
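For instance, a backward search might be sketched in lavaan as follows, continuing the hypothetical model from earlier sections; the loading freed here (F1 =~ X2) is purely illustrative:

lavTestScore(fit.weak, epc = TRUE)  # score tests and EPCs per constraint
# free the loading flagged above; all other loadings stay constrained
fit.pweak <- cfa(model, data = dat, group = "country",
                 group.equal = "loadings",
                 group.partial = "F1 =~ X2")
lavTestLRT(fit.config, fit.pweak)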
A detailed example of testing for MI is presented next. It features the simultaneous analysis of a two-factor model with continuous indicators over groups from two different populations. Special issues in the analysis, such as scaling and model fit assessment, are emphasized. A limitation is that it is not possible in these secondary analyses to specify a priori hypotheses about which particular parameters may be invariant or not invariant over the populations in these analyses. Thus, there is an unavoidable exploratory bent to some of the analyses described next, as models are respecified to better fit the data. In actual primary analyses, the better practice is to respecify the model according to theoretical expectations or the results of prior empirical studies.

Listed in Table 22.1 are the syntax and output files for all analyses in lavaan using the default ML estimator. All analyses converged to admissible solutions. You can download these files from the book's website, and it would be helpful to carefully study the annotated syntax files for both analyses. Note that decisions about model retention in these examples are based mainly on inspecting the residuals, or local model fit, in all groups, not just on values of global fit statistics.

DETAILED EXAMPLE FOR A TWO-FACTOR MODEL OF DIVERGENT THINKING


TABLE 22.1. Script and Output Files for Analyses of Confirmatory Factor Analysis Measurement Models Over Multiple Groups

Analysis                                                              Script file
1. Two-factor model of divergent thinking analyzed in                guo-single-group.r
   separate Chinese and American samples
2. Two-factor model of divergent thinking analyzed                   guo-mi-models.r
   simultaneously over Chinese and American samples

Note. The lavaan package was used for all analyses. Output files have the same names except the extension is ".out."

Guo et al. (2021) administered within Chinese (n = 316) and American (n = 302) samples a total of five divergent thinking tasks.
Three indicators are line-meaning tests, where participants were asked what different geometric figures could symbolize, and two are real-world problem-solving tasks, where participants were queried about possible solutions for social problems in home or school settings. Higher scores on all tasks indicate higher rated originality. The group sizes in this example—about 300 cases—are not at all large. In computer simulations, Meade and Bauer (2007) found that power when analyzing invariance models was generally high when the group size is n = 400, but power estimates for a smaller group size of n = 200 were quite variable. This is because power in invariance testing is affected not just by group size but also by model characteristics, such as the magnitude of factor covariances. Meade and Bauer (2007) found little support that a simple rule about a ratio of group size to the number of indicators could ensure adequate power to detect the absence of MI when the group sizes are not large.

The data for this example are summarized in Table 22.2. On average, Chinese examinees outperformed their American counterparts on the line-meaning tasks while the opposite pattern is true for the real-world problem tasks. The two-factor CFA model with a mean structure fitted to these data is presented in Figure 22.1 using compact graphical symbolism with no scaling constants. Because indicators for each factor have the same metric, the effects coding method is used in this example to scale the factor variances and means. Exercise 1 asks you to verify in a single-sample analysis that (1) the mean structure in the figure is just-identified, so the model will perfectly reproduce the sample means; and (2) dfM = 4 for the whole model, if analyzed with no equality constraints.
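The summary data in Table 22.2 suffice to reproduce these analyses. The sketch below shows one way to enter them for lavaan; the chapter's annotated scripts (Table 22.1) are the authoritative versions and additionally impose the effects coding constraints (effect.coding = TRUE), which this sketch omits, so some estimates will be scaled differently (global fit is unaffected):

library(lavaan)

# correlations from Table 22.2: Chinese above the diagonal, American
# below; both are entered here as lower triangles with getCov()
ch.cov <- getCov('
  1
  .715  1
  .631  .738  1
  .377  .408  .319  1
  .452  .495  .391  .445  1',
  sds   = c(1.532, 1.878, 2.012, 1.002, 1.055),
  names = c("LM1", "LM2", "LM3", "RW1", "RW2"))
am.cov <- getCov('
  1
  .519  1
  .543  .639  1
  .280  .321  .317  1
  .242  .380  .300  .665  1',
  sds   = c(1.094, 1.350, 1.494, 2.257, 1.541),
  names = c("LM1", "LM2", "LM3", "RW1", "RW2"))

model <- '
  LineMeaning =~ LM1 + LM2 + LM3
  RealWorld   =~ RW1 + RW2
'
# configural invariance model fitted to the summary data
fit.config <- cfa(model,
  sample.cov  = list(Chinese = ch.cov, American = am.cov),
  sample.mean = list(c(1.24, 1.47, 1.88, .77, .89),
                     c(.88, 1.35, 1.27, 1.71, 1.37)),
  sample.nobs = c(316, 302),
  meanstructure = TRUE)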

TABLE 22.2. Input Data (Correlations, Standard Deviations, Means) for Analyses
of a Two-Factor Model of Divergent Thinking Over Chinese and American Samples
Chinese
Variable 1 2 3 4 5 M SD
Line meaning
1. LM1 — .715 .631 .377 .452 1.24 1.532
2. LM2 .519 — .738 .408 .495 1.47 1.878
3. LM3 .543 .639 — .319 .391 1.88 2.012

Real-world problem
4. RW1 .280 .321 .317 — .445 .77 1.002
5. RW2 .242 .380 .300 .665 — .89 1.055

American
M .88 1.35 1.27 1.71 1.37
SD 1.094 1.350 1.494 2.257 1.541

Note. These data are from Guo et al. (2021) for originality scores. Chinese (above diagonal), n = 316; American (below
diagonal), n = 302.


FIGURE 22.1. Two-factor model of divergent thinking with a mean structure fitted to data from Chinese and American samples presented in compact graphical symbolism with no scaling constants. (The diagram shows the Line Meaning factor with indicators LM1–LM3 and the Real World factor with indicators RW1 and RW2.)

Reported in the top part of Table 22.3 are values of selected fit statistics for analyses of Figure 22.1 in each of the separate samples, Chinese and American, with the default ML estimator. All solutions described next are admissible. In both within-group analyses, the model passes the chi-square test, and values of approximate fit indexes are not grossly problematic, except for the upper bound of the 90% RMSEA confidence interval in the Chinese group, or .111. Inspection of the residuals for each sample indicates no obvious local fit problem (see the output file for analysis 1 in Table 22.1). For example, no standardized residuals for the covariance structure are significant at the .05 level in the Chinese sample, and one such residual is significant at .05 in the American sample. However, the largest absolute correlation residuals are .037 and .049, respectively, in the Chinese and American samples, so local model fit seems acceptable. Thus, the two-factor model is retained in both single-group analyses. Exercise 2 asks you to explain why factor means estimated in these single-sample analyses are not yet directly comparable over groups.

Reported in the lower part of Table 22.3 are the results for invariance analyses (i.e., multiple-group CFA) of the two-factor model. Note in the table that (1) the model chi-square for the configural invariance model with no cross-group equality constraints, or 11.552, is just the sum of the model chi-squares from the two single-group analyses, or 4.377 + 7.175. Likewise, (2) the degrees of freedom for the configural invariance model, or 8, is just the sum of the model degrees of freedom from the single-sample analyses, or 4 + 4. You should verify that the within-group residuals for the configural invariance model are identical to the residuals for each respective sample in the single-group analyses of Figure 22.1 (see the output files for analyses 1–2, Table 22.1). Because the configural invariance model has the same local fit in the Chinese and American samples as in the single-group analyses, this first invariance model is retained.

TABLE 22.3. Values of Selected Global Fit Statistics for a Two-Factor Model of Divergent Thinking Analyzed over Chinese and American Samples

Model              χ²ML    dfM  p       Model comparison  χ²D     dfD  p       RMSEA [90% CI]     CFI   SRMR

Single-group analyses
1. Chinese          4.377   4   .357    —                 —       —    —       .017 [0, .088]     .999  .015
2. American         7.175   4   .127    —                 —       —    —       .051 [0, .111]     .994  .016

Invariance (multiple-group) models
3. Configural      11.552   8   .172    —                 —       —    —       .038 [0, .082]     .997  .015
4. Weak            25.516  11   .008    4 vs. 3           13.964  3    < .001  .065 [.032, .099]  .988  .043
5. Strong          60.549  14   < .001  5 vs. 4           35.033  3    < .001  .104 [.078, .131]  .961  .058
6. Partial strong  25.809  12   .011    6 vs. 4             .293  1    .588    .061 [.028, .094]  .988  .043

Note. In the partial strong invariance model, intercepts for LM2 and RW1 are not constrained to equality over groups.


The weak invariance model with equal loadings across the groups fails both the overall chi-square test (p = .008) and the chi-square difference test (p < .001) of its fit relative to that of the configural invariance model—see Table 22.3. Values of differences in approximate fit indexes between the configural and weak invariance models are mixed: ΔCFI = .009, which, based on some interpretative guidelines, indicates that the null hypothesis of invariance should be retained, and ΔRMSEA = .027, which exceeds the amount in some guidelines for retaining the weak invariance hypothesis. Results at the level of the within-group residuals are also mixed: On one hand, a total of 11/15 standardized residuals in the Chinese sample and 7/15 in the American sample are significant at the .05 level—see the output file for analysis 2 in Table 22.1. But, on the other hand, all absolute correlation residuals in both groups are < .10, and the largest absolute correlation residuals in the Chinese and American samples are, respectively, .079 and .088. In their analyses for the same model and data, Guo et al. (2021) retained the weak invariance model. I have no real objection to this decision, given the (admittedly mixed) evidence for local fit just described, so the weak invariance model is retained in this present analysis, too.

The strong invariance model with equal loadings and intercepts over groups fails both the chi-square test and the chi-square difference test for its relative fit to that of the weak invariance model with equal loadings only—see Table 22.3. The within-group residuals are problematic, too: In the Chinese sample, there are 3 correlation residuals that equal or exceed .10, which I believe is unfavorable in such a small model. When the intercepts are constrained to equality over groups, the mean structure in the strong invariance model is no longer just-identified, so differences between observed and predicted indicator means, or mean residuals, can depart from zero. In both groups, all standardized mean residuals are significant at the .01 level. The magnitudes of the corresponding mean residuals are not strikingly large—they are all < .10 standard deviations in absolute value (see the output file for analysis 2, Table 22.1)—but their pattern in both samples is disquieting. Given all these results, I would not retain the strong invariance model, which mirrors the decision taken by Guo et al. (2021) for the same model and data.

Guo et al. (2021) analyzed two different partial strong invariance models for their originality data: In one model, the equality constraints on the intercepts for all three line-meaning indicators were released, and in the second model the equality constraints on the intercepts for the two real-world problem-solving indicators were released (see Figure 22.1). Neither model had satisfactory fit (Guo et al., 2021, p. 7). For pedagogical reasons, I take a different strategy here: In the full strong invariance model, the indicators LM2 and RW1 have the two largest MI values for their equality-constrained intercepts (see the output file for analysis 2, Table 22.1). Both equality constraints just mentioned are released in an alternative partial strong invariance model; that is, only 3 out of 5 intercepts are invariant in this model. This specification is data-driven, but my intention is to demonstrate the comparison of the Chinese and American samples on the two factor means. Now, my decision as a researcher would be to stop at the weak invariance model for these data, but as textbook author I want to teach a few things about factor mean contrasts, which requires at minimum a partial strong invariance model.
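Continuing the summary-data sketch from earlier, the partial strong invariance model can be specified with lavaan's group.partial argument (again, the chapter's own scripts in Table 22.1 are the authoritative, effects-coded versions):

fit.partial <- cfa(model,
  sample.cov  = list(Chinese = ch.cov, American = am.cov),
  sample.mean = list(c(1.24, 1.47, 1.88, .77, .89),
                     c(.88, 1.35, 1.27, 1.71, 1.37)),
  sample.nobs = c(316, 302),
  group.equal = c("loadings", "intercepts"),
  group.partial = c("LM2 ~ 1", "RW1 ~ 1"))  # free these two intercepts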


TABLE 22.4. Maximum Likelihood Parameter Estimates


for the Covariance Structure in a Two-Factor Model of Divergent
Thinking Analyzed over Chinese and American Samples
Chinese American
Parameter Unst. SE St. Unst. SE St.
Equality-constrained estimates
Loadings
LM1 .767 .026 .773 .767 .026 .682
LM2 1.128 .029 .915 1.128 .029 .825
LM3 1.105 .030 .814 1.105 .030 .756
RW1 1.002 .050 .667 1.002 .050 .675
RW2 .998 .050 .666 .998 .050 .955

Unconstrained estimates
Error variances
LM1 .907 .088 .402 .668 .066 .536
LM2 .567 .111 .162 .588 .086 .319
LM3 1.429 .153 .338 .903 .103 .428
RW1 .586 .069 .555 2.624 .314 .545
RW2 .586 .069 .557 .209 .229 .087

Factor variances and covariance


Line Meaning 2.296 .208 1.000 .986 .100 1.000
Real World .469 .065 1.000 2.186 .229 1.000
Line Real .761 .090 .734 .662 .109 .451

Note. These results are for model 6 (Table 22.3). Unst., unstandardized; St., standardized. Stan-
dardized estimates for error variances are proportions of unexplained variance.

Briefly, the global fit of the partial strong invariance model just described is unsurprisingly very similar to that of the weak invariance model—see Table 22.3—and the within-group residuals are also nearly the same over the two models. Reported in the top part of Table 22.4 are equality-constrained parameter estimates for the covariance part of the model. As expected, (1) the unstandardized loading for each indicator is identical over the Chinese and American samples, but (2) the standardized estimates are different over the two groups (Chapter 12). The error variances, reported in the middle part of the table, are not constrained to equality over groups. Exercise 3 asks you to analyze the partial strong invariance model but with cross-group equality constraints imposed on the error variances. I can tell you that the fit of this model is poor, so we do not consider further the hypothesis of equal error variances over groups. Estimates of factor variances and covariances within each group are reported in the bottom part of Table 22.4. We will revisit the factor variances later when we compare the groups on the factor means, but the association between the divergent thinking factors is positive in both groups. The factor correlations (.734, .451) are not directly comparable over the two samples, but they give a rough sense of effect size for each group.

Listed in Table 22.5 for each group are the observed and predicted means for all indicators plus mean residuals and standardized mean residuals, none of which is significant. Thus, the partial strong invariance model closely predicts the observed means in both the Chinese and American samples. Estimates for the mean structure in Figure 22.1 are reported in Table 22.6. Estimates for equality-constrained intercepts for three indicators are reported in the top part of the table, and results in the middle part of the table are for the two indicators with freely estimated intercepts in each group. Although slopes for regressing all five indicators on their factors are identical over groups (Table 22.4), the Chinese and American samples have different relative standings on two indicators, LM2 and RW1, when the score on the corresponding factor is zero. For example, the Chinese sample has a lower relative standing on both indicators just mentioned compared with the American sample. However, the intercepts are only one component of the predicted means on these variables. For instance, the Chinese sample has a higher predicted mean on the LM2 indicator than the American sample—see Table 22.5. These patterns are also consistent with effects coding in the Chinese sample, where some intercepts must be negative so that the total for all indicators of the same factor is zero (see the syntax file for analysis 2, Table 22.1).

Estimates of the factor means are reported in the bottom part of Table 22.6. For the line-meaning factor, the estimate for the Chinese sample, or 1.523, is higher than the mean for the same factor in the American sample, or 1.106. If we treat the within-group factor variances, or 2.296 for the Chinese sample and .986 for the American sample (Table 22.4), as similar enough to pool (homogeneity is assumed), we can calculate a standardized mean difference for Line Meaning (LM) as

dLM = (1.523 − 1.106) / √[(2.296 + .986)/2] = .33

Thus, the mean of the Chinese sample on the line-meaning factor is about one-third of a standard deviation higher than the mean of the American sample. The assumption of homoscedasticity can be directly tested by (1) constraining the factor variances to equality and (2) comparing the relative fit of the model just described against that of the baseline partial strong invariance model for this example.




TABLE 22.5. Observed and Predicted Indicator Means in a Two-Factor Model
of Divergent Thinking Analyzed over Chinese and American Samples

Indicator   Observed mean   Predicted mean   Mean residual   Standardized mean residual

Chinese
  LM1           1.240           1.256            –.016               –.709
  LM2           1.470           1.470             0                    0
  LM3           1.860           1.842             .018                .790
  RW1            .770            .770             0                    0
  RW2            .890            .890             0                    0

American
  LM1            .880            .867             .013                .866
  LM2           1.350           1.350             0                    0
  LM3           1.270           1.282            –.012               –.866
  RW1           1.710           1.710             0                    0
  RW2           1.370           1.370             0                    0

Note. These results are for model 6 (Table 22.3).

TABLE 22.6. Maximum Likelihood Parameter Estimates for the Mean
Structure in a Two-Factor Model of Divergent Thinking Analyzed over
Chinese and American Samples

                        Chinese               American
Parameter          Estimate     SE       Estimate     SE

Equality-constrained estimates
Intercepts
  LM1                .089      .047        .089      .047
  LM3                .159      .055        .159      .055
  RW2                .061      .052        .061      .052

Unconstrained estimates
Intercepts
  LM2               –.248      .061        .204      .070
  RW1               –.061      .052        .397      .136

Factor means
  Line Meaning      1.523      .091       1.016      .073
  Real World         .830      .049       1.311      .097

Note. These results are for model 6 (Table 22.3).




For these data, the fit of the model with equal factor variances over the Chinese and American samples is very poor—Exercise 4 asks you to verify this statement—so the homogeneity assumption is problematic.

An alternative method for estimating effect size when within-group variances are heteroscedastic is to calculate two different d statistics, each based on the standard deviation in just one group (Kline, 2013a, chap. 5), as follows:

    dLM(Chinese) = (1.523 − 1.106) / √2.296 = .28

    dLM(American) = (1.523 − 1.106) / √.986 = .42

where the denominators are, respectively, the estimated factor standard deviations in the Chinese and American samples. Thus, the group contrast on the line-meaning factor is approximately .30–.40 standard deviations, depending on the standardizer (i.e., the denominator), which varies by group. Choi et al. (2009) described additional two-group latent contrast effect size measures. Although less informative than an effect size, a standard t test could be calculated, given the factor means, standard deviations, and group sizes, when assuming homoscedasticity. An alternative is the Welch–James t test, which does not assume homogeneity of variance and for which degrees of freedom are empirically estimated (Kline, 2013a, chap. 3), but applied to the same estimated factor means and standard deviations.
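These computations are easy to script. The following minimal R sketch (not part of the original analysis files for this example) reproduces the three effect sizes just described from the estimates in Tables 22.4 and 22.6:

```r
# Estimated factor means and variances for Line Meaning (Tables 22.4, 22.6)
m_chinese  <- 1.523;  m_american <- 1.106   # factor means
v_chinese  <- 2.296;  v_american <- 0.986   # factor variances

# Pooled standardizer: assumes homogeneity of factor variances
d_pooled <- (m_chinese - m_american) / sqrt((v_chinese + v_american) / 2)

# Group-specific standardizers: no homogeneity assumption
d_chinese  <- (m_chinese - m_american) / sqrt(v_chinese)
d_american <- (m_chinese - m_american) / sqrt(v_american)

round(c(d_pooled, d_chinese, d_american), 2)   # .33 .28 .42
```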
To summarize, the basic factor structure (i.e., a two-factor model) of five divergent thinking tasks administered by Guo et al. (2021) is invariant over Chinese and American samples, including the strengths of the associations between factors and indicators (weak invariance). But there is poor evidence for strong invariance in that some, but not all, intercepts may be equal over the two groups. This means that the Chinese and American samples did not all have the same relative standing on some indicators, given the same position on the underlying conceptual variables. Given the nature of the indicators, for which originality scores were assigned by raters (i.e., they are not self-report measures), differential additive response bias would not explain the differences in intercepts. Comparison of observed means over these tasks may also be problematic due to nonequivalence of the intercepts and factor variances. Guo et al. (2021) described stronger evidence of partial strong invariance for fluency, not originality, scores derived from the same divergent thinking tasks.

PRACTICAL SUGGESTIONS FOR MEASUREMENT INVARIANCE TESTING

Recommendations for managing the complexity of MI analyses in multiple-group CFA are listed next:

1. Consult theory or results from prior empirical studies when specifying MI hypotheses and planning the order in which those hypotheses are to be tested (e.g., free vs. constrained baseline approach).

2. In written reports, explain the rationale for model specification at each point in the analysis.

3. Triple-check program code because it is easy to make mistakes. Syntax files for invariance analyses can be long and complex (e.g., Table 22.1; see also Svetina et al., 2020). Also, thoroughly annotate syntax that specifies different invariance hypotheses, scales factor variances and means, or controls the output. Annotated syntax is an invaluable "researcher's diary" of decisions made during the analysis, and there are many decisions that should be archived in MI analyses.

4. Know the defaults in your computer tool for multiple-group CFA. In Mplus, for example, loadings and intercepts are assumed to be equal over groups, but indicator error variances and factor variances and covariances are freely estimated in all groups by default. Factor means are fixed to zero in the first group but are freely estimated in all subsequent groups (Muthén & Muthén, 1998–2017). In lavaan, the default reference variable method for scaling factor variances must be explicitly disabled if the effects coding method is used (Rosseel et al., 2023). Program defaults can generally be modified by the researcher if they do not correspond to the target model or planned analysis (see the sketch following this list).

5. Make a strong statement about transparency in a complex analysis with multiple decision points by preregistering the analysis plan and making the data, syntax file(s), and output file(s) available to others.




6. Do not rely solely on significance tests or approximate fit indexes, including change-in-fit statistics, when deciding whether or not to retain an invariance model. Also inspect and describe the residuals in each group.
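As mentioned in point 4, program defaults determine exactly which parameters are constrained at each step. Here is a minimal lavaan sketch of a free baseline sequence for the two-factor model of this example; the data object (dat) and grouping variable (country) are hypothetical:

```r
library(lavaan)

model <- '
  LineMeaning =~ LM1 + LM2 + LM3
  RealWorld   =~ RW1 + RW2
'

# Configural invariance: no cross-group equality constraints
fit_configural <- cfa(model, data = dat, group = "country")

# Weak invariance: equality-constrained loadings
fit_weak <- cfa(model, data = dat, group = "country",
                group.equal = "loadings")

# Strong invariance: equal loadings and intercepts
fit_strong <- cfa(model, data = dat, group = "country",
                  group.equal = c("loadings", "intercepts"))

# Partial strong invariance: free the LM2 and RW1 intercepts,
# mirroring model 6 in this example
fit_partial <- cfa(model, data = dat, group = "country",
                   group.equal = c("loadings", "intercepts"),
                   group.partial = c("LM2 ~ 1", "RW1 ~ 1"))

# Chi-square difference tests for the nested models
lavTestLRT(fit_configural, fit_weak, fit_strong)
```

Note that this sketch relies on lavaan's default reference variable scaling, not the effects coding method used for analysis 2 in Table 22.1.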
MEASUREMENT INVARIANCE TESTING IN CATEGORICAL CFA

This description of basic concepts for invariance testing in multiple-group categorical CFA assumes an estimator, such as robust WLS, that treats the data as ordinal, not continuous. It is also necessarily brief for a few reasons: There are many complexities when analyzing categorical invariance models, which have covariance, threshold, and mean structures. Some of these complications involve identification, and others concern testing strategy. Debate about the proper sequence to use for testing categorical invariance hypotheses is ongoing (Svetina et al., 2020; Wu & Estabrook, 2016). Some of these challenges can be sidestepped by analyzing ordinal data with methods for continuous data, such as robust ML. In this case, invariance testing proceeds as described in the previous section. But this option is not ideal for Likert-scale items with few response categories, such as ≤ 5 (Chapter 18). Thus, it is beyond the scope of this introductory chapter to survey the details of categorical invariance testing. My goal instead is to prepare you to learn more about this topic through selective review of key works.

Recall that indicators that are items with Likert (ordered-categorical) response formats are treated in categorical CFA as discretized versions of continuous latent response variables that are assumed to be normally distributed. Each item is associated with a latent response variable through a set of thresholds, which represent points on the underlying continuum where responses to the item cross over from one category to the next, such as from agree to strongly agree. The number of thresholds for each item equals the number of its response categories minus one. In the measurement part of the model, the latent response variables—not the observed variables (items)—are the indicators of common factors that approximate target concepts (e.g., Figure 18.2).

Measurement invariance hypotheses in multiple-group categorical CFA have different interpretations compared with their counterparts for continuous indicators in multiple-group CFA. For example, the configural invariance hypothesis assumes the same number of common factors and the same correspondence between factors and indicators, but indicators in categorical CFA are latent response variables, not observed variables. A complication occurs when, depending on the number of item response categories, a configural invariance model as just described may not be identified without imposing cross-group equality constraints, and these constraints are invariance assumptions. That is, a pure configural invariance model with freely estimated loadings, thresholds, and error variances in all groups may not be the baseline model for invariance testing in categorical CFA.

Suppose that all items are dichotomous with two response options, such as agree versus disagree. Binary items have a single threshold parameter that designates the cross-over point on the corresponding latent response variable between the two item categories. A method by Millsap (2011, pp. 129–130) for identifying a categorical CFA model for binary items, where each latent response variable depends on a single common factor, requires (1) the exclusion of all latent response variables from the model's mean structure (i.e., coefficients for direct tracings of the constant are all fixed to zero, which is an untested invariance assumption); (2) the specification that the means of the common factors are fixed to zero in a reference group but freely estimated in all other groups; and (3) the specification of invariant item thresholds over groups. Thus, the baseline model just described assumes threshold invariance, which is required for identification. The more restricted model tested next has equal thresholds and loadings over groups. Thus, the hypothesis of invariant loadings and thresholds is tested in a single step. With equality-constrained loadings, it may be possible to release the equality constraint for an item threshold (e.g., Millsap, 2011, pp. 141–146), but loadings and thresholds cannot be independently tested for invariance over groups. An alternative baseline model in a free baseline approach for binary data has invariant loadings and thresholds, and equality constraints are released based on hypotheses, again within the limits on model identification (Muthén & Asparouhov, 2002).

When items are polytomous with at least three response categories (i.e., there are ≥ 2 thresholds per item), requirements for model identification are more flexible, especially if each latent response variable loads on a single common factor.




For example, the threshold structure is identified if (1) the same threshold parameter is constrained to equality over groups, and (2) for every latent response variable that is the reference (marker) variable for its common factor, a second threshold parameter is constrained to equality over groups. All latent response variables are excluded from the mean structure, and common factor means are fixed to zero in the reference group and freely estimated in the other groups (Millsap, 2011, pp. 127–129). There are additional identification requirements for models where some latent response variables depend on multiple common factors, for both dichotomous and polytomous items (Millsap, 2011, pp. 130–131).

The less restrictive identification requirements for models with polytomous items just described can give the false impression that invariance testing could routinely proceed in the same basic sequence as when the data are continuous. That is, fit as a baseline the configural invariance model with freely estimated loadings and thresholds; next test a weak invariance model with equal loadings over groups but with freely estimated thresholds, and so on. But Wu and Estabrook (2016) described a problem with this parallel: Imposing invariance constraints on loadings in a configural invariance model as just described can actually lead to statistically nonequivalent models, depending on how the baseline model is identified. This means that different options for scaling the common factors, or for scaling either the variances of the latent response variables or the variances of their error terms (i.e., delta vs. theta parameterization; Chapter 18), affect the fit of a model with equality-constrained loadings, which can lead to different conclusions (e.g., Wu & Estabrook, 2016, p. 1022).

Wu and Estabrook (2016) described various sequences for testing invariance hypotheses such that the scaling method for the baseline model does not change the model-implied probabilities of the observed responses in more restricted invariance models tested at later steps. This property is realized in their method because models that represent more restricted levels of invariance are not defined through simple respecification of the baseline model, and thus are not affected by how the baseline model is identified. Instead, more restricted models are specified in their own right based on principles of identification for such models, but more restricted models are still properly nested under the baseline model. In an extensive tutorial, Svetina et al. (2020) demonstrated one of Wu and Estabrook's (2016) testing sequences for polytomous data in both Mplus and the combination of lavaan and semTools (Jorgensen et al., 2022) in R. The semTools package has an option for automatically identifying the variances of the latent response variables based on guidelines by Wu and Estabrook (2016).

After establishing configural invariance in the tutorial by Svetina et al. (2020), threshold invariance was tested next, followed by invariance testing for the loadings. The order just stated differs from the typical sequence for continuous data, where loading invariance is tested before intercept invariance (e.g., Table 22.2), but Wu and Estabrook (2016) emphasized the importance of establishing threshold invariance in multiple-group categorical CFA before testing invariance hypotheses for other measurement parameters, if it is possible to do so given model identification requirements. Indeed, they argued that loading invariance cannot be correctly tested without first establishing basic threshold invariance. For another example of invariance testing based on Wu and Estabrook's (2016) guidelines, see Mastrotheodoros et al. (2021), who evaluated longitudinal MI for a categorical CFA model of ethnic, national, and personal identity in a sample of immigrant youth tested over three occasions.
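A minimal sketch of this lavaan/semTools workflow follows. The item and grouping names are hypothetical, and the argument values shown (e.g., ID.cat = "Wu.Estabrook.2016" for identifying the latent response scales) reflect my reading of the semTools documentation for measEq.syntax(), so verify them against the current version of the package:

```r
library(lavaan)
library(semTools)

model <- 'F =~ y1 + y2 + y3 + y4'   # y1-y4 are ordinal items
items <- c("y1", "y2", "y3", "y4")

# Configural baseline, identified per Wu and Estabrook (2016)
fit_config <- measEq.syntax(configural.model = model, data = dat,
                            ordered = items, parameterization = "delta",
                            ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
                            group = "country", return.fit = TRUE)

# Threshold invariance is tested before loading invariance
fit_thresh <- measEq.syntax(configural.model = model, data = dat,
                            ordered = items, parameterization = "delta",
                            ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
                            group = "country", group.equal = "thresholds",
                            return.fit = TRUE)

# Thresholds plus loadings
fit_load <- measEq.syntax(configural.model = model, data = dat,
                          ordered = items, parameterization = "delta",
                          ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
                          group = "country",
                          group.equal = c("thresholds", "loadings"),
                          return.fit = TRUE)

lavTestLRT(fit_config, fit_thresh, fit_load)
```

If return.fit = FALSE, the function instead returns the generated model syntax object, which can be inspected before fitting.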
OTHER STATISTICAL APPROACHES TO ESTIMATING MEASUREMENT INVARIANCE

There are other statistical techniques and methods besides multiple-group CFA for the evaluation of MI. For example, Brown (2015) described multiple indicators and multiple causes (MIMIC) modeling with covariates. In this approach, a single model is fitted to the data from all groups combined; that is, the total sample is not partitioned into groups as in multiple-group CFA. Instead, group membership is represented in the model with coding variables, which are treated in the analysis as covariates, or observed exogenous variables. For instance, the coding variable in dummy coding for just two groups is the binary variable (0, 1), two dummy codes represent membership in three groups (e.g., Table 7.1), and so on. The data analyzed are the variances and covariances of the indicators and the coding variables but not observed means. Another key difference is that invariance of loadings, error variances and covariances, and factor variances and covariances is assumed; that is, MIMIC modeling with covariates tests only the equality of intercepts and factor means over groups.




A potential advantage is that smaller sample sizes are needed in MIMIC modeling because a single model is fitted to the data matrix for the whole sample regardless of the number of groups (e.g., 2, 3, 10, etc.), which simplifies the analysis compared with multiple-group CFA.

There are two basic steps in MIMIC modeling with covariates (Brown, 2015): (1) Fit the CFA model to the data matrix for the whole sample. This model should have acceptable fit. If so, then next (2) add to the model covariates that represent group membership as causes of factors and indicators; that is, all measurement model variables except error terms are regressed on the covariates. For two groups, the unstandardized direct effect of the single dummy variable for group membership on a common factor is the estimated group mean difference on that factor. If this difference is appreciable, the populations from which the groups were sampled may have different levels on the corresponding theoretical variable. The unstandardized coefficient for regressing a continuous indicator on the covariate is the estimated group mean difference on that indicator, while controlling for the factor. If the estimated difference is appreciable, then differential functioning in the indicator is suggested, which is analogous to the test for equal intercepts in multiple-group CFA. Although the model has no mean structure, it is the unstandardized direct effects of the covariates that provide information about group differences on the factors or indicators. See Brown (2015, pp. 273–283) for an example.
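To make the two steps concrete, here is a minimal lavaan sketch with hypothetical variable names. One caution: regressing the factor and all of its indicators on the covariate at once is not identified, so direct effects of the covariate on indicators are typically added selectively (e.g., guided by theory or modification indices):

```r
library(lavaan)

# Step 1: CFA model fitted to the data from all groups combined
cfa_model <- 'F =~ x1 + x2 + x3 + x4'
fit_cfa <- cfa(cfa_model, data = dat)

# Step 2: add the dummy-coded covariate (0 = group A, 1 = group B)
mimic_model <- '
  F  =~ x1 + x2 + x3 + x4
  F  ~ group    # estimated group mean difference on the factor
  x3 ~ group    # direct effect: possible differential functioning of x3
'
fit_mimic <- sem(mimic_model, data = dat)
summary(fit_mimic)
```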
Lúcio et al. (2017) compared results from multiple-group CFA and MIMIC modeling with covariates about whether scores from word recognition and spelling tests are comparable over samples of children ages 6–15 years in control versus attention-deficit/hyperactivity disorder (ADHD) samples. Two different MIMIC models were analyzed: one treating ADHD status as a dichotomous variable (control vs. ADHD groups) and the other treating ADHD symptoms as a continuous variable where cases were not partitioned into groups. Other covariates included IQ scores and the presence of symptoms for any mental disorder (including ADHD). The hypothesis of strong invariance (equal loadings and intercepts) was supported in multiple-group CFA results. Regardless of whether ADHD status was measured as categorical or continuous, there was little evidence in MIMIC modeling for differential functioning over the word recognition and spelling tasks; that is, ADHD status predicted the factors but not the indicators controlling for the factors. Given all these results, Lúcio et al. (2017) concluded that observed differences in task performance are due to true differences in ADHD status and not to differential functioning of the indicators.

Millsap (2011) described item response theory (IRT) as a flexible and powerful framework for evaluating MI. There is also a large body of literature about the role of IRT-based analyses in the evaluation of DIF, including extensions of the approach to mixture distributions, or latent classes that represent unobserved heterogeneity over subpopulations (Zumbo et al., 2015). In recent special issues on MI in Frontiers in Psychology (van de Schoot et al., 2015) and Sociological Methods & Research (Davidov et al., 2018), two additional themes about analysis options were addressed, including (1) the concept of approximate MI and (2) new or extended approaches to evaluate MI in complex data sets. Each of these topics is briefly described next.

The idea behind approximate measurement invariance is that cross-group equality constraints that specify exactly zero differences over groups or occasions are replaced in Bayesian methods by approximate zero constraints, or a priori distributions of parameter values around zero that represent slight departures from perfect invariance. The researcher must specify what constitutes a "slight" (i.e., trivial, negligible) deviation from zero difference, which requires strong psychometric knowledge about the measures. In computer simulations by van de Schoot et al. (2013), an approximate MI approach was more likely to detect true measurement models with many small differences in loadings or intercepts over populations than conventional full or partial invariance methods that assume zero differences in some parameters. There are increasing numbers of empirical studies in which the Bayesian approach to assessing approximate MI is applied—see Lek et al. (2018) for examples in survey research.
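As an illustration of how approximate invariance constraints can be specified in software, the following sketch uses the Bayesian package blavaan; the wiggle arguments reflect my understanding of that package's interface for approximate equality, so treat this as a starting point and consult the current blavaan documentation:

```r
library(blavaan)

model <- 'F =~ x1 + x2 + x3 + x4'

# Loadings and intercepts are approximately, not exactly, equal over
# groups: each cross-group difference gets a prior concentrated near
# zero, with wiggle.sd controlling what counts as a "slight" deviation
fit_approx <- bcfa(model, data = dat, group = "country",
                   wiggle = c("loadings", "intercepts"),
                   wiggle.sd = 0.05)
```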




Analyzing models over large numbers of groups, such as 100 or more, is challenging in multiple-group CFA due to the often poor fit of weak invariance models with many large values for modification indices. Muthén and Asparouhov (2018) described two alternatives. One is based on the alignment method, which estimates factor means and variances in each group while assuming only approximate MI. The final aligned measurement model has the same fit to the data as the configural invariance model, which has no cross-group equality constraints and is the best fitting model over all groups. The computer uses Bayesian estimation to reweight the estimates in the configural invariance model to minimize the degree of noninvariance in the aligned model for every pair of groups. The method assumes that most loadings and intercepts are invariant in the true model. Muthén and Asparouhov (2018) also described an IRT-based method of two-level factor analysis that assumes only that all parameters are approximately equal, with each parameter having random variation that makes its values slightly different over groups.

There are still more statistical options for MI analyses beyond the ones just described—see Davidov et al. (2018), Widaman et al. (2023), and van de Schoot et al. (2015). Whatever analytical technique is used, a challenge in MI studies is for researchers to explain the practical or real-world implications of small deviations from invariance, including the definition of what a small (i.e., ignorable) deviation actually means (Putnick & Bornstein, 2016). The concern is that failure to demonstrate invariance should not preclude subsequent research about differences over populations, occasions, modes of administration, informants, or other conditions of measurement. It could turn out that some concepts are simply not comparable over conditions, such as cultures in which certain constructs are seen as irrelevant or experienced in very different ways, but smaller differences in other domains may not preclude meaningful comparisons, if there is a rationale about how and why smaller departures from invariance are inconsequential.

SUMMARY

The evaluation of measurement invariance is an active research area with hundreds of published studies in many disciplines such as psychology, education, and cross-cultural studies, among others. The basic question is whether scores from the same measures can be interpreted the same way over populations, cultures, nations, languages, times of measurement, or modes of test administration, among other conditions. In traditional SEM, the technique of multiple-group CFA applied to continuous or categorical data is the primary statistical method in MI studies. It involves the imposition of cross-group equality constraints on unstandardized measurement parameters that include loadings, intercepts or thresholds, or error variances. If the fit of the model with equality constraints is appreciably worse than that of an unconstrained (baseline) model, the hypothesis of equality is rejected. The most basic form for continuous data is configural invariance, where the same CFA model is fitted to data from two or more groups but with no constraints other than those needed to identify the model. The next level is weak invariance, which assumes the unstandardized loading for each indicator is equal across the groups. The hypothesis of strong invariance requires the additional equality of intercepts. Strict invariance requires all the equality constraints mentioned to this point plus error variance and covariance homogeneity. Partial measurement invariance is indicated when most, but not all, measurement parameters are equal over groups at a particular level of invariance analysis, such as for weak invariance (i.e., some loadings are freely estimated in each group). Evidence for at least partial strong invariance is generally required to directly compare groups on factor means. Other statistical techniques, such as methods based on IRT or Bayesian estimation, are more flexible than CFA in evaluating approximate invariance. Specifically, these alternatives allow for nonzero but small differences in measurement parameters over groups, which is probably more realistic.

LEARN MORE

Putnick and Bornstein (2016) offer suggestions for improving the state of practice and reporting in CFA-based MI studies, Wu and Estabrook (2016) describe how to scale baseline invariance models that do not lead to nonequivalent models when equality constraints are imposed in later steps, and Svetina et al. (2020) is a tutorial for implementing Wu and Estabrook (2016) scaling for latent response variables and common factors in models for ordinal data.

Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90.

Svetina, D., Rutkowski, L., & Rutkowski, D. (2020). Multiple-group invariance with categorical outcomes using updated guidelines: An illustration using Mplus and the lavaan/semTools packages. Structural Equation Modeling, 27(1), 111–130.

Wu, H., & Estabrook, R. (2016). Identification of confirmatory factor analysis models of different levels of invariance for ordered categorical outcomes. Psychometrika, 81(4), 1014–1045.




EXERCISES

1. Verify in a single-group analysis of Figure 22.1 that (a) the mean structure is just-identified and (b) dfM = 4 for the whole model.

2. Why are the estimated factor means in the single-sample analyses of Figure 22.1 (analysis 1, Table 22.1) not directly comparable over groups?

3. Fit the partial strong invariance model (no. 6 in Table 22.3) but with cross-group equality constraints on indicator error variances to the data in Table 22.2. Describe whether the assumption of error variance homogeneity is plausible.

4. Fit the partial strong invariance model but with cross-group equality constraints on the two factor variances. Describe whether the assumption of factor variance homogeneity is reasonable.



23

Best Practices in SEM

The techniques that comprise the whole of SEM—traditional (covariance-based), nonparametric, and com-
posite (variance-based)—provide researchers with an extensive set of tools for testing hypotheses with mul-
tivariate data. As with any complex set of statistical techniques, though, its use must be guided by reason.
Some of the issues mentioned next were raised in earlier chapters, but they are discussed together here in
the form of best practices. They are addressed under categories that include specification, identification,
measures, sample and data, analysis, respecification, and reporting. Other topics include tips to make your
work exceptional and the crux of SEM, or the most important things about it. These categories are not mutu-
ally exclusive, but they offer a useful way to focus this discussion. You are encouraged to use these points as
a checklist for your own analyses. By adopting best practices and avoiding common mistakes, you help to
improve the state of practice and the quality of the research literature. The psychologist William James (1917,
p. 348) wrote “there is no worse lie than a truth misunderstood by those who hear it,” which is a thoughtful
insight about this chapter’s aim and that of the whole book.

RESOURCES

Listed in the top part of Table 23.1 are citations for works about best practices and reporting guidelines. These include reporting standards for traditional studies by the American Psychological Association (Appelbaum et al., 2018; Chapter 3), Hair et al.'s (2019) suggestions for reporting in PLS-PM studies, and Schreiber's (2017) suggestions for reporting CFA results. Listed in the bottom part of the table are works about the use of SEM in particular disciplines such as information systems (Benitez et al., 2020) and imaging genetics (Huisman et al., 2018), among others.

BOTTOM LINES AND STATISTICAL BEAUTY

The points summarized next deal with the role of SEM as a tool for science; that is, what it all means in the end:

• The technique of SEM is about testing theories, not just models. The model analyzed represents predictions based on a particular body of work, but outside of this role, the model has little intrinsic value. This means that the model provides a vehicle for testing ideas, and the real goal of SEM is to evaluate these ideas in meaningful and valid ways. Whether or not a model is retained is incidental to this goal. Besides, any wrong model can be retained simply by adding enough free parameters until it is so complex that it would fit basically any data, but that is testing nothing.

• If no model is retained, then explain the implications for theory. For example, in what way(s) could theory be incorrect, based on your results?

• If a model is retained, explain just what was learned as a result of your study; that is, what is the substantive significance of your findings? How has the state of knowledge in your area been advanced? What new questions are posed? What comes next?





TABLE 23.1. Citations for Works About Best Practices and Standards
or Guidelines for Reporting of the Results in Structural Equation Modeling

Citation                        Comment

Best practices
  Mueller & Hancock (2008)      Best practice suggestions with analysis examples
  Morrison et al. (2017)        Recommendations for correct use in psychological research

General reporting guidelines or standards
  Appelbaum et al. (2018)       APA JARS-Quant reporting standards for SEM studies
  Hair et al. (2019)            Considerations and metrics about reporting results for PLS-PM analyses
  Schreiber (2017)              Core reporting practices including praise for retaining no model

Applying SEM and reporting results in particular disciplines
  Benitez et al. (2020)         Information systems (PLS-PM)
  Fan et al. (2016)             Ecological studies
  Goodboy & Kline (2017)        Communication research
  Huisman et al. (2018)         Imaging genetics
  Igolkina & Samsonova (2018)   Molecular biology
  Ockey & Choi (2015)           Language assessment
  Zhang et al. (2021)           Organizational and management research

Note. APA JARS-Quant, American Psychological Association journal article reporting standards for quantitative studies.

• If your sample size is not large enough to randomly split and cross-validate your analyses, then clearly state this as a limitation. If so, replication is a necessary "what comes next" activity. This is especially true if a model is retained after a specification search that was not part of a preregistered analysis plan.

• A strong analytical method such as SEM cannot compensate for poor study design or shoddy ideas. For example, expressing poorly thought-out hypotheses in a path diagram does not give them credibility. The specification of direct or indirect effects cannot be viewed as a substitute for an experimental or longitudinal design. Inclusion of an error term for a measure with poor psychometrics cannot somehow transform it into a good measure. Applying SEM in the absence of good design, measures, and ideas cannot spin gold from straw.

MIGHTILY DISTINGUISH YOUR WORK (BE A HERO)

The four things listed next are best practices but, unfortunately, they are rarely seen in the research literature individually—much less all together. Their general absence also marks the relative immaturity of our collective use of SEM in the behavioral sciences. You can greatly distinguish your own work by following even one of the best practices listed next. All four? Then count yourself as a hero among SEM practitioners:

• Describe local model fit, not just global model fit. Report the residuals either in article tables or appendices for smaller models or in the supplemental materials for a journal article about a larger model.




Doing so has been demonstrated many times in this book (e.g., Tables 8.2, 9.4, 12.4, 14.4, 17.4, 18.3, and 19.3). Show us that the residuals fail to indicate appreciable discrepancies between model and data at the level of the measured variables, if any model is retained. Reporting solely about global fit is insufficient.

• Acknowledge directly the phenomenon of equivalent or near-equivalent models. Show us at least a few plausible alternatives to a retained model, and offer us arguments about why any retained model would be preferred over alternative models that explain the data just as well, or nearly so. Without such arguments, there can be no preference. Actually dealing with this issue can reduce one of the most pernicious forms of confirmation bias in the SEM literature.

• Preregister your analysis plan; specifically, tell us how and why the initial model will be respecified, if its fit to the data is poor. If the actual analysis differs from the preregistered plan, then disclose that fact in the written report and explain why.

• Report evidence that your findings replicate, either internally, such as in cross-validation, or, even stronger, externally, such as in a different sample collected by a different researcher in a new location. Kraft et al. (2009, p. 561) described how genetic epidemiology "learned the importance of replication the hard way" through the publication of many genotype–phenotype associations that failed to replicate. Now top journals in this field generally require evidence for replication. This is a lesson not yet fully appreciated in the SEM literature.

FAMILY RELATIONS

I would bet that many, if not most, readers were more familiar with traditional (covariance-based) methods for SEM before reading this book. Even though traditional SEM was the main focus here, I hope that presentations about the other two members of the family, nonparametric SEM and composite SEM, expanded readers' sense of what else is possible:

• Use insights from Pearl's structural causal model (nonparametric SEM) to help you specify a recursive structural model and plan the study. For example, you can work with directed acyclic graphs where some causes are assumed to be unmeasured in order to enumerate which causal effects are identified versus others that are not identified. Awareness of those effects not identified should prompt you to think about how to measure at least proxies (indicators) of omitted confounders (e.g., Figure 6.1) or locate instruments for causal variables that overlap with the disturbances of outcome variables, or endogeneity (e.g., Figure 6.6); see the sketch at the end of this list.

• Avoid common factor snobbery, or the false belief that common factors are always superior to composites as proxies for conceptual variables. If measurement is assumed to be reflective, such as when tests are developed within classical measurement theory, then common factors are the way to go. But the problem of factor indeterminacy—which is not the same thing as measurement error—is serious enough to (1) blur the association between common factors and the theoretical variables they are supposed to represent and (2) introduce uncertainty in estimates of causal effects between concepts approximated by common factors. The latter form of uncertainty just mentioned is worse when there are fewer indicators per factor and standardized factor loadings are lower, such as around .30, instead of higher, such as .70 or so (Rigdon et al., 2019). Thus, common factors are not royal roads to approximating theoretical variables. Petter (2018) described additional stereotypes about common factors versus composites as proxies.

• Use valid and principled alternatives to reflective measurement that include synthesis theory, which is for forged concepts such as emergent variables made up of their indicators, not the other way around. Composites are natural proxies for emergent variables, not common factors, which require assumptions about covariances among indicators that are incompatible with formative measurement. But composites are also imperfect proxies: They are affected by measurement error in their indicators, and indeterminacy also characterizes their associations with conceptual variables, although not in ways that are yet directly estimable (Rigdon et al., 2019).

• Do not fit to the same data reflective models based on common factors versus formative models based on composites in the hope of finding the "correct" measurement model. There is actually no way to discern the true measurement model in an individual empirical study. Instead, the model is assumed to be correct and then fitted to the data. That model should be well thought out and based on the researcher's best understanding of theory and measurement in a particular area. To do otherwise is HARKing, or hypothesizing after the results are known. Thus, the choice between common factors or composites as proxies should be informed, not accidental, and shared with your readers in written summaries of the results.

• Likewise, do not rely on empirical tests of whether a set of indicators is formative or reflective. For example, do not automatically conclude that a set of indicators is formative if their intercorrelations are not all positive. Also, do not rely solely on corresponding tests of whether a variable is a proper instrument for another variable that is involved in a causal loop or that is correlated with the disturbance of an outcome variable. These kinds of specifications should come primarily from your knowledge of measurement or substantive issues about causation in a particular research area. Empirical tests could just capitalize on sampling error, especially in small samples.

• Do not forget that there are alternatives to reflective measurement with effect indicators and formative measurement with cause indicators or composite indicators. Some of these alternatives, such as network models or methods based on establishing content validity, refer to neither latent variables nor emergent variables. Measurement models with reactive indicators assume mutual causation between proxies for theoretical variables and their indicators.
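As a concrete illustration of the first point in this list, the R package dagitty can enumerate which effects in a DAG with unmeasured causes are identified; the graph below is hypothetical:

```r
library(dagitty)

# Hypothetical DAG: U is an unobserved confounder of X and Y,
# and Z is a measured cause of X only
g <- dagitty('dag {
  U [latent]
  U -> X
  U -> Y
  Z -> X
  X -> Y
}')

# Which covariate adjustment sets (if any) identify X -> Y?
adjustmentSets(g, exposure = "X", outcome = "Y")
# With U unmeasured, no adjustment set exists, which signals the
# need for proxies of U or for an instrument

# Candidate instruments for the effect of X on Y
instrumentalVariables(g, exposure = "X", outcome = "Y")   # suggests Z
```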




SPECIFICATION

Despite all the statistical machinations in SEM, specification is the most important step, but occasionally researchers spend the least amount of time on it. Listed next are ways to do your homework in this critical area:

• Describe the theoretical framework or body of empirical results that form the basis for specification. Articulate the specific problem addressed in the analysis. Explain why the application of SEM is needed, including why using a simpler statistical technique is not better.

• Define the corresponding theoretical variables in clear terms for models with proxies. For example, avoid vague labels for concepts such as "aggression." Instead, describe the kinds or aspects of aggression that correspond to the target concept, and use more precise names, such as verbal aggression, defensive aggression, dominance aggression, and so on.

• Use multiple-indicator measurement, which is generally better than single-indicator measurement. An exception is when only one among a set of indicators has good psychometric characteristics. Another is when a particular indicator can be viewed as the single best expression of a theoretical concept. In either case, it may be better to rely on a single best indicator.

• Avoid the specification error where the single indicator of an exogenous concept is assumed to have no measurement error, especially if this assumption is known to be false. Instead, estimate the score reliability for a single indicator and specify an error term for that indicator, which isolates its measurement error away from the structural model. An alternative is to specify an instrument for a single indicator, but the instrument must have excellent psychometric characteristics.

• Specify design-driven correlated error terms, such as correlated disturbances in a structural model or correlated errors in a measurement model, if doing so is theoretically justifiable and identification requirements can be satisfied. Omission of such terms can lead to inaccurate results, especially for results based on proxies for theoretical variables. In some disciplines, such as economics, the specification of correlated error terms is routine. Such specification should not be seen as a necessary evil. The flip side of this advice is adding correlated error terms without a basis in theory or study design (e.g., repeated measures), such as to improve model fit with no good reason to expect such effects. Doing so makes the model more complex, which generally improves fit, but at the likely cost of capitalizing on sampling error.

• Realize that it is sometimes appropriate to expect that an indicator depends on two or more common factors, but this specification should come from prior knowledge of that variable. Just like error correlations, the specification that an indicator is multidimensional instead of unidimensional makes a measurement model less parsimonious.




• State the rationale for directionality specifications. This includes both the measurement model and the structural model. For example, is reflective measurement appropriate for describing the directionalities of indicator–proxy relations? Or would the specification of formative measurement make more sense? Or perhaps both (i.e., mixed measurement or MIMIC models)? For the structural model, explain hypotheses about causal priority, especially if your research design has no formal elements, such as time precedence, that support causal inference.

• Provide explicit justification for specification of causal loops in models fitted to data in cross-sectional studies, or an account of why the variables are expected to be both causes and effects of each other and that causal lags are expected to be very short. Also acknowledge the special assumptions of causal loops, such as equilibrium and stationarity, and comment on their plausibility in your study. Avoid specifying mutual causation as a way to mask uncertainty about directionality. For example, does Y1 cause Y2 or vice versa—not sure, why not both (Y1 ⇄ Y2)?

• Be mindful about the consequences of omitting causes that are correlated with other variables in the model. If an omitted cause is unrelated to measured causes, then estimates of direct effects are not biased due to this omission. But it is rare that the types of causes studied by behavioral scientists are truly independent. Depending on the pattern of correlations between measured and unmeasured variables, estimates of direct effects can be too high or too low.

• Respect the parsimony principle: Specify the simplest model possible as your initial model, one that includes the effects of highest priority, given relevant theory. Doing so in a single sample corresponds to model building, where the simplest model in a set of nested models is tested first. But when testing for measurement invariance in multiple-group CFA, it is usually better to analyze the most complex model first. This is the model of configural invariance, which is then made simpler (it is trimmed) by imposing equality constraints on certain parameters, such as factor loadings, intercepts, or error variances.

• Do not be discouraged by previous comments on parsimony, as they are not intended to deter you from analyzing complex models per se. This is because a phenomenon that is complex may require a relatively intricate statistical model in order to capture its basic essence. The main point is that the model should be as simple as possible while still respecting theory and prior empirical results. Models that are complex without justification, or overparameterized, are probably so specified to maximize fit.

IDENTIFICATION

The problem of identification must be dealt with in many, if not most, SEM studies. Some recommendations for managing identification are listed next:

• Explicitly tally the number of observations, free parameters, and model degrees of freedom, dfM, for the initial model. Keep track of dfM for any respecified models; that is, relate changes in model specification to changes in dfM for revised models (see the sketch at the end of this list).

• Scale the common factors or composites properly. In multiple-group SEM, standardizing common factors by fixing their variances to 1.0 is incorrect if groups differ in their variabilities. Fixing the loading for a reference variable to 1.0 (i.e., the factor is unstandardized) is preferable, but note that (1) the same loading must be fixed in each group and (2) indicators with fixed nonzero loadings are assumed to be invariant across all groups. The effects coding method, where average unstandardized loadings or intercepts are fixed to equal, respectively, 1.0 or 0 in all groups, is an alternative for indicators of the same factor that also share the same metric. In single-group analyses, fixing to 1.0 the variances of factors measured over time is also wrong if factor variability is expected to change. In composite SEM, the specification of a dominant indicator orients (determines the sign of) the associated composite, and analyzing the composite in standardized form scales it.

• Comment on sufficient requirements that identify the particular kind of structural equation model you are analyzing. For example, if the structural model is nonrecursive, is the rank condition sufficient to identify it? If the measurement model has indicators that depend on multiple factors or has correlated error terms, does their pattern satisfy sufficient conditions?

• Assure your readers that your model is actually identified if it is an especially complex one. Remember that it is theoretically possible for the computer to generate a converged, admissible solution for a model that is not truly identified, yet give no warning about the problem. Whatever solution is so computed, it is but one of an infinite number of solutions (i.e., it has no meaningful interpretation), if the model is not really identified. If you are uncertain about whether your model is identified, then you are not ready to analyze it, much less report analysis results for that model. See the advice in Chapter 19 about handling identification for nonrecursive models (e.g., start with a core model you know is identified and next build it up).
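As a trivial sketch of the tallying advice in the first point of this list, with hypothetical counts for a five-indicator CFA without a mean structure:

```r
p     <- 5                  # observed indicators
n_obs <- p * (p + 1) / 2    # observations: variances + covariances = 15
q     <- 11                 # free parameters in the specified model
df_M  <- n_obs - q          # model degrees of freedom = 4

# With a mean structure, p observed means are added to the tally:
n_obs_means <- n_obs + p    # 20 observations
```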




MEASURES

Your scores come from your measures, so those measures better be good. Some advice for dealing with the measurement problem in SEM is considered next:

• Explain the operationalizations for your target concepts; that is, establish the links between concept definitions and specific characteristics or behaviors that are to be measured or observed.

• Describe the psychometric characteristics of your measures, including evidence for score reliability (i.e., are they precise?) and validity (e.g., can they really be interpreted as measuring the target concepts?). It is best practice to estimate score reliability in your own sample(s). If it is impossible to do so, report coefficients from other samples (reliability induction), but describe whether those other samples are similar to yours.

• Specify and measure auxiliary variables that may predict the data loss pattern if you anticipate missing data—for example, the design is longitudinal and participants can choose to withdraw from the study at any point. These variables need not be included in the model, but they may be helpful when imputing multiple scores for each missing observation. That is, both the variables in the model and auxiliary variables should be part of the study plan.

• Report evidence about extreme multicollinearity, such as values of the variance inflation factor (VIF), among composite indicators for the same proxy (e.g., Table 16.2). This is especially true when inner weights are specified as PLS Mode B, or regression weights, which are based on all intercorrelations among the indicators for the same composite. Extreme collinearity can adversely affect regression weights.

• Be careful, in small samples, when analyzing parcels—averages or total scores over sets of items—as continuous indicators in CFA. Doing so generates simpler models with fewer indicators compared with item-level analyses (i.e., categorical CFA), where each item is an indicator. It assumes that the items in each parcel are unidimensional, a requirement that should be addressed before analyzing the data in CFA. This is because trying to establish the unidimensionality of parcels in the same analysis is likely to capitalize heavily on chance. If so, then parceling can mask the true multidimensionality and distort the results.

SAMPLE AND DATA

The nature of samples and data is critical in any type of statistical analysis. Emphasized next are issues specific to SEM:

• Use a sample size that is large enough for your model and estimation methods. As models become more complex relative to the number of cases, the statistical precision of the estimates is more doubtful in smaller samples. There is greater capitalization on chance in smaller samples, too. Methods that make fewer distributional assumptions generally require more cases. The analysis of ordinal data may require more cases compared with analyzing continuous data. Convince your readers that the sample size is large enough to do the job. There is no shame in using a simpler statistical technique in a smaller sample (see Chapter 17 for exceptions).

• Describe how the sample size for your study was established. If a target sample size was established in a power analysis, state the minimum level of power (e.g., .90) and describe the level of analysis (i.e., whole model vs. individual parameters; e.g., Tables 10.3, 10.4). State the specific null and alternative hypotheses and other power analysis parameters, such as the level of significance (α) and population values of approximate fit indexes. If target sample size is estimated in precision planning, also called accuracy in parameter estimation, then provide a rationale for the precision parameter, or the margin of error for the population value of the RMSEA or related index. If sample size is fixed by resource limitations (i.e., time or money), then just be honest and say so. This is much better than making up a post hoc rationale for sample size.

• Be sure, if the sample is archival—that is, you are fitting a model within an extant data set—to mention possible specification errors due to omission of relevant causal or outcome variables. Another drawback to archival samples is the realization that the model is not identified. With the data already collected, it may be too late to do anything about identification. Adding exogenous variables is one way to remedy an identification problem for a nonrecursive structural model, and adding indicators can help to identify a measurement model.



• If any data are simulated, state the computer program and algorithm used, the number and sizes of generated samples, and how many generated samples were lost due to nonconvergence, inadmissible solutions, or other problems in the analysis.

• Do not standardize the raw scores in covariance-based SEM (i.e., convert to normal deviates, z), especially if you plan to use an estimation method that assumes unstandardized variables. Situations where standardizing the scores is especially inappropriate include the analysis of a model across independent samples with different variabilities, longitudinal data characterized by changes in variances or means over time, or a type of SEM analysis that requires the analysis of means, such as a latent growth model, which needs the input of not only raw score means but covariances, too. Note that standardized scores are often analyzed in composite SEM, but there are alternatives (Chapter 16).

• Describe how data-related complications were handled. This includes the extent of and strategy for dealing with missing observations or outliers, how extreme multicollinearity was managed, and the use of transformations, if any, to normalize continuous variables. Reporting standards call for transparency in descriptions of modifications to the data before they are analyzed, not just in SEM studies, but in other kinds of empirical studies, too (Appelbaum et al., 2018; International Committee of Medical Journal Editors, 2021).

• Keep in mind that some types of analyses in SEM require the input of raw data, such as when using special methods or estimators that analyze incomplete data files, correct for nonnormality in continuous outcome variables, or analyze ordinal data. In such cases, make the raw data file available so that others can reproduce your analyses. Some journals, such as Psychological Science, encourage authors to submit data files along with other supplemental materials and award open-science badges for complete and transparent reporting about data (Eich, 2014).

• Remember that other kinds of analyses in SEM—such as when raw data files are complete, all outcomes are continuous with normal distributions, and the estimator is default ML—require only the input of summary statistics, including means, correlations, and standard deviations (e.g., Tables 11.2, 12.1, 14.2, 15.2, 15.6, and 16.2; see also the sketch at the end of this list). Report sufficient descriptive statistics so that others can perform a secondary analysis and reproduce your original results. To save space, reliability coefficients or values of standardized skew or kurtosis indexes can be reported in the same place (e.g., Sauvé et al., 2021, p. 240). Verify that the data matrix is positive definite.

• Evaluate distributional assumptions of your estimation method, such as multivariate normality for the default ML method. Report values of standardized skew and kurtosis indexes for continuous outcomes. Also verify that relations between continuous variables are linear. Curvilinear relations are no problem, if the researcher (1) detects them and (2) includes the appropriate power terms in the analysis (e.g., Figures 7.5–7.6) or uses a nonparametric estimation method.

• State clearly the type of data matrix analyzed, which is ordinarily a covariance matrix (and possibly means) for models with continuous outcomes or a matrix of polychoric correlations (along with thresholds and asymptotic covariances) for ordinal indicators. If just a Pearson correlation matrix with no standard deviations is analyzed for continuous data, then use an appropriate estimation method that is intended for analyzing correlation structures.

• If the data are nested, such as repeated measures or collected in complex sampling designs, then explain how nonindependence of the scores was taken into account (e.g., correlated disturbances are specified for repeated measures variables, a two-level model was analyzed).
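As referenced in the point about summary statistics above, here is a minimal lavaan sketch of fitting a model to summary data alone; all numbers and names are hypothetical:

```r
library(lavaan)

vars <- c("x1", "x2", "x3", "x4")
R <- matrix(c(1.00,  .52,  .46,  .38,
               .52, 1.00,  .49,  .41,
               .46,  .49, 1.00,  .35,
               .38,  .41,  .35, 1.00),
            nrow = 4, dimnames = list(vars, vars))
sds   <- c(1.10,  .95, 1.20, 1.05)
means <- c(3.20, 2.90, 3.50, 3.10)

# lavaan utility: convert correlations plus SDs to covariances
covmat <- cor2cov(R, sds)

model <- 'F =~ x1 + x2 + x3 + x4'
fit <- cfa(model, sample.cov = covmat, sample.mean = means,
           sample.nobs = 300, meanstructure = TRUE)
```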




For even greater transparency, post the syntax for the analysis of earlier versions of your final model, too (e.g., Sauvé et al., 2021).

• Post the output file for your final model. If output about the residuals is optional, then request that output so that the reader can understand both global fit and local fit of your model.

• Check your computer syntax carefully, then check it again. Just as in manual data entry, it is easy to make an error in syntax that misspecifies the model, data, or analysis options. Although SEM computer tools have become easier to use, they still cannot detect a mistake that is logical rather than a syntax error. A logical error does not cause the analysis to fail but instead results in an unintended specification; for instance, Y1 → Y2 is specified when Y2 → Y1 is intended. Verify that the model analyzed was actually the one that you intended to specify.

• State the estimation method used, even if it was default ML. If a different estimator is used, then clearly state this method and give your rationale for selecting it, such as robust DWLS for ordinal data. Justify the application of a method for continuous data, such as robust ML, to the analysis of ordinal data (e.g., there are ≥ 6 response categories for Likert scale items and item response histograms are reasonably symmetrical).

• Say whether estimation converged and whether the solution is admissible. Describe any complications, such as failure of iterative estimation or Heywood cases, and how such problems were handled (e.g., increasing the default limit on the number of iterations). Remember that SEM computer programs do not always print warning or error messages for inadmissible solutions, so you must carefully inspect the entire output. Likewise, do not interpret results from a solution that is not admissible, as it is untrustworthy.

• Never retain a model based solely on values of global fit statistics. Specifically, do not uncritically rely on cutting points or thresholds for approximate fit indexes, whether such thresholds are static or dynamic, to justify the retention of the model, especially if that model failed the chi-square test or the endogenous variables are not continuous. Estimate and report the power of the model chi-square test. If a model is retained yet the power of the chi-square test is low, then explicitly acknowledge that the test is unlikely over random samples to detect a false model.

• Whenever possible, conduct local fit assessment; that is, inspect fit at a more molecular level by examining the residuals, including implied conditional independencies based on the d-separation criterion or covariance, standardized, normalized, correlation, mean, or threshold residuals generated in more standard kinds of analyses. Treat significance levels of the residuals just mentioned, such as standardized residuals for covariance residuals, with caution. In smaller samples, such tests can fail to be significant even when the corresponding discrepancy between sample and predicted values is substantial. In very large samples, these significance tests can signal discrepancies that may be considered trivial in magnitude.
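In lavaan, for example, several of the residual types just described can be requested from a fitted model object (here fit, a placeholder name):

# Hedged sketch: local fit assessment in lavaan for a fitted model `fit`
lavResiduals(fit)                    # residuals plus summary statistics
residuals(fit, type = "cor.bollen")  # correlation residuals
residuals(fit, type = "raw")         # raw covariance (and mean) residuals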
• State the decision rule(s) used to select one model over another. Report the results of the chi-square difference test for relevant comparisons of nested models. If you compare the relative fits of alternative-but-not-nested models with predictive fit indexes, such as the AIC or BIC, do not forget that the particular rank order indicated by the statistic is subject to sampling error; that is, the model preferred by the index may not be the true model in the population. The amount of this sampling error in model selection also increases along with the sample size instead of getting smaller. These problems explain why replication is a gold standard in science, not statistical prediction about the model that is most likely to replicate in hypothetical future studies.
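For nested lavaan models, the chi-square difference test can be requested as follows (object names are placeholders); the same function applies an appropriate scaled difference method when a robust estimator was used:

# Hedged sketch: model comparison for nested models in lavaan
fit0 <- sem(model.unconstrained, data = mydata)
fit1 <- sem(model.constrained, data = mydata)
lavTestLRT(fit1, fit0)     # likelihood ratio (chi-square difference) test
AIC(fit0); BIC(fit0)       # predictive fit indexes for non-nested choices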
• Remember that equality constraints for the same parameter, such as cross-group equality constraints in multiple-group analyses, usually apply in the unstandardized solution only. It is expected that values of the standardized estimates for the same parameter will be different across the groups. Keep in mind also that values of standardized parameter estimates are, in general, not directly comparable across groups.

• Assuming the model is identified, watch out for empirical underidentification, which can occur due to data-related problems such as extreme multicollinearity or estimates of key parameters that are close to zero, close to their absolute maximum values, or nearly equal to each other. Measurement models where some common factors have just two indicators or nonrecursive structural models with causal loops may be especially susceptible to empirical underidentification. Respecification of a model when the data are the problem may lead to a specification error.
• Comparing group means on observed variables is more complicated than simply applying the standard t test for independent samples. Doing so requires that scores in both groups measure the same conceptual variable as approximated by common factors or by composites. For reflective measurement models, direct comparison of group mean differences on observed variables requires strict invariance, or the assumption of equal factor loadings, intercepts, and error variances and covariances over groups; otherwise, appreciable differences in the parameters just mentioned can confound real (i.e., latent) group differences on the observed variables.

• Establish strong measurement invariance in order to meaningfully compare group means on common factors. This is because group differences in factor loadings or intercepts say that the indicators do not measure the common factors in the same way across the groups. Formal comparison of groups on factor variances or covariances requires only weak invariance, or cross-group equality of the unstandardized factor loadings.
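In lavaan, successively stricter invariance models can be sketched as follows (the model, data, and grouping names are placeholders):

# Hedged sketch: invariance models via cross-group equality constraints
configural <- cfa(cfa.model, data = mydata, group = "country")
weak   <- cfa(cfa.model, data = mydata, group = "country",
              group.equal = c("loadings"))
strong <- cfa(cfa.model, data = mydata, group = "country",
              group.equal = c("loadings", "intercepts"))
strict <- cfa(cfa.model, data = mydata, group = "country",
              group.equal = c("loadings", "intercepts", "residuals"))
lavTestLRT(configural, weak, strong, strict)   # nested comparisons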
RESPECIFICATION

Except when working in a strictly confirmatory mode, respecification is part of most SEM analyses. It is crucial to get right the things considered next:

• Explain the theoretical basis for respecifying a model; that is, how are the changes justified? Indicate the particular statistics, such as correlation residuals, standardized residuals, or modification indexes, consulted in respecification and how the values relate to theory (see the sketch after this list).

• Differentiate plainly between results from a priori specifications versus those found after fitting the model and otherwise examining the data. A specification search guided entirely by statistical criteria is unlikely to lead to the true model. Use your knowledge of theory and empirical findings to inform the use of such statistics.

• State clearly the nature and number of respecifications, such as how many free parameters were added or dropped, and which ones.

• Remember that if the final model is quite different from your initial model, you need to reassure your readers that its specification was not merely the result of chasing sampling error. If there is no such rationale, the model may be overparameterized (good fit is achieved at the cost of too many parameters), and results from such models are unlikely to replicate. It is usually better to retain no model in this case. This is a perfectly acceptable outcome if there is no theoretically defensible respecification that leads to satisfactory model–data correspondence.
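In lavaan, modification indexes and expected parameter changes for a fitted model fit (a placeholder name) can be listed like this:

# Hedged sketch: respecification statistics in lavaan
modindices(fit, sort. = TRUE, minimum.value = 3.84)
# Columns mi (modification index) and epc / sepc.all (unstandardized and
# standardized expected parameter change) inform theory-guided changes.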
TABULATION

At the conclusion of the analysis, you must organize the statistical results so that they can be reported. Here are some suggestions for doing so in a clear and thorough way:

• Report the parameter estimates for your model (if a model is retained). This includes the unstandardized estimates, their standard errors, and the standardized estimates. It is an error to report just the standardized solution. Explain how the standardized solution was derived both in single-group analyses (e.g., all variables standardized vs. common factors only) and in multiple-group analyses (e.g., within-group standardization vs. common metric standardization).

• Do not indicate anything about statistical significance for the standardized parameter estimates unless their standard errors are also included in computer output. This is because p values are typically different for unstandardized parameter estimates and their standardized counterparts.

• Report information about the residuals, either in text, in a table, in an appendix, or as part of supplemental materials. Show your readers the details of fit. Just reporting values of global fit statistics is inadequate, and doing so can give the impression that the researcher is trying to hide something. As a reviewer, I never accept the failure to describe the residuals for analyses where residuals are available in the output. Sometimes when provided, those residuals do not suggest a problem with local fit. But I have seen many examples where the residuals indicate grossly poor fit for certain pairs of measured variables even though values of global fit statistics are not problematic (i.e., the model should not be retained). Just as in regression analysis, the residuals matter in SEM, too.

• Always report the model chi-square (or comparable global fit statistic) and its degrees of freedom and p value in analyses where the computer generates these values. If the model fails the chi-square test, then explicitly state this result and tentatively reject the model. Do not hem and haw, or avoid making a definite statement here. For example, do not falsely claim that the chi-square test is somehow "biased" by large sample size or that its value is always inflated by sample size (true for false models only). Reporting the model chi-square without its p value comes across as an attempt to hide bad news (i.e., the model failed the exact-fit test).

• Report, if possible, the values of a minimal set of approximate fit indexes that include the RMSEA and its 90% confidence interval, CFI, and SRMR; see the sketch after this list. The failure to report the RMSEA confidence interval can be seen as an attempt to hide an unfavorable result about the upper bound of that interval. If the confidence interval based on the RMSEA is not available in a particular type of analysis (check the documentation for your computer tool), then say so. Do not blindly refer to thresholds for approximate fit statistics that supposedly indicate "good" fit (they don't work for all models or data). Avoid selective reporting of the values of just those fit statistics that favor your model. Explain the specification of the baseline model in your computer tool for the CFI, which is a comparative approximate fit index.

• Report information for individual outcome variables about the explanatory power of their presumed direct causes, such as R² for continuous endogenous variables in recursive models or blocked-error R² for outcomes involved in nonrecursive relations. Remember that R² in computer output for an ordinal indicator in categorical CFA applies to the corresponding latent response variable, and thus not directly to that indicator. Also keep in mind that R² for individual outcomes has basically nothing to do with global model fit. Interpret effect sizes (e.g., unstandardized or standardized path coefficients, R²) in reference to results expected in a particular research area.
INTERPRETATION

Issues in the interpretation of SEM results for various kinds of effects and models are considered next:

• Do not automatically interpret "closer fit" as "closer to truth." Close model–data correspondence could reflect any of the following (not all mutually exclusive) possibilities: the model (1) accurately reflects reality; (2) is an equivalent or near-equivalent version of the one that corresponds to reality but itself is incorrect; (3) fits the data in a nonrepresentative sample (i.e., yours) but has poor fit in the population; or (4) has so many freely estimated parameters that it can hardly have poor fit even if it were grossly misspecified. In a single study, it is usually impossible to determine which one of these scenarios explains the acceptable fit of the researcher's model. If the analysis is never replicated, then we will never know. This is another way of saying that SEM is more useful for rejecting a false model than for somehow "confirming" whether a given model is actually true, especially without replication. For the same reasons, close fit to the data does not "prove" the directionality specifications (causal effects) represented in the model.

• Do not confuse statistical significance with effect size or whether results are clinically, theoretically, or practically significant. Be careful not to commit one of many kinds of cognitive errors about statistical significance (e.g., the false belief that "significant" results are real and not due to chance). Do not be dazzled by asterisks (i.e., statistical significance), for they do not light the path to truth in SEM—nor in any other kind of statistical analysis.

• Do not automatically refer to estimates of indirect effects in structural models as indicating "mediation" in cross-sectional designs where all variables are concurrently measured. This is because mediation is a strong causal hypothesis that generally requires time precedence in measurement among putative cause, mediator, and outcome variables. The only alternative in a cross-sectional design is a strong argument about conceptual timing that would convincingly rule out equivalent models where some direct effects are reversed among the presumed cause, mediator, or outcome in the original model. If the causal variable is experimental but the mediator is an individual difference variable, be especially wary that omitted common causes of the mediator and the outcome could bias the results.

• Do not commit the naming fallacy, or the false belief that naming a common factor or composite means that the corresponding theoretical concept is understood. Factor or composite names are not explanations. For example, if a three-factor CFA model fits the data, this does not prove that the names assigned
to the factors by the researcher are correct. Alternative explanations of factors are often possible in many, if not most, factor analyses.

SUMMARY

So concludes this journey of discovery about SEM. As on any guided tour, you may have found some places along the way more interesting or relevant than others. You may decide to revisit certain places by using particular techniques in your own work. In any event, I hope that reading this book has given you new ways of looking at your data and hypotheses. Use SEM in all its guises and forms to address good questions and to provide new perspectives on older ones, but use it directed by good sense and strong domain knowledge. Use it also as a way to reform methods of data analysis by focusing more on models instead of specific effects analyzed with traditional significance tests. And thank you for your attention, hard work, dedication, and most of all for including me as part of your journey. It has been a pleasure. As Garrison Keillor says at the conclusion of The Writer's Almanac, the daily podcast about poetry, literature, and history: Be well, do good work, and keep in touch.

LEARN MORE

McCoach et al. (2007) outline types of inference errors in SEM, Tomarken and Waller (2005) survey common misunderstandings, and Tu (2009) addresses the use of SEM in epidemiology and reminds us of its limitations.

McCoach, D. B., Black, A. C., & O'Connell, A. A. (2007). Errors of inference in structural equation modeling. Psychology in the Schools, 44(5), 461–470.

Tomarken, A. J., & Waller, N. G. (2005). Structural equation modeling: Strengths, limitations, and misconceptions. Annual Review of Clinical Psychology, 1(1), 31–65.

Tu, Y.-K. (2009). Commentary: Is structural equation modelling a step forward for epidemiologists? International Journal of Epidemiology, 38(2), 549–551.
SUGGESTED ANSWERS TO EXERCISES

CHAPTER 4

1. There is some slight rounding error in these calculations working from standard deviations (see Table 4.1):

   sX² = 6.2048² = 38.4995
   sW² = 14.5774² = 212.5006
   sY² = 4.6904² = 21.9999
   covXW = .4699(6.2048)(14.5774) = 42.5024
   covXY = .6013(6.2048)(4.6904) = 17.4996
   covWY = .7496(14.5774)(4.6904) = 51.2530

2. Given sX² = 12.00 and sY² = 10.00, the covariance and the out-of-bounds correlation are

   covXY = rXY √(12.00 × 10.00) = rXY (10.9545) = 13.00
   rXY = 13.00/10.9545 = 1.19

3. The full covariance matrix computed with pairwise deletion for variables X, W, and Y, respectively, is

    86.4      15.9     –26.3333
    15.9       5.2     –10.6667
   –26.3333  –10.6667   10.0

   I used the R code in Topic Box 4.1 to input the covariance matrix just listed and generate the results described next. The eigenvalues are (98.229, 7.042, –3.671), and the determinant is –2,539.380. The matrix is clearly NPD. The correlation matrix implied by the covariance matrix for these data is presented next in lower-diagonal form with an out-of-bounds entry for the correlation between W and Y:

   1.0
    .7501   1.0
   –.8959  –1.4792   1.0
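Topic Box 4.1 is not reproduced here, but a minimal base-R version of the checks in Exercise 3 could look like this (the object name S is mine):

# Hedged sketch of the NPD checks; values are the pairwise-deletion
# covariances from Exercise 3.
S <- matrix(c( 86.4000,  15.9000, -26.3333,
               15.9000,   5.2000, -10.6667,
              -26.3333, -10.6667,  10.0000),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("X","W","Y"), c("X","W","Y")))
eigen(S)$values   # a negative eigenvalue signals an NPD matrix
det(S)            # the determinant is negative for these data
cov2cor(S)        # implied correlations; |r| > 1.0 is out of bounds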
4. Listed next is the syntax for R that reads in the data as an array and generates the box plot with values for the requested parts of the diagram or data:

   scores <- rep(c(10,11,12,13,14,15,16,17,27),
                 times = c(5,15,14,13,5,5,4,1,1))
   boxplot(scores, horizontal = TRUE, frame = FALSE)
   a <- boxplot.stats(scores)$stats
   b <- boxplot.stats(scores)$out
   values <- append(a,b)
   text(x = values, labels = values, y = 1.25)

   The box plot generated in R is displayed next:

   [Box plot omitted: a horizontal box plot of the scores with the values 10, 11, 12, 13.5, 17, and 27 labeled, over an axis running from 10 to 25.]

   The lowest and highest scores that are not outliers are 10 and 17; the median is 12; the hinges are H1 = 11 and H2 = 13.5; H2 – H1 = 2.5; and the outlier is 27, which exceeds

   13.5 + 1.5(2.5) = 20.75

   Note that the computation of hinges for box plots can vary over software, so don't worry if your values for H1 and H2 are not exactly the same as those just listed if you used a different computer tool.

5. In a normal distribution, there is zero skewness and kurtosis, so the distribution is clearly not normal. These results fall just short of a rule of thumb for severe nonnormality (i.e., < 2.0 and < 7.0 for, respectively, absolute skewness and kurtosis), but that rule is not universal.

6. Base R 4.2.2 had no native functions for skewness or kurtosis, so I used lessR (Gerbing, 2022). For the data in Exercise 4, G1 = 3.13 and G2 = 15.93. Before applying a transformation, add the constant –9 so that the lowest score is 1.0. For a square root transformation, G1 = 1.29 and G2 = 4.35. Even greater reduction in nonnormality is afforded by the transformation ln X, for which G1 = 0 and G2 = .60.
CHAPTER 6

1. The modified DAG for Figure 6.1(e) that depicts residual confounding is presented next:

   [DAG omitted: unmeasured confounder UC and its proxy P above the path X → Y.]

   The modified graph implies four paths between X and Y, three of which are biasing paths:

   X ← P → Y
   X ← UC → P → Y
   X ← UC → Y
   X → Y

   Regressing Y on X and P will close the first and second biasing paths just listed, but the third biasing path that involves only the unmeasured confounder is not entirely closed. Thus, the coefficient for X will be biased by the amount of residual confounding due to P not perfectly measuring all of UC.

2. Both confounders and instruments directly affect causal variables, but instruments do not directly influence the outcome, whereas confounding variables do. Proxies for unmeasured confounders or measured confounding variables are usually treated as covariates in standard regression analysis, but instruments are analyzed in special instrumental variable regression techniques, such as 2SLS.

3. The graph presented next

   Y → A → X

   implies the same conditional independence, or X ⊥ Y | A, as in Figures 6.2(a) and 6.2(b), so all three graphs are d-separation equivalent.

4. These definitions are from Elwert and Winship (2014) for a cause X and its outcome Y: Confounding bias results from the failure to condition on a
common cause of X and Y; overcontrol bias results from conditioning on a variable (or its descendant) along a causal pathway between X and Y; and collider bias is due to conditioning on a collider (or its descendant) on a noncausal pathway between X and Y.

5. Listed next are sets of conditional independencies implied by the graphs in Figures 6.3(a)–6.3(c):

   (a) E ⊥ S1 | A; E ⊥ S2 | A
   (b) E ⊥ A
   (c) E ⊥ S1 | A; S1 ⊥ S2 | A, E

6. The DAG shown next

   [DAG omitted: it depicts E → A → S2, direct effects of E and S1 on S2, and a common latent cause of E and S1, as described next.]

   represents the hypotheses that (a) E, A, and S1 are all direct causes of S2; (b) E indirectly affects S2 through A; and (c) E and S1 have a common latent cause. In this graph, estimating the total effect of E by regressing S2 on both E and S1 (but not also A) closes the only biasing path, or

   E ↔ S1 → S2

7. In the DAG presented next, where C (a covariate) is a direct cause of only X (exposure),

   C → X → Y

   regressing Y on both X and C will close the biasing path

   X ← C ↔ Y

   The second graph is

   X → Y ← C

   where C is a direct cause of only Y. Regressing Y on both X and C will close the biasing path

   X ↔ C → Y

8. In Figure 6.5(a), there are three back-door paths between D and Y, including

   D ← B → Y
   D ← A → X → Y
   D ← A → X → E → Y

   There are no colliders along any biasing path. The adjustment set (A, B) would block all three paths just listed. The set (A, B) is also minimally sufficient because conditioning on either member of this pair by itself would not close all three paths. The set (B, X) is also sufficient because conditioning on both covariates would also close all three back-door paths. It is also minimally sufficient because both covariates are required to close all biasing paths. There are larger adjustment sets that identify the total effect, such as (A, B, X), but they are not minimally sufficient.

9. The sets (B, X) and (D, X) each d-separate variables E and Y in a modified version of Figure 6.5(a), where the direct effect from E to Y is deleted. There are larger sets of variables that d-separate E and Y, such as (B, D, X), but they are not minimally sufficient. There are no smaller sets (i.e., < 2 variables) that d-separate variables E and Y. Thus, (B, X) and (D, X) are the minimally sufficient adjustment sets that each identify the direct effect of E on Y.
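Results like those in Exercise 8 can be checked with the R package dagitty. The graph below is my reconstruction from only the back-door paths listed in that exercise; the full Figure 6.5(a) contains additional structure that Exercise 9 depends on, so only the total-effect query is sketched here:

# Hedged sketch: the DAG encodes only the edges implied by the back-door
# paths listed in Exercise 8, plus the exposure-outcome path D -> Y.
library(dagitty)
g <- dagitty("dag {
  A -> D ; A -> X ; B -> D ; B -> Y
  X -> E ; X -> Y ; E -> Y ; D -> Y
}")
# Minimally sufficient adjustment sets for the total effect of D on Y;
# for this reconstruction the output lists { A, B } and { B, X }.
adjustmentSets(g, exposure = "D", outcome = "Y", type = "minimal")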

CHAPTER 7

1. There are 2(3)/2 = 3 observations for Figure 7.2(a) but 4 free parameters, including 2 variances (of X and the disturbance for Y), 1 covariance (between X and the disturbance), and 1 coefficient for the effect of X on Y, so dfM = 3 – 4 = –1.
2. In the alternative version of Figure 7.2(b) presented next

   [Diagram omitted: P covaries with X, and both P and X have direct effects on Y.]

   variable X covaries with P while P is still a cause of Y. Because X and P are specified as correlated causes, the overlap between them is controlled for when the coefficient for the effect of X on Y is computed. The model just presented and Figure 7.2(b) are equivalent because both will perfectly fit the data (i.e., dfM = 0 for both models).

3. For Figure 7.3(a), it is assumed that exogenous variables X and W covary but for reasons unknown, their scores are perfectly reliable (rXX = rWW = 1.0), and there is zero interaction between them in their strictly linear effects on Y. Also, directionality is correctly specified (X → Y, W → Y), and all unmeasured causes of Y are independent of both X and W.

4. With v = 4 variables in Figure 7.4(a), there are 4(5)/2, or 10 observations. Free parameters include 4 variances (of X1, X2, and two error terms), 1 covariance (of X1 and X2), and 5 coefficients for direct effects for a total of 10, so dfM = 10 – 10 = 0. There is a single directed (causal) path between the two endogenous variables, or Y1 → Y2, and three undirected (back-door, biasing) paths for spurious associations that involve common causes, or

   Y1 ← X1 → Y2   Y1 ← X2 → Y2   and   Y1 ← X1 ↔ X2 → Y2

   Even if Y1 → Y2 = 0, the two variables should covary due to expected noncausal associations.

5. The correlation is practically zero, or rXY = –.047, but the association is mainly quadratic. When Y is regressed on both X and X², R = .927. The equation is

   Ŷ = –.052X + .083X² + 7.355

   which defines the quadratic regression line plotted next:

   [Plot omitted: Y against X from –5 to 5 with the fitted quadratic regression curve.]

6. For variable W, MW = 16.375 and SDW = 6.022, so the three target values are W = 10.353, 16.375, 22.387. When Y is regressed on X and W, the multiple correlation is R = .183, but when Y is regressed on X, W, and XW, the multiple correlation is R = .910. The unstandardized equation is

   Ŷ = 1.768X + .734W – .108XW – 3.118

   For every 1-point increase in W, the slope of the linear regression of Y on X decreases by .108. By entering the three different values for W into the unstandardized regression equation, we generate equations for the three simple linear regressions of Y on X listed next:

   W = 10.353, Ŷ = .650X + 4.481
   W = 16.375, Ŷ = –.001X + 8.901
   W = 22.387, Ŷ = –.650X + 13.314

   Your plot of the simple regression lines just listed should resemble Figure 7.9 for the same data.
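For Exercise 6, the moderated regression and simple slopes can also be scripted in base R; the data frame d and its column names are my assumptions, not objects from the book:

# Hedged sketch: moderated regression and simple slopes in base R.
fit <- lm(Y ~ X * W, data = d)           # includes the XW product term
b <- coef(fit)                           # "(Intercept)", "X", "W", "X:W"
W.targets <- c(10.353, 16.375, 22.387)   # M_W - SD, M_W, M_W + SD
cbind(W = W.targets,
      slope     = b["X"] + b["X:W"] * W.targets,
      intercept = b["(Intercept)"] + b["W"] * W.targets)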

CHAPTER 8

1. Given the p values in Table 8.2 at three-decimal accuracy,

   ln(.260) + ln(.455) + ln(.087) + ln(.118) + ln(.048) = –9.750
   C = –2(–9.750) = 19.500
which is similar to the value calculated by the computer for these data, C = 19.521, but at greater than 3-decimal accuracy as in these hand calculations.

2. In Table 8.3, a 1-point increase in the raw score metric of the hardy variable predicts a decrease of .203 points in the raw score metric of the stress variable. The significance test is z = –.203/.045 = –4.51, p < .01.

3. In Table 8.3, a 1-point increase in stress predicts an increase of .574 points in illness, controlling for fitness. The significance test is z = .574/.089 = 6.45, p < .01.

4. In Table 8.4, fitness and stress explain .177 of the total variance for illness, so the standardized error variance is 1 – .177, or .823. The observed variance is 3,903.75, so the unstandardized error variance is .823(3,903.75), or 3,212.786.

5. In Figure 8.1(b), for every increase in exercise of 1 standard deviation, fitness is expected to increase by .390 standard deviations; and stress is expected to decrease by .230 standard deviations, given an increase of 1 standard deviation in hardy.

6. In Figure 8.1(a), the unstandardized indirect effect of hardy on illness through stress is estimated as –.203(.574), or –.117 at three-decimal accuracy. This result indicates that illness is expected to decrease by .117 points in its original metric while holding hardy constant but decreasing stress to whatever value it would attain under a 1-point increase in the original metric of hardy. The standardized estimate is –.230(.308), or –.071, so illness should decrease by .071 standard deviations while keeping hardy constant and decreasing stress by the value it would attain, given an increase in hardy of a full standard deviation.

7. In Table 8.3 for unstandardized variables, a = .108, SEa = .013 for the effect of exercise on fitness and b = –.849, SEb = .162 for the effect of fitness on illness controlling for stress. The value of the Sobel approximate standard error for ab = –.092 is calculated as

   SEab = √[.108²(.162²) + (–.849)²(.013²)] = .021

8. In Table 8.5 for the indirect effect of exercise on illness through fitness, the Sobel test result for the product estimator is z = –.092/.021 = –4.38, p < .01; the test result for the estimator based on hardy as the covariate is z = –.080/.048 = –1.67, p = .096; and the test result for the estimator based on stress as the covariate is z = –.059/.046 = –1.28, p = .200.
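The hand calculation in Exercise 7 is easily scripted; a base-R sketch using the estimates given above follows:

# Hedged sketch: Sobel approximate standard error and z test in base R.
a <- .108;  se.a <- .013     # exercise -> fitness
b <- -.849; se.b <- .162     # fitness -> illness, controlling for stress
ab <- a * b                  # product estimator of the indirect effect
se.ab <- sqrt(a^2 * se.b^2 + b^2 * se.a^2)
c(ab = ab, se = se.ab, z = ab / se.ab)
# ab is about -.092 and SE about .021, as in the answer; the z from these
# unrounded values (-4.43) differs slightly from -.092/.021 = -4.38.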

CHAPTER 9

1. In Table 9.2, the quantity 1 – R2 = .840 = 3,212.567/σˆ 2I Fitness ← Exercise Hardy → Stress
where σˆ 2I is the predicted variance for illness. Thus,
The product of the standardized coefficients (Table
σˆ 2I = 3,212.567/.840 = 3,824.485 9.2) for this tracing is
which is very close to the value in lavaan output .390 (–.030) (–.230) = .003
from the fitted covariance matrix for this variable,
so the predicted correlation is .003. The sample
or 3,824.102 (see the output file for analysis 1, Table
correlation is –.130 (Table 4.3), so the correlation
9.1).
residual equals (–.130 –.003), or –.133.

2. There is a single noncausal path between fitness and


stress in the covariance structure of Figure 9.1; it is
3. For these data, MX = 11.00, sX² = 38.50, MY = 25.00, and sY² = 22.00.

   a. Ŷ = .455X + 20.000, R² = .3616.
   b. Ŷ = .455X + 20.000 UNIT; the coefficient for the constant is the intercept for the regression of Y on X in the standard analysis (3a).
   c. X̂ = 11.000 UNIT; the coefficient for the constant is the mean of X.
   d. Disturbance variance = (1 – .3616)(22.000) = 14.054:

   [Path diagram omitted: the constant (UNIT) with coefficients 11.000 to X and 20.000 to Y; the path X → Y with coefficient .455; the variance of X, 38.500; and the disturbance variance for Y, 14.054.]

   Predicted mean for Y = 20.000 + 11.000(.455) = 25.000. The mean structure has no degrees of freedom, so the predicted and observed means for Y are equal.

4. The intercept in Figure 9.1 for regressing illness on fitness and stress is 114.874, which is the predicted score on illness in its raw score metric when the scores on both fitness and stress equal zero.

5. The means of all variables are zero in the Std.all solution, so all means and intercepts in the mean structure of Figure 9.1 would equal zero in this standardized solution.

6. In Figure 9.1, the coefficient for the direct pathway from the constant to illness is 114.874. There are two parents of illness, fitness and stress. Earlier the predicted mean for fitness was calculated as 67.10. The predicted mean for stress equals the sum of the intercept for its regression on hardy, 24.00, and the product of the mean of its parent, hardy, and the regression coefficient, or 0(–.203) = 0; that is, the predicted mean for stress is 24.00. The unstandardized regression coefficients for regressing illness on fitness and stress are, respectively, –.849 and .574, so the predicted mean for illness is

   114.874 – .849(67.10) + .574(24.00) = 71.68

   which equals the sample mean for illness, 71.67, within slight rounding error (Table 4.3).
CHAPTER 10
1. The result chiML = 0 means that the model has perfect fit, but ê = 0 says only that chiML ≤ dfM.

2. In Table 10.2, ln L0 = –9,429.689, ln L1 = –9,424.135, so

   –2(–9,429.689) + 2(–9,424.135) = 11.108

   which matches within slight rounding error chiML(5) = 11.107 for this analysis.

3. In Table 10.2, N = 373, chiML(5) = 11.107, and chiB(9) = 165.608, so

   RMSEA = √[(11.107 − 5)/(5(372))] = .057
   CFI = 1 – (11.107 − 5)/(165.608 − 9) = .961

4. In lavaan, free parameters of the baseline model include the variances and covariances of the exogenous variables, exercise and hardy, and the variances of the endogenous variables, fitness, stress, and illness, which have no covariances with any other variable, for a total of 6 free parameters. The number of observations is 5(6)/2 = 15, so dfB = 15 – 6 = 9. The lavaan syntax listed next specifies the baseline model just described. By lavaan default, the variances of the exogenous variables are specified as free model parameters:
   roth.baseline <- '
   exercise ~~ hardy
   fitness ~~ fitness
   stress ~~ stress
   illness ~~ illness '

   Fitting the baseline model just specified to the data in Table 4.3 in default ML generates chiB(9) = 165.608.

5. In Table 10.3 at N = 373, if the model does not have close fit in the population, the probability of correctly rejecting in significance testing the close-fit hypothesis is .317. The minimum sample size required for power ≥ .90 is N = 1,997.

CHAPTER 11

1. The scaled chi-square difference statistic is calculated for these results as follows:

   dfD = 17 – 12 = 5; chiD(5) = 57.50 – 18.10 = 39.40
   c1 = 57.50/28.35 = 2.028; c2 = 18.10/11.55 = 1.567
   chiSD(5) = 39.40/{[2.028(17) − 1.567(12)]/5} = 39.40/3.134 = 12.57, p = .028
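A base-R sketch of the hand calculation in Exercise 1 follows; with fitted lavaan objects, lavTestLRT() applies a comparable scaled difference method automatically:

# Hedged sketch: Satorra-Bentler-type scaled chi-square difference.
chi1 <- 57.50; df1 <- 17; scaled1 <- 28.35   # more constrained model
chi0 <- 18.10; df0 <- 12; scaled0 <- 11.55   # less constrained model
c1 <- chi1/scaled1; c0 <- chi0/scaled0       # scaling correction factors
dfD <- df1 - df0
chiSD <- (chi1 - chi0) / ((c1*df1 - c0*df0)/dfD)
chiSD                                        # about 12.57
pchisq(chiSD, dfD, lower.tail = FALSE)       # p = .028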
2. The model chi-square is predicted to decrease by MI = 3.907, which would be a significant reduction at the .05 level (p = .048). The actual reduction is chiD(1) = 3.972. The expected unstandardized parameter change (from zero) by freely estimating the disturbance covariance between fitness and stress is –56.529, and the estimated standardized parameter change for the same respecification is –.102. These estimates suggest that stress and fitness share at least one common unmeasured cause that acts by increasing fitness while reducing stress.

3. Syntax for the respecified path model with a direct effect from fitness to stress in lavaan is listed next:

   roth2.model <- '
   fitness ~ exercise
   stress ~ fitness + hardy
   illness ~ fitness + stress '

   The respecified model passes the chi-square test, chiML(4) = 5.937, p = .204, and the reduction in the model chi-square from the original model is significant at the .05 level, or

   chiD(1) = 11.107 – 5.937 = 5.170, p = .03

   Values of approximate fit indexes are RMSEA = .036, 90% CI [0, .092], CFI = .988, SRMR = .034. The two largest absolute correlation residuals are –.095 for hardy and illness and .082 for hardy and fitness. The standardized residual for hardy and illness (–1.963) is just significant at the .05 level. Although the respecified model has better global fit than the original model, there remain concerns at the level of local fit.

4. With 5 observed variables in Figure 11.1(b), there are 5(6)/2 = 15 observations. Free parameters include 5 variances (of illness symptoms, neurological dysfunction, and disturbances for diminished SES, low morale, and poor relationships), 1 covariance (between illness symptoms and neurological dysfunction), and 6 direct effects for a total of 12, so dfM = 15 – 12 = 3.

5. In Table 11.4, for the psychosomatic model,

   AIC = –2(–8,572.844) + 2(10) = 17,165.688
   AIC2 = 40.488 + 2(10) = 60.488
   BIC = –2(–8,572.844) + 10(ln 469) = 17,207.194
   BIC2 = 40.488 + 10(ln 469) = 101.994

   and for the conventional medical model,

   AIC = –2(–8,554.222) + 2(12) = 17,132.444
   AIC2 = 3.245 + 2(12) = 27.245
   BIC = –2(–8,554.222) + 12(ln 469) = 17,182.251
   BIC2 = 3.245 + 12(ln 469) = 77.052
which match the results in Table 11.4 within slight rounding error.

6. The conventional medical model in Figure 11.1(b) passes the chi-square test, chiML(3) = 3.245, p = .355, and results for approximate fit indexes do not signal a gross global fit problem: RMSEA = .013, 90% CI [0, .080], CFI = .999, and SRMR = .016. Absolute correlation residuals range from 0 to .041, and there are no significant standardized residuals, so local fit is also not grossly problematic. Remember that these results do not confirm the correctness of the original model, and there are equivalent versions, too.
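The information criteria in Exercise 5 are also easily scripted; a base-R sketch using the loglikelihoods and numbers of free parameters (q) from Table 11.4 follows:

# Hedged sketch: AIC and BIC from loglikelihoods, N = 469.
ic <- function(logLik, q, N) c(AIC = -2*logLik + 2*q,
                               BIC = -2*logLik + q*log(N))
ic(-8572.844, q = 10, N = 469)   # psychosomatic model
ic(-8554.222, q = 12, N = 469)   # conventional medical model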

CHAPTER 12

1. In Table 12.6, for every 1-point increase in achievement in its original metric, there is in the Black sample a decrease of .493 points in delinquency, controlling for SES, effort, and verbal IQ. The significance test for this result is z = –.493/.138 = –3.572, p < .01. In the White sample, the comparable decrease is .087 points, and z = –.087/.124 = –.701, p = .483.

2. For the Black sample in Table 12.6, the R² values for achievement and delinquency are, respectively, 1 – .697 = .303 and 1 – .855 = .145. The corresponding values in the White sample for achievement and delinquency are, respectively, 1 – .668 = .332 and 1 – .913 = .087. Because R² is a standardized effect size, its values are not directly comparable over groups with different variances and covariances (see Table 12.1). Instead, the unstandardized disturbance variances in Table 12.6 are directly comparable over groups. For example, there is somewhat more unexplained variance for delinquency in the Black sample (2.306) than in the White sample (1.881).

3. In Table 12.7, the ratios of unstandardized indirect effects to their standard errors are all < 1.0 in absolute value in the White sample. In the Black sample, z = –.016/.005 = –3.20, so the unstandardized indirect effect of verbal IQ on delinquency through achievement is statistically significant at the .01 level. The other two unstandardized indirect effects are not significant in the Black sample.

4. In Table 12.7, the unstandardized indirect effect of verbal IQ on delinquency through achievement in the Black sample is –.016. Thus, while holding verbal IQ constant and increasing achievement to the level it would attain under a 1-point increase in verbal IQ, the level of delinquency is expected to decline by .016 points in its raw score metric.

CHAPTER 14

1. For v = 3 indicators, the number of observations is 3(4)/2, or 6. Free parameters include 1 factor variance, 3 indicator error variances, and 2 factor loadings (the third is fixed to 1.0 for the reference variable) for a total of 6. Thus, dfM = 6 – 6 = 0 (the model is just-identified).

2. For v = 2 indicators, there are 2(3)/2, or 3 observations. Free parameters include 1 factor variance, 2 indicator error variances, and 1 factor loading (the other is fixed to 1.0) for a total of 4, so dfM = 3 – 4 = –1 (the model is underidentified).

3. In Figure 14.2(b), there are 6(7)/2 = 21 observations and, assuming the variances for both factors are fixed to 1.0, free parameters include 6 factor loadings, 6 indicator error variances, and 1 factor covariance for a total of 13, so dfM = 21 – 13 = 8.
4. In Figure 14.2(c), the ECI constraint for indicators X4–X6 of factor B is

   (d + e + f)/3 = 1.0, which is equivalent to 3 – d – e – f = 0

5. In Figure 14.3, the number of observations is 8(9)/2 = 36, and model free parameters include 10 variances (of 2 factors and 8 indicator error terms), 6 factor loadings (2 are fixed to 1.0 as scaling constants), and 1 factor covariance for a total of 17, so dfM = 36 – 17 = 19.

6. In a word, awful. To elaborate, absolute correlation residuals range from .003 to .397, and the worst results are for the pairs of indicators listed next:

   Hand Movements, Word Order, .101
   Number Recall, Word Order, .397
   Number Recall, Gestalt Closure, –.130

   A total of 9 standardized residuals are significant at .05, including those for all 3 pairs of indicators just listed. Not a pretty sight (i.e., local fit is poor).

7. The estimated factor correlation is .557 (Table 14.3). Structure coefficients computed with slight rounding error are shown next:

   Simultaneous indicators:
   HM  .497(.557) = .277
   NR  .807(.557) = .449
   WO  .808(.557) = .450

   Sequential indicators:
   GC  .503(.557) = .280
   TR  .726(.557) = .404
   SM  .656(.557) = .365
   MA  .588(.557) = .328
   PS  .782(.557) = .436

8. The sample variance for Hand Movements is s² = 3.40², or 11.560 (Table 14.2). In lavaan, the variance for this indicator for N = 200 is rescaled as S² = (199/200)(11.560), or 11.502. The standardized loading for this indicator is .497 (Table 14.3), so R² = .497², or .247. The standardized common (explained) variance is .247(11.502), or 2.841. This result is within the bounds of expected rounding error from the estimated variance of the sequential factor, 2.838 (Table 14.3).

CHAPTER 15

1. For v = 9 indicators in Figure 15.1(b), there are 9(10)/2 = 45 observations. Free parameters include 12 variances (1 for exogenous factor A, 2 for the disturbances of endogenous factors B and C, and 9 for indicator error terms), 6 factor loadings for indicators that are not reference variables (2 per factor), and 2 direct effects between the factors for a total of 20. Thus, dfM = 45 – 20 = 25.

2. For v = 11 indicators in Figure 15.3, the number of observations is 11(12)/2 = 66. Free parameters include 15 variances (of 2 exogenous factors, 2 disturbances for endogenous factors, and 11 indicator error terms), 1 covariance between exogenous factors, and 4 direct effects between factors for a subtotal of 20. There is one unit loading identity (ULI) constraint per factor, so the number of free factor loadings is 3(2) + 1, or 7, across all four factors. Thus, the total number of free parameters for the whole model is 20 + 7 = 27, so dfM = 66 – 27 = 39.

3. Given the standardized path coefficients in Table 15.5, for every increase in cognitive ability of a full standard deviation, the level of classroom adjustment is expected to increase by .431 standard deviations, while controlling for risk. A decrease of .151 standard deviations in adjustment is predicted, given an increase in risk of a full standard deviation controlling for achievement.
4. In Table 15.5, the estimated proportions of explained variation for the endogenous factors are the complements of their standardized disturbances, or R² = 1 – .498 = .502 for the achievement factor and R² = 1 – .732 = .268 for the adjustment factor.

5. In Figure 15.6, v = 7 and the number of observations is 7(8)/2 = 28. Free parameters include a total of 5 loadings, 6 error variances, and 2 covariances for indicators of the cognitive factor. The 3 other free parameters are the variance of the cognitive factor, the disturbance variance for the unawareness factor, and the direct effect of cognitive level on unawareness. The total number of free parameters is 16, so dfM = 28 – 16 = 12.

6. For Figure 15.6, I used semTools to estimate the power of the chi-square test for N = 193, dfM = 12, e0 = 0, e1 = .05, and a = .05, and the result is .284. Thus, the likelihood of detecting a model that does not perfectly fit the population data matrix over random samples is just a bit over .28. The target sample size for power to equal or exceed .90 is N = 729, or nearly 4 times larger than the actual sample size—see the output file for analysis 4 in Table 15.1.
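The semTools calls behind Exercise 6 are not shown in the answer; a sketch consistent with the values just given might be:

# Hedged sketch: RMSEA-based power utilities in semTools.
library(semTools)
findRMSEApower(rmsea0 = 0, rmseaA = .05, df = 12, n = 193)  # about .284
findRMSEAsamplesize(rmsea0 = 0, rmseaA = .05, df = 12,
                    power = .90)                            # N = 729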
7. The lavaan syntax for a model where SUMD is regressed directly on the cognitive factor (i.e., treated as a single indicator in a path model with no separate error term) is listed here:

   sauveSR2.model <- '
   Cognitive =~ isl + islr + gml + ocl + cpal + gmr
   isl ~~ islr
   gml ~~ gmr
   sumd ~ Cognitive '

   The global fit of the model just specified to the data in Table 15.6 is identical to that of Figure 15.6; that is, the two models are equivalent (e.g., chiML(12) = 9.885 for both models). Their estimates differ only for the symptom unawareness outcome:

   Parameter                                           Unst.    St.
   Figure 15.6 (separate error and disturbance terms)
     Cognitive → Unaware                               –.105   –.254
     Disturbance variance                              1.060    .936
   No separate error term
     Cognitive → SUMD                                  –.105   –.221
     Disturbance variance                              1.420    .951

   Note. Unst., unstandardized; St., standardized.

   When an outcome is measured with error but that error is uncontrolled (i.e., the second set of results listed in the table just presented), standardized coefficients tend to be too small (i.e., –.221 vs. –.254 when error is controlled), but unstandardized coefficients are unaffected (i.e., –.105 in both models). Disturbance variance both standardized and unstandardized is greater in the analysis where measurement error is not removed from the disturbance for the same outcome.

CHAPTER 16

1. Given Equation 16.2, common variance for 4 elements is counted 4 times among the total variances and 12 times among their covariances, or 2 × 4(3)/2 = 12, for a total of 16 times. In contrast, unique variance is counted only 4 times within the total variances. Thus, the relative contribution of common variance to unique variance is 16:4, or 4:1.

2. Fixing the error variance for the single indicator to a constant and its unstandardized loading to 1.0 identifies the measurement part of the model. The structural part of Figure 16.1 is recursive, so it, too, is identified. Because the measurement and structural parts of the model are both identified, the whole model is identified (Rule 15.1).
3. For v = 8 indicators in Figure 16.1, there are 8(9)/2 = 36 observations. Free parameters include 11 variances (2 for exogenous factors [acculturation, SES], 2 for disturbances of endogenous factors [stress, depression], and 7 indicator measurement errors [the error variance for SCL90D is fixed to equal a constant]); 2 covariances (1 between the pair of exogenous factors and 1 between a pair of indicators); 4 factor loadings for indicators that are not reference variables; and 3 direct effects on endogenous factors for a total of 20. Thus, dfM = 36 – 20 = 16.

4. The OLS estimate for the standardized indirect effect of acculturation on depression through stress in analysis 4 for Figure 16.3 is .059 with a standard error of .014. In the bootstrap percentile method, the 95% confidence interval is [.034, .093]. The ML estimate in HO specification for the same effect is .064 with a standard error of .017. Thus, the expected increase in depression is about 6% of a standard deviation while keeping acculturation constant and increasing stress to whatever level it would attain under an increase in acculturation of one standard deviation.
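In lavaan, bootstrapped confidence intervals for an indirect effect like the one in Exercise 4 can be requested along these lines; the model, labels, and data object here are illustrative, not the book's analysis files:

# Hedged sketch: labeled paths define the indirect effect, and
# bootstrapping yields percentile confidence intervals.
library(lavaan)
model <- '
  Stress ~ a*Acculturation
  Depression ~ b*Stress + c*Acculturation
  ab := a*b                   # indirect effect through stress
'
fit <- sem(model, data = mydata, se = "bootstrap", bootstrap = 2000)
parameterEstimates(fit, boot.ci.type = "perc", standardized = TRUE)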

CHAPTER 17

1. The number of observations for Figure 17.1 is 5(6)/2 = 15. Free parameters include 2 factor loadings for the mother and father–mother indicators, 2 factor variances and 1 factor covariance, and 5 indicator error terms for a total of 10. Thus, dfM = 15 – 10 = 5.

2. Fixing the unstandardized loadings for both indicators of marital adjustment, problems and intimacy, to 1.0 assumes their metrics are the same, which is not true in these data, so it is no surprise that the reanalysis in lavaan with the constraints just mentioned generates a Heywood case (the error variance for intimacy is –37.158).

3. The MIIV-2SLS estimate of the factor covariance in Table 17.3 is 157.495. Variances of the marital adjustment and family-of-origin experiences factors estimated in this method are, respectively, 702.393 and 158.501, so the estimated factor correlation is

   157.495/√(702.393 × 158.501) = .472

4. In the output for analysis 1, Table 17.1, sample SRMR = .045 with a standard error of .014. The p value for the exact-fit test is .196. The unbiased estimate for population SRMR (designated as "usrmr" in the output) is .032 with the 90% confidence interval [–.013, .076]. In the close-fit test of the null hypothesis where population SRMR = .05, the p value is .749. Neither the exact-fit test nor the close-fit test is failed in this analysis, but power is probably low. The normal deviate for the Bentler-type correlation residual of –.112 for the problems and father indicators, or z = –2.421, is significant at the .05 level.

CHAPTER 18

1. The threshold t2 = .253 in Figure 18.1(b) is the value of the normal deviate that corresponds to the 60th percentile in a normal distribution. It marks that point on the continuous variable X* where the responses on X shift from "2" for neutral to "3" for agree.

2. The lavaan syntax to specify the model in Figure 18.2, but where the depression common factor is scaled in the effects coding method, is listed here:

   radloff.model <- '
   Depression =~ NA*x1 + a*x1 + b*x2 + c*x3 + d*x4 + e*x5
   5 - a - b - c - d - e == 0 '
Estimates for the unstandardized loadings of X1* through X5* are, respectively,

   .889, .951, 1.142, .892, and 1.126

   which average to 1.0. The estimated unstandardized variance of the depression factor in the effects coding method is .469, which as expected is different from the estimate of .370 in the reference variable method where X1 is the reference variable (Table 18.2). Values of all other parameter estimates, global fit statistics, and correlation residuals are identical across the two methods to scale the common factor.

3. The lavaan-generated thresholds for item X1 are .772, 1.420, and 1.874 (Table 18.2). Using a normal curve calculator or table, the cumulative proportions that correspond to these normal deviate values are, respectively, .7799, .9222, and .9695. The cumulative proportions of responses for X1 over the four categories coded as (0, 1, 2, 3) in the data are, respectively, .7799, .9222, .9696, and 1.0. (The last value just listed is not a threshold.) These observed proportions match those generated by the lavaan thresholds within slight rounding error.
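The conversion in Exercise 3 is a one-liner in base R:

# Hedged sketch: thresholds to cumulative proportions and back.
tau <- c(.772, 1.420, 1.874)     # lavaan thresholds for item X1
pnorm(tau)                       # .7799 .9222 .9695, as in the answer
qnorm(c(.7799, .9222, .9696))    # recovers the thresholds, within rounding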

CHAPTER 19

1. In Figure 19.3(b), a total of 3 variables are excluded from the equation for Y1, or (X2, X3, Y2); there are 2 excluded variables for Y2, or (X1, X3); and 4 variables are excluded for Y3, or (X1, X2, Y1, Y2). The required minimum number of excluded variables for each equation is 3 – 1 = 2, so the model meets the order condition.

2. We can exclude Y3 from the system matrix for Figure 19.4(a) because it is recursively related to all other variables in the model. Evaluation for Y1:

         X1  X2  Y1  Y2
   ► Y1   1   0   1   1
     Y2   1   0   1   1   →   [0]   →   Rank = 0

   After crossing out all entries in the row for Y1, the columns for X1, X2, and Y2 are also deleted because there are 1s in the corresponding row for Y1. What remains is a matrix with a single element that equals zero, which is then deleted. Because no rows remain in the final reduced matrix, the rank is zero, so the equation for Y1 is underidentified. The same conclusion is reached for Y2: The rank of the final reduced system matrix for Y2 is zero, so its equation is underidentified, too. Thus, Figure 19.4(a) fails the rank condition.

3. Figure 19.4(b) has a direct feedback loop with no disturbance covariance and a single instrument (X for Y1) and matches Figure 19.2(b) using the block classification method. Thus, Figure 19.4(b) meets the minimum requirement for identifying the parameters of its causal loop.

4. Figure 19.4(b) fails the order condition for a minimum of 2 – 1 = 1 excluded variable for each of two endogenous variables; specifically, there are zero omitted variables for Y1. The rank condition is also failed for Y1:

         X  Y1  Y2
   ► Y1  1   1   1
     Y2  0   1   1   →   Rank = 0

   After crossing out the first row for Y1 and then deleting all columns with 1s in the row for Y1, the reduced system matrix is empty, so rank = 0. The minimum rank under the rank condition is 2 – 1 = 1. The rank of the system matrix for Y2 is also zero.

5. If Figure 19.4(b) is respecified so that the disturbance covariance is a free parameter, an instrument is required for Y2. That instrument should by theory have a direct effect on Y2, but not also on Y1. For
example, adding a second exogenous variable X2 and the path X2 → Y2 would identify the parameters of the direct feedback loop, including the disturbance covariance (see Figure 19.3(a)).

6. Given v = 6 indicators, there are 6(7)/2 = 21 observations in the analysis of Figure 19.5. There are a total of 4 variances of exogenous variables (1 for monarchy, 1 for the information technology common factor, 2 for the disturbances of trust and institutional quality); 2 covariances between pairs of exogenous variables (monarchy, technology and the disturbances of trust and quality); 4 direct effects in the structural model; 2 loadings in the measurement model; and 4 error terms for indicators of common factors for a total of 16 free parameters. Thus, dfM = 21 – 16 = 5.

CHAPTER 20

1. In Table 20.2, the product estimator for the unstandardized indirect effect is –.028, which says that for every 1-point increase in academic values in its original metric, the level of deviant behavior is expected to decrease by .028 points in its raw score metric through the effect of academic values on deviance tolerance. The standardized indirect effect is .291(–.479), or –.140, which says that deviant behavior is predicted to decrease by .14 standard deviations, given an increase in academic values of a full standard deviation transmitted through deviance tolerance.

2. Given the results in Table 20.2:

   P̂M = –.028/–.038 = .733
   R̂M = –.028/–.010 = 2.744
   υ̂ = (–.140)² = .019

   The observed unstandardized indirect effect of achievement values on deviant behavior through deviance tolerance corresponds to 73.3% of the total effect and is 2.744 times larger in magnitude than the direct effect. The proportion of variance in deviant behavior jointly explained by achievement values and deviance tolerance while controlling for spurious associations is .019, or 1.9%.

3. In Figure 20.3(a), there are 5(6)/2 = 15 observations and a total of 14 free parameters, including 5 variances (of 3 exogenous variables and of 2 disturbances), 4 covariances (3 among exogenous variables and 1 disturbance covariance), and 5 direct effects, so dfM = 1. The disturbance covariance is in a bow-free pattern, so the model is identified. Adding a direct effect from X at time 1 to Y at time 2 would result in a just-identified model (dfM = 0) with perfect fit to the data.

4. In Figure 20.3(a), no direct causal effect of X on Y is represented in the model. That is, variable X covaries with Y at time 1 and has an association with Y at time 2 through two noncausal paths, or

   X1 ↔ Y1 → Y2
   X1 ↔ M1 → Y2

   but the indirect effect of X at time 1 on Y at time 2 can be estimated by the product ab for a linear model of continuous variables and no interactions.

5. The product a2b2 in Equation 20.18 represents all effects of X on Y that are mediated by M2, but not including any variable affected by M2. The path-specific indirect effects are listed next:

   X → M2 → Y
   X → M1 → M2 → Y
   X → M3 → M2 → Y
   X → M1 → M3 → M2 → Y
   X → M3 → M1 → M2 → Y
CHAPTER 21

1. For all models in Figure 21.2, v = 4 and the num- For example, l4 = 1.0 for Figure 21.2(b) says that all
ber of observations, including means, is 4(7)/2, or growth has occurred by age 17 years, but l4 = 1.475
14. For the random intercept-only model (Figure for Figure 21.2(c) says that the level at age 17 years
21.2(a)), free parameters include the factor mean is the initial level plus 1.475 times the increase
(1) and variance (1) and the error variances of the from ages 14–15 years. The mean of .817 for Fig-
repeated measures (4), so df M = 14 – 6 = 8. Free ure 21.2(b) is the average increase over ages 14–17
parameters in the basis growth model (Figure years, but the mean of .554 for Figure 21.2(c) is the
21.2(b)) include the factor means (2), variances (2), average increase over ages 14–15 years. All results
and covariance (1); 2 freely estimated basis coef- just listed reflect nothing more than just different
ficients; and 4 error variances for the repeated ways to scale Shape in a basis growth model.
measures, so df M = 14 – 11 = 3. The linear change
model (Figure 21.2(d)) has a total of 9 free param- 4. Specification of the polynomial growth model in
eters, including factor means (2), variances (2), and Figure 21.3 in lavaan syntax is listed next:
covariance (1) plus 4 error variances for repeated
kimspoonQuadratic.model <- ‘
measures, so df M = 14 – 9 = 5.
Intercept =~ 1*R1 + 1*R2 + 1*R3 + 1*R4
Linear =~ 0*R1 + 1*R2 + 2*R3 + 3*R4
2. In Figure 21.3, the number of observations is 14. Quadratic =~ 0*R1 + 1*R2 + 4*R3 + 9*R4
Free parameters include 7 variances (3 latent R1 ~~ 0*R1 ‘
growth factors and 4 indicator errors) and 3 covari- There are three Heywood cases: In the unstandard-
ances and 3 means (for latent growth factors) for a ized solution, the variances of Linear (–.045) and
total of 13, so df M = 1. Quadratic (–.038) are both negative, and in the
standardized solution, the estimated correlation
3. Specification in lavaan syntax of Figure 21.2(c) is between Linear and Quadratic is 1.60, or > 1.0 in
listed next: absolute value. The solution is clearly inadmissible.
kimspoonBasis2.model <- ‘
Intercept =~ 1*R1 + 1*R2 + 1*R3 + 1*R4 5. In Figure 21.4, the mean of the Intercept factor is
Shape =~ 0*R1 + 1*R2 + R3 + R4
R1 ~~ 0*R1 ‘ µ I = a I + β IGµG + β IFµF

Estimates for the indicators and Intercept are iden-


tical for Figures 21.2(b) and 21.2(c), and the two 6. In Figure 21.4, the predicted mean of A4 is
models have identical global fit, predicted means, µ A 4 = a I (1.0) + a L (3.0) + β IGµG (1.0) + β LFµF (3.0)
and residuals. The two models differ only in esti- + β IFµF (1.0) + β LGµG (3.0)
7. In Table 21.8, the mean level of alcohol use at age 13 years for young women (coded gender = 1) is .011 higher than the mean for young men (coded gender = 0) at the same age, while controlling for family status. This difference is not statistically significant at the .05 level (z = .011/.105 = .10).

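The z statistic in Exercise 7 is just the ratio of the estimate to its standard error, and the corresponding two-tailed p value follows from the standard normal distribution; in R:

z <- .011 / .105                       # about .10
2 * pnorm(abs(z), lower.tail = FALSE)  # two-tailed p, about .92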



CHAPTER 22

SUGGESTED ANSWERS TO EXERCISES


1. For 5 indicators in a single-group analysis of Figure 22.1, there are 5(8)/2, or 20 observations (5 variances, 10 covariances, 5 means). Effects coding identification constraints reduce the number of free loadings and intercepts by 1 each per factor. Thus, the mean structure has 5 parameters, including 2 factor means and a total of 3 intercepts (2 for Line Meaning, 1 for Real World). Because the mean structure is just-identified (i.e., 5 observations [indicator means], 5 parameters), all predicted means will equal their sample counterparts. Parameters for the covariance structure include 3 loadings (2 for Line Meaning, 1 for Real World), 7 variances (of 5 indicators and 2 factors), and 1 factor covariance, for a total of 11. The total number of free parameters for the whole model is 5 + 11 = 16, so df_M = 20 – 16 = 4.

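Effects coding identification can be requested in lavaan with the effect.coding option, which constrains the loadings of each factor to average 1.0 and the indicator intercepts to average 0 (hence 1 fewer free loading and intercept per factor, as noted above). A sketch with placeholder model object and data frame names:

library(lavaan)

fit.single <- cfa(figure22.model, data = ingroup.data,
                  meanstructure = TRUE, effect.coding = TRUE)
fitMeasures(fit.single, c("chisq", "df"))  # df_M should equal 4
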
2. Direct comparison of factor means requires at least partial strong invariance, where most unstandardized intercepts and loadings are constrained to equality over groups. There are no cross-group equality constraints in the single-group analyses of Figure 22.1.

3. The error variances of the five indicators in Figure 22.1 can be automatically constrained to equality over groups by adding this argument to the lavaan syntax for the partial strong invariance model:

group.equal = c("residuals")

The global fit of this model is poor: χ²ML(17) = 84.622, p < .001; RMSEA = .113, 90% CI [.090, .138]; CFI = .943; SRMR = .074. The residuals also suggest poor local fit in the Chinese sample (e.g., two correlation residuals exceed .20).

4. The variances of the two factors in Figure 22.1 can be constrained to equality by adding the lavaan argument

group.equal = c("lv.variances")

to the syntax for the partial strong invariance model. The fit of the model so constrained is poor: χ²ML(14) = 170.920, p < .001; RMSEA = .190, 90% CI [.165, .217]; CFI = .867; SRMR = .074. Both the correlation residuals and the standardized residuals are bad news (i.e., they indicate poor fit).

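A sketch of the full calls for Exercises 3 and 4 follows; the model object, data frame, grouping variable, and the freed intercept listed under group.partial are all placeholder names:

library(lavaan)

# Exercise 3: partial strong invariance plus equal indicator error variances
fit.resid <- cfa(figure22.model, data = ingroup.data, group = "culture",
                 group.equal = c("loadings", "intercepts", "residuals"),
                 group.partial = c("lm3~1"))  # hypothetical freed parameter

# Exercise 4: partial strong invariance plus equal factor variances
fit.lv <- cfa(figure22.model, data = ingroup.data, group = "culture",
              group.equal = c("loadings", "intercepts", "lv.variances"),
              group.partial = c("lm3~1"))

fitMeasures(fit.resid, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))
lavResiduals(fit.resid)  # correlation and standardized residuals, per group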

References

Abelson, R. P. (1997). A retrospective on the significance fication. IEEE Transactions on Automatic Control, 19(6),
test ban of 1999 (If there were no significance tests, they 716–723.
would be invented). In L. L. Harlow, S. A. Mulaik, & J. H. Allison, P. (2015, March 5). Imputation by predictive mean
Steiger (Eds.), What if there were no significance tests? matching: Promise & peril. Statistical Horizons. https://
(pp. 117–141). Erlbaum. statisticalhorizons.com/predictive-mean-matching
Abt, M., & Welch, W. J. (1998). Fisher information and max- Allison, P. D. (2012). Handling missing data by maximum
imum-likelihood estimation of covariance parameters in likelihood (Paper 312-2012). Paper presented at the SAS
Gaussian stochastic processes. Canadian Journal of Sta- Global Forum 2012, Orlando, FL. https://support.sas.
tistics, 26(1), 127–137. com/resources/papers/proceedings12/312-2012.pdf
Abu-Bader, S. H. (2010). Advanced and multivariate statisti- Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing con-
cal methods for social science research with a complete tinuous predictors in multiple regression: A bad idea. Sta-
SPSS guide. Lyceum Books. tistics in Medicine, 25(1), 127–141.
Acharya, A., Blackwell, M., & Sen, M. (2016). Explaining Amador, X. F., Strauss, D. H., Yale, S. A., Flaum, M. M., End-
causal findings without bias: Detecting and assessing icott, J., & Gorman, J. M. (1993). Assessment of insight
direct effects. American Political Science Review, 110(3), in psychosis. American Journal of Psychiatry, 150(6),
512–529. 873–879.
Acock, A. A. (2013). Discovering structural equation model- Amemiya, Y., & Yalcin, I. (2001). Nonlinear factor analysis
ing using Stata (Rev. ed.). Stata Press. as a statistical method. Statistical Science, 16(3), 275–294.
Aguinis, H., & Gottfredson, R. K. (2010). Best-practice rec- Amrhein, A., Greenland, S., & McShane, B. (2019). Retire
ommendations for estimating interaction effects using statistical significance. Nature, 567, 305–307.
moderated multiple regression. Journal of Organizational Ananth, C. V., & Brandt, J. S. (2022). A principled approach
Behavior, 31(6), 776–786. to mediation analysis in perinatal epidemiology. American
Aguinis, H., Werner, S., Abbott, J. L., Angert, C., Park, J. Journal of Obstetrics and Gynecology, 226(1), 24–32.e6.
H., & Kohlhausen, D. (2010). Customer-centric science: Andersen, H. K. (2022). A closer look at random and fixed
Reporting significant research results with rigor, rel- effects panel regression in structural equation modeling
evance, and practical impact in mind. Organizational using lavaan. Structural Equation Modeling, 29(3), 476–
Research Methods, 13(3), 515–539. 486.
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral Anderson, J. C., & Gerbing, D. W. (1988). Structural equation
training in statistics, measurement, and methodology in modeling in practice: A review and recommended two-
psychology: Replication and extension of Aiken, West, step approach. Psychological Bulletin, 103(3) 411–423.
Sechrest, and Reno’s (1990) survey of PhD programs in Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sam-
North America. American Psychologist, 63(1), 32–50. ple-size planning for more accurate statistical power: A
Akaike, H. (1974). A new look at the statistical model identi- method adjusting sample effect sizes for publication bias

441

RefsKline5E.indd 441 3/22/2023 4:43:42 PM


442 References

and uncertainty. Psychological Science, 28(11), 1547– Asparouhov, T., & Muthén, B. (2020). IRT in Mplus. http://
1562. www.statmodel.com/download/MplusIRT.pdf
Andrews, R. M., & Didelez, V. (2021). Insights into the cross- Asparouhov, T., & Muthén, B. (2021). Multiple imputation
world independence assumption of causal mediation anal- with Mplus. https://www.statmodel.com/download/
ysis. Epidemiology, 32(2), 209–219. Imputations7.pdf
Angrist, J. D., & Krueger, A. B. (2001). Instrumental vari- Audigier, V., Husson, F., & Josse, J. (2017). MIMCA: Mul-
ables and the search for identification: From supply and tiple imputation for categorical variables with multiple
demand to natural experiments. Journal of Economic Per- correspondence analysis. Statistics and Computing, 27(2),
spectives, 15(4), 69–85. 501–518.
Antonakis, J. (2017). On doing better science: From thrill of Bagozzi, R. P. (2011). Measurement and meaning in informa-
discovery to policy implications. Leadership Quarterly, tion systems and organizational research: Methodologi-
28(1), 5–21. cal and philosophical foundations. MIS Quarterly, 35(2),
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). 261–292.
On making causal claims: A review and recommenda- Bagozzi, R. P., & Phillips, L. W. (1982). Representing and
tions. Leadership Quarterly, 21(6), 1086–1120. testing organizational theories: The holistic construal.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2014). Administrative Science Quarterly, 27(3), 459–489.
Causality and endogeneity: Problems and solutions. In Bagozzi, R. P., & Yi, Y. (2012). Specification, evaluation, and
D. V. Day (Ed.), The Oxford handbook of leadership and interpretation of structural equation models. Journal of
organizations (pp. 93–117). Oxford University Press. the Academy of Marketing Science, 40(1), 8–34.
APA Publications and Communications Board Working Baiocchi, M., Cheng, J., & Small, D. S. (2014). Instrumental
Group on Journal Article Reporting Standards. (2008). variable methods for causal inference. Statistics in Medi-
Reporting standards for research in psychology: Why do cine, 33(13), 2297–2340.
we need them? What might they be? American Psycholo- Bandalos, D. L., & Leite, W. (2013). Use of Monte Carlo stud-
gist, 63, 839–851. ies in structural equation modeling. In G. R. Hancock & R.
Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., O. Mueller (Eds.), Structural equation modeling: A second
Nezu, A. M., & Rao, S. M. (2018). Journal article report- course (2nd ed., pp. 625–666). IAP.
ing standards for quantitative research in psychology: The Baron, R., & Kenny, D. A. (1986). The moderator–media-
APA Publications and Communications Board Task Force tor variable distinction in social psychological research:
report. American Psychologist, 73(1), 3–25. ­Conceptual, strategic, and statistical considerations. Jour-
Arbuckle, J. L. (2021). IBM SPSS Amos 28 user’s guide. nal of Personality and Social Psychology, 51(6), 1173–
Amos Development Corporation. 1182.
Arbuckle, J. L., & Wothke, W. (1999). Amos 4.0 user’s guide. Barrett, P. (2007). Structural equation modelling: Adjudging
Smallwaters. model fit. Personality and Individual Differences, 42(5),
Asparouhov, T., & Muthén, B. (2005, November 14–16). 815–824.
Multivariate statistical modeling with survey data [Paper Bartholomew, D. J. (2002). Old and new approaches to latent
presentation]. Federal Committee on Statistical Meth- variable modeling. In G. A. Marcoulides & I. Moustaki
odology (FCSM) Research Conference, Washington, (Eds.), Latent variable and latent structure models
DC, United States. https://www.fcsm.gov/assets/files/ (pp. 1–13). Erlbaum.
docs/2005FCSM_ Asparouhov_Muthen_IIA.pdf Bartlett, M. S. (1950). Tests of significance in factor analysis.
Asparouhov, T., & Muthén, B. (2009). Exploratory structural British Journal of Statistical Psychology, 3(2), 77–85.
equation modeling. Structural Equation Modeling, 16(3), Bauer, D. J., Howard, A. L., Baldasaro, R. E., Curran, P. J.,
397–438. Hussong, A. M., Chassin, L., & Zucker, R. A. (2013). A tri-
Asparouhov, T., & Muthén, B. (2010). Simple second order factor model for integrating ratings across multiple infor-
chi-square correction. https://www.statmodel.com/ mants. Psychological Methods, 18(4), 475–493.
download/WLSMV_new_chi21.pdf Beauducel, A., & Wittman, W. (2005). Simulation study on
Asparouhov, T., & Muthén, B. (2013). Computing the fit indices in confirmatory factor analysis based on data
strictly positive Satorra-Bentler chi-square test in Mplus with slightly distorted simple structure. Structural Equa-
(Mplus Web Notes No. 12). https://www.statmodel.com/ tion Modeling, 12(1), 41–75.
examples/webnotes/SB5.pdf Beaujean, A. A. (2014). Latent variable modeling using R: A
Asparouhov, T., & Muthén, B. (2016). Structural equation step-by-step guide. Routledge.
models and mixture models with continuous nonnormal Becker, J.-M., Rai, A., & Rigdon, E. E. (2013). Predictive
skewed distributions. Structural Equation Modeling, validity and formative measurement in structural equation
23(1), 1–19. modeling: Embracing practical relevance. In Proceedings
Asparouhov, T., & Muthén, B. (2019). Nesting and equiva- of the 34th International Conference on Information Sys-
lence testing for structural equation models. Structural tems. Association for Information Systems. https://aisel.
Equation Modeling, 26(2), 302–309. aisnet.org/icis2013/

RefsKline5E.indd 442 3/22/2023 4:43:42 PM


References 443

Benitez, J., Henseler, J., Castillo, A., & Schuberth, F. (2020). Stouthamer–Loeber (1993) interpretation. Journal of
How to perform and report an impactful analysis using Abnormal Psychology, 104(2), 395–398.
partial least squares: Guidelines for confirmatory and Blozis, S. A., & Cho, Y. (2008). Coding and centering of time
explanatory IS research. Information & Management, in latent curve models in the presence of interindividual
57(2), Article 103168. time heterogeneity. Structural Equation Modeling, 15(3),
Bentler, P. M. (1980). Multivariate analysis with latent vari- 413–433.
ables: Causal modeling. Annual Review of Psychology, Blum, M. G. B., Valeri, L., Olivier, F., Cadiou, S., Siroux,
31(1), 419–456. V., Lepeule, J., & Slama, R. (2020). Challenges raised by
Bentler, P. M. (1990). Comparative fit indexes in structural mediation analysis in a high-dimension setting. Environ-
models. Psychological Bulletin, 107(2), 238–246. mental Health Perspectives, 128(5), Article 055001.
Bentler, P. M. (2000). Rites, wrongs, and gold in model test- Blunch, N. J. (2016). Introduction to structural equation
ing. Structural Equation Modeling, 7(1), 82–91. modeling using IBM SPSS Statistics and EQS. Sage.
Bentler, P. M. (2006). EQS 6 structural equations program Boker, S. M., Neale, M. C., Maes, H. H., Wilde, M. J.,
manual. Multivariate Software. Spiegel, M., Brick, T. R., Estabrook, R., & Bates, T. C.
Bentler, P. M. (2010). SEM with simplicity and accuracy. (2022). OpenMx: Extended structural equation modelling
Journal of Consumer Psychology, 20(2), 215–220. (R package 2.20.6). https://CRAN.R-project.org/
Bentler, P. M. (2014). On components, latent variables, PLS package=OpenMx
and simple methods: Reactions to Rigdon’s rethinking of Bollen, K. A. (1987). Total, direct, and indirect effects in
PLS. Long Range Planning, 47(3), 138–145. structural equation models. Sociological Methodology, 17,
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and 37–69.
goodness of fit in the analysis of covariance structures. Bollen, K. A. (1989). Structural equations with latent vari-
Psychological Bulletin, 88(3) 588–606. ables. Wiley.
Bentler, P. M., & Huang, W. (2014). On components, latent Bollen, K. A. (1996). An alternative two stage least squares
variables, PLS and simple methods: Reactions to Rigdon’s (2SLS) estimator for latent variable equations. Psy-
rethinking of PLS. Long Range Planning, 47(3), 138–145. chometrika, 61(1), 109–121.
Bentler, P. M., & Raykov, T. (2000). On measures of explained Bollen, K. A. (2000). Modeling strategies: In search of the
variance in nonrecursive structural equation models. Jour- Holy Grail. Structural Equation Modeling, 7(1), 74–81.
nal of Applied Psychology, 85(1), 125–131. Bollen, K. A. (2002). Latent variables in psychology and
Bentler, P. M., & Satorra, A. (2010). Testing model nesting the social sciences. Annual Review of Psychology, 53(1),
and equivalence. Psychological Methods, 15(2), 111–123. 605–634.
Bentler, P. M., & Wu, E. J. C. (2008). EQS 6.1 for Windows: Bollen, K. A. (2012). Instrumental variables in sociology and
User’s guide. Multivariate Software. the social sciences. Annual Review of Sociology, 38(1),
Bentler, P. M., & Wu, E. J. C. (2020). EQS 6.4 for Windows 37–72.
[Computer software]. Multivariate Software. https:// Bollen, K. A. (2019). Model implied instrumental variables
mvsoft.com/ (MIIVs): An alternative orientation to structural equa-
Bentler, P., Bagozzi, R., & Cudeck, R. (2001). SEM using tion modeling. Multivariate Behavioral Research, 54(1),
correlation or covariance matrices. Journal of Consumer 31–46.
Psychology, 10(1), 85–87. Bollen, K. A., & Bauldry, S. (2011). Three Cs in measure-
Beran, R., & Srivastava, M. S. (1985). Bootstrap tests and ment models: Causal indicators, composite indicators, and
confidence regions for functions of a covariance matrix. covariates. Psychological Methods, 16(3), 265–284.
Annals of Statistics, 13(1), 95–115. Bollen, K. A., & Curran, P. J. (2004). Autoregressive latent
Berkson, J. (1946). Limitations of the application of fourfold trajectory (ALT) models a synthesis of two traditions.
table analysis to hospital data. Biometrics Bulletin, 2(3), Sociological Methods & Research, 32(3), 336–383.
47–53. Bollen, K. A., & Davis, W. R. (2009). Causal indicator mod-
Bernstein, I. H., & Teng, G. (1989). Factoring items and fac- els: Identification, estimation, and testing. Structural
toring scales are different: Spurious evidence for multi- Equation Modeling, 16(3), 498–522.
dimensionality due to item categorization. Psychological Bollen, K. A., & Diamantopoulos, A. (2017). In defense of
Bulletin, 105(3), 467–477. causal-formative indicators: A minority report. Psycho-
Berry, W. D. (1984). Nonrecursive causal models. Sage. logical Methods, 22(3), 581–596.
Bishop, J., Geiser, C., & Cole, D. A. (2015). Modeling latent Bollen, K. A., Fisher, Z. F., Giordano, M. L., Lilly, A. G., Luo,
growth with multiple indicators: A comparison of three L., & Ye, A. (2022). An introduction to model implied
approaches. Psychological Methods, 20(1), 43–62. instrumental variables using two stage least squares
Blalock, H. M. (1961). Correlation and causality: The multi- (MIIV-2SLS) in structural equation models (SEMs). Psy-
variate case. Social Forces, 39(3), 246–251. chological Methods, 27(5), 752–772.
Block, J. (1995). On the relation between IQ, impulsivity, Bollen, K. A., Fisher, Z., Lilly, A., Brehm, C., Luo, L., Marti-
and delinquency: Remarks on the Lynam, Moffitt, and nez, A., & Ye, A. (2022). Fifty years of structural equation

RefsKline5E.indd 443 3/22/2023 4:43:42 PM


444 References

modeling: A history of generalization, unification, and dif- Brandmaier, A. M., & Jacobucci, R. C. (2023). Machine
fusion. Social Science Research, 107, Article 102769. learning approaches to structural equation modeling. In R.
Bollen, K. A., Kirby, J. B., Curran, P. J., Paxton, P. M., & H. Hoyle (Ed.), Handbook of structural equation model-
Chen, F. (2007). Latent variable models under misspecifi- ing (2nd ed., pp. 722–740). Guilford Press.
cation: Two-stage least squares (2SLS) and maximum Brandmaier, A. M., von Oertzen, T., McArdle, J. J., & Lin-
likelihood (ML) estimators. Sociological Methods & denberger, U. (2013). Structural equation model trees. Psy-
Research, 36(1), 48–86. chological Methods, 18(1), 71–86.
Bollen, K. A., & Pearl, J. (2013). Eight myths about causal- Breckler, S. J. (1990). Applications of covariance structure
ity and structural equation models. In S. L. Morgan (Ed.), modeling in psychology: Cause for concern? Psychologi-
Handbook of causal analysis for social research (pp. 301– cal Bulletin, 107(2), 260–273.
328). Springer. Breitsohl, H. (2019). Beyond ANOVA: An introduction to
Bollen, K. A., & Stine, R. A. (1993). Bootstrapping goodness- structural equation models for experimental designs.
of-fit measures in structural equation models. In K. A. Bol- Organizational Research Methods, 22(3), 649–677.
len & J. S. Long (Eds.), Testing structural equation models Breivik, E., & Olsson, U. H. (2001). Adding variables to
(pp. 111–135). Sage. improve fit: The effect of model size on fit assessment in
Bollen, K. A., & Ting, K.-F. (1993). Confirmatory tetrad anal- LISREL. In R. Cudeck, S. Du Toit, & D. Sörbom (Eds.),
ysis. Sociological Methodology, 23, 147–175. Structural equation modeling: Present and future. A Fest-
Bollen, K. A., & Ting, K.-F. (2000). A tetrad test for causal schrift in honor of Karl Jöreskog (pp. 169–194). Scientific
indicators. Psychological Methods, 5(1), 3–22. Software International.
Bono, R., Blanca, M. J., Arnau, J., & Gómez-Benito, J. (2017). Brito, C., & Pearl, J. (2003). A new identification condition
Non-normal distributions commonly used in health, edu- for recursive models with correlated errors. Structural
cation, and social sciences: A systematic review. Frontiers Equation Modeling, 9(4), 459–474.
in Psychology, 8, Article 1602. Brønnick, K., Lervåg, A., Resaland, G. K., & Moe, V. (2017).
Boomsma, A. (1985). Nonconvergence, improper solutions, Executive functions do not mediate prospective relations
and starting values in LISREL maximum likelihood esti- between indices of physical activity and academic perfor-
mation. Psychometrika, 50(2), 229–242. mance: The Active Smarter Kids (ASK) study. Frontiers
Boomsma, A., Hoyle, R. H., & Panter, A. T. (2012). The in Psychology, 8, Article 1088.
structural equation modeling research report. In R. H. Brosseau-Liard, P. E., & Savalei, V. (2014). Adjusting incre-
Hoyle (Ed.), Handbook of structural equation modeling mental fit indices for nonnormality. Multivariate Behav-
(pp. 341–358). Guilford Press. ioral Research, 49(5), 460–470.
Borsboom, D. (2017). A network theory of mental disorders. Brosseau-Liard. P. E., Savalei, V., & Li, L. (2012) An inves-
World Psychiatry, 16(1), 5–13. tigation of the sample performance of two nonnormal-
Borsboom, D., Deserno, M. K., Rhemtulla, M., Epskamp, S., ity corrections for RMSEA. Multivariate Behavioral
Fried, E. I., McNally, R. J., Robinaugh, D. J., Perugini, M., Research, 47(6), 904–930.
Dalege, J., Costantini, G., Isvoranu, A.-M., Wysocki, A. Brown, T. A. (2015). Confirmatory factor analysis for applied
C., van Borkulo, C. D., van Bork, R., & Waldorp, L. J. research (2nd ed.). Guilford Press.
(2021). Network analysis of multivariate data in psycho- Browne, M. W. (1982). Covariance structures. In D. M.
logical science. Nature Reviews Methods Primers, 1(1), Hawkins (Ed.), Topics in applied multivariate analysis
Article 58. (pp. 72–141). Cambridge University Press.
Bound, J., Jaeger, D. A., & Baker, R. M. (1995). Problems Browne, M. W. (1984). Asymptotically distribution-free
with instrumental variables estimation when the correla- methods in the analysis of covariance structures. Brit-
tion between the instruments and the endogenous explana- ish Journal of Mathematical and Statistical Psychology,
tory variable is weak. Journal of the American Statistical 37(1), 62–83.
Association, 90(430), 443–450. Browne, M. W., & Cudeck, R. (1993). Alternative ways
Box, G. E. P. (1976). Science and statistics. Journal of the of assessing model fit. In K. A. Bollen and J. S. Long
American Statistical Association, 71(356), 791–799. (Eds.), Testing structural equation models (pp. 136–162).
Box, G. E. P., & Cox, D. R. (1964). An analysis of transfor- Sage.
mations. Journal of the Royal Statistical Society: Series B Bryant, F. B., & Satorra, A. (2012). Principles and practice of
(Methodological), 26(2), 211–243. scaled difference chi-square testing. Structural Equation
Brailean, A., Aartsen, M. J., Muniz-Terrera, G., Prince, M., Modeling, 19(3), 372–398.
Prina, A. M., Comijs, H. C., Huisman, M., & Beekman, Brydges, C. R., Ozolnieks, K. L., & Roberts, G. (2017). Work-
A. (2017). Longitudinal associations between late-life ing memory–not processing speed–mediates fluid intelli-
depression dimensions and cognitive functioning: A cross- gence deficits associated with attention deficit/hyperac-
domain latent growth curve analysis. Psychological Medi- tivity disorder symptoms. Journal of Neuropsychology,
cine, 47(4), 690–702. 11(3), 362–377.

RefsKline5E.indd 444 3/22/2023 4:43:42 PM


References 445

Bryk, A. S., & Raudenbush, S. W. (1987). Application of hier- lack of measurement invariance. Structural Equation
archical linear models to assessing change. Psychological Modeling, 14(3) 464–504.
Bulletin, 101(1), 147–158. Chen, F. F., West, S. G., & Sousa, K. H. (2006). A comparison
Bullock, J. G., & Green, D. P. (2021). The failings of conven- of bifactor and second-order models of quality of life. Mul-
tional mediation analysis and a design-based alternative. tivariate Behavioral Research, 41(2), 189–225.
Advances in Methods and Practices in Psychological Sci- Chen, F., Bollen, K. A., Paxton, P., Curran, P. J., & Kirby, J. B.
ence, 4(4), 1–18. (2001). Improper solutions in structural equation models:
Burt, R. S. (1976). Interpretational confounding of unob- Causes, consequences, and strategies. Sociological Meth-
served variables in structural equation models. Sociologi- ods and Research, 29(4), 468–508.
cal Methods and Research, 5(1), 3–52. Chen, F., Curran, P. J., Bollen, K. A., & Paxton, P. (2008).
Byrne, B. M. (2016). Structural equation modeling with An empirical evaluation of the use of fixed cutoff points in
AMOS (3rd ed.). Routledge. RMSEA test statistic in structural equation models. Socio-
Byrne, B. M., & Crombie, G. (2003). Modeling and testing logical Methods & Research, 36(4), 462–494.
change: An introduction to the latent growth curve model. Chen, J. W., & Zhang, J. (2007). Comparing text-based and
Understanding Statistics, 2(3), 177–203. graphic user interfaces for novice and expert users. AMIA
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing Annual Symposium Proceedings, 2007, 125–129.
for the equivalence of factor covariance and mean struc- Chen, Y., Moustaki, I., & Zhang, S. (2023). On the estimation
tures: The issue of partial measurement invariance. Psy- of structural equation models with latent variables. In R.
chological Bulletin, 105(3), 456–466. H. Hoyle (Ed.), Handbook of structural equation model-
Cain, M. K., Zhang, Z., & Yuan, K.-H. (2017). Univariate and ing (2nd ed., pp. 145–162). Guilford Press.
multivariate skewness and kurtosis for measuring non- Cheng, C., Spiegelman, D., & Li, F. (2021). Estimating the
normality: Prevalence, influence and estimation. Behavior natural indirect effect and the mediation proportion via
Research Methods, 49(5), 1716–1735. the product method. BMC Medical Research Methodol-
Calin-Jageman, R. J., & Cumming, G. (2019). The new sta- ogy, 21(1), Article 253.
tistics for better science: Ask how much, how uncertain, Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial
and what else is known. American Statistician, 73(Suppl. invariance across groups: A reconceptualization and pro-
1), 271–280. posed new method. Journal of Management, 25(1), 1–27.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and Cheung, G. W., & Rensvold, R. B. (2002). Evaluating good-
discriminant validation by the multitrait–multimethod ness-of-fit indexes for testing measurement invariance.
matrix. Psychological Bulletin, 56(2), 81–105. Structural Equation Modeling, 9(2), 233–255.
Carvacho, G., Chaves, R., & Sciarrino, F. (2019). Perspective Choi, J., Fan, W., & Hancock, G. R. (2009). A note on confi-
on experimental quantum causality. Europhysics Letters, dence intervals for two-group latent mean effect size mea-
125(3), Article 30001. sures. Multivariate Behavioral Research, 44(3), 396–406.
Castanho Silva, B., Bosancianu, C. M., & Littvay, L. (2020). Chou, C. P., & Bentler, P. M. (2002). Model specification in
Multilevel structural equation modeling. Sage. structural equation modeling by imposing constraints.
Chakraborty, S., & Ghosh, M. (2012). Applications of Computational Statistics and Data Analysis, 41(2), 271–
Bayesian neural networks in prostate cancer study In R. 287.
Chakraborty, C. R. Rao, & P. Senn (Eds.), Handbook of Chou, C. P., & Huh, J. (2012). Model modification in struc-
statistics (Vol. 28, pp. 241–262). Elsevier. tural equation modeling. In R. Hoyle (Ed.), Handbook
Chalak, K., & White, H. (2011). Viewpoint: An extended of structural equation modeling (pp. 232–246). Guilford
class of instrumental variables for the estimation of causal Press.
effects. Canadian Journal of Economics, 44(1), 1–51. Chou, C.-C., Pressler, S. J., Giordani, B., & Fetzer, S. J.
Chambers, C. D. (2018). Introducing the transparency and (2015). Validation of the Chinese version of the CogState
openness promotion (TOP) guidelines and badges for open computerised cognitive assessment battery in Taiwanese
practices at Cortex. Cortex, 106, 316–318. patients with heart failure. Journal of Clinical Nursing,
Chang, W., Franke, G. R., & Lee, N. (2016). Comparing 24(21–22), 3147–3154.
reflective and formative measures: New insights from rel- Chou, J.-S., & Yang, J.-G. (2013). Evolutionary optimiza-
evant simulations. Journal of Business Research, 69(8), tion of model specification searches between project
3177–3185. management knowledge and construction engineering
Chatterjee, S., & Price, B. (1991). Regression analysis by performance. Expert Systems with Applications, 40(11),
example (2nd ed.). Wiley. 4414–4426.
Chen, B., & Pearl, J. (2014). Graphical tools of linear struc- Choudhary, A. (2015, November 13). Multidisciplinary
tural equation modeling. http://ftp.cs.ucla.edu/pub/stat_ research. Academike. https://www.lawctopus.com/
ser/r432.pdf academike/multidisciplinary-research/
Chen, F. F. (2007). Sensitivity of goodness of fit indexes to Cieciuch, J., Davidov, E., Schmidt, P., & Algesheimer, R.

RefsKline5E.indd 445 3/22/2023 4:43:42 PM


446 References

(2019). How to obtain comparable measures for cross- Cornoni-Huntley, J., Barbano, H. E., Brody, J. A., Cohen,
national comparisons. Kölner Zeitschrift für Soziologie B., Feldman, J. J., Kleinman, J. C., & Madans, J. (1983).
und Sozialpsychologie, 71(1), 157–186. National Health and Nutrition Examination I—Epide-
Cliff, N. (1983). Some cautions concerning the application miologic followup survey. Public Health Reports, 98(3),
of causal modeling methods. Multivariate Behavioral 245–251.
Research, 18(1), 115–126. Cortina, J. M., Green, J. P., Keeler, K. R., & Vandenberg, R.
Clifton, A., & Webster, G. D. (2017). An introduction to J. (2017). Degrees of freedom in SEM: Are we testing the
social network analysis for personality and social psy- models that we claim to test? Organizational Research
chologists. Social Psychological and Personality Science, Methods, 20(3), 350–378.
8(4), 442–453. Cortina, J. M., Markell-Goldstein, H. M., Green, J. P., &
Coffman, D. L., & Millsap, R. E. (2006). Evaluating latent Chang, Y. (2021). How are we testing interactions in latent
growth curve models using individual fit statistics. Struc- variable models? surging forward or fighting shy? Organi-
tural Equation Modeling, 13(1), 1–27. zational Research Methods, 24(1), 26–54.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Costa, P., Magalhães, E., & Costa, M. J. (2013). A latent
Applied multiple regression/correlation for the behavioral growth model suggests that empathy of medical students
sciences (3rd ed.). Routledge. does not decline over time. Advances in Health Sciences
Cole, D. A., Ciesla, J. A., & Steiger, J. H. (2007). The insidi- Education, 18(3), 509–522.
ous effects of failing to include design-driven correlated Crawford, J. R. (2007). SBDIFF.EXE [computer software].
residuals in latent-variable covariance structure analysis. http://homepages.abdn.ac.uk/j.crawford/pages/dept/
Psychological Methods, 12(4), 381–398. sbdiff.htm
Cole, D. A., & Maxwell, S. E. (2003). Testing mediational Crosswell, A. D., & Lockwood, K. G. (2020). Best practices
models with longitudinal data: Questions and tips in the for stress measurement: How to measure psychological
use of structural equation modeling. Journal of Abnormal stress in health research. Health Psychology Open, 7(2),
Psychology, 112(4), 558–577. 1–12.
Cole, D. A., & Preacher, K. J. (2014). Manifest variable path Crowne, D. P., & Marlowe, D. (1960). A new scale of social
analysis: Potentially serious and misleading consequences desirability independent of psychopathology. Journal of
due to uncorrected measurement error. Psychological Consulting Psychology, 24(4), 349–354.
Methods, 19(2), 300–315. Cudeck, R. (1989). Analysis of correlation matrices using
Cole, S. R., Platt, R W., Schisterman, E. F., Chu, H., West- covariance structure models. Psychological Bulletin,
reich, D., Richardson, D., & Pool, C. (2010). Illustrating 105(2), 317–327.
bias due to conditioning on a collider. International Jour- Cudeck, R., & Henly, S. J. (1991). Model selection in covari-
nal of Epidemiology, 39(2), 417–420. ance structures analysis and the “problem” of sample size:
Collier, J. E. (2020). Applied structural equation modeling A clarification. Psychological Bulletin, 109(3), 512–519.
using AMOS: Basic to advanced techniques. Routledge. Culkin, J. M. (1967, March 18). A schoolman’s guide to Mar-
Collins, L. M., Schafer, J. L., & Kam, C.-M. (2001). A com- shall McLuhan. Saturday Review, 51–53, 70–72.
parison of inclusive and restrictive strategies in modern Cumming, G., & Calin-Jageman, R. (2017). Introduction to
missing data procedures. Psychological Methods, 6(4), the new statistics: Estimation, open science, and beyond.
330–351. Routledge.
Combrisson, E., & Jerbi, K. (2015). Exceeding chance level Curran, P. J. (2003). Have multilevel models been struc-
by chance: The caveat of theoretical chance levels in brain tural equation models all along? Multivariate Behavioral
signal classification and statistical assessment of decod- Research, 38(4), 529–569.
ing accuracy. Journal of Neuroscience Methods, 250, Curran, P. J., & Bauer, D. J. (2007). Building path diagrams
126–136. for multilevel models. Psychological Methods, 12(3), 283–
Comeau, J., & Boyle, M. H. (2018). Patterns of poverty expo- 297.
sure and children’s trajectories of externalizing and inter- Curran, P. J., Bollen. K. A., Chen, F., Paxton, P., & Kirby, J.
nalizing behaviors. SSM-Population Health, 4, 86–94. (2003). Finite sampling properties of the point estimates
Comrey, A. L., & Lee, H. B. (Eds.) (1992). A first course in and confidence intervals of the RMSEA. Sociological
factor analysis. Erlbaum. Methods & Research, 32(2), 208–252.
Cooper, H. (2011). Reporting research in psychology: How to Curran, P. J., Obeidat, K., & Losardo, D. (2010). Twelve fre-
meet journal article reporting standards. American Psy- quently asked questions about growth curve modeling.
chological Association. Journal of Cognition and Development, 11(2), 121–136.
Cooper, S. R., Jackson, J. J., Barch, D. M., & Braver, T. S. Curran, P. J., West, S. G., & Finch, J. F. (1996). The robust-
(2019). Neuroimaging of individual differences: A latent ness of test statistics to nonnormality and specification
variable modeling perspective. Neuroscience and Biobe- error in confirmatory factor analysis. Psychological Meth-
havioral Reviews, 98, 29–46 ods, 1(1), 16–29.

RefsKline5E.indd 446 3/22/2023 4:43:42 PM


References 447

Curran, T., Hill, A. P., & Niemiec, C. P. (2013). A conditional multi-item and single-item scales for construct measure-
process model of children’s behavioral engagement and ment: A predictive validity perspective. Journal of the
behavioral disaffection in sport based on self-determi- Academy of Marketing Science, 40(3), 434–449.
nation theory. Journal of Sport & Exercise Psychology, Dienes, Z. (2016). How Bayes factors change scientific prac-
35(1), 30–43. tice. Journal of Mathematical Psychology, 72, 78–89.
Daly, A., Dekker, T., & Hess, S. (2016). Dummy coding vs Diggle, P. D., & Kenward, M. G. (1994). Informative drop-out
effects coding for categorical variables: Clarifications and in longitudinal data analysis. Journal of the Royal Statisti-
extensions. Journal of Choice Modelling, 21, 36–41. cal Society, Series C, Applied Statistics, 43(1), 49–93.
Davidov, E., Muthén, B., & Schmidt, P. (Eds.) (2018). Mea- Dijkstra, T. K. (2017). A perfect match between a model and a
surement invariance [Special issue]. Sociological Methods mode. In H. Latan & R. Noonan Cham (Eds.), Partial least
& Research, 47(4). squares path modeling: Basic concepts, Methodological
Davidson, L., & White, W. (2007). The concept of recovery issues and applications (pp. 55–80). Springer.
as an organizing principle for integrating mental health Dijkstra, T. K., & Henseler, J. (2011). Linear indices in
and addiction services. Journal of Behavioral Health Ser- nonlinear structural equation models: best fitting proper
vices & Research, 34(2), 109–120. indices and other composites. Quality & Quantity, 45(6),
Davies, N. M., Smith, G. D., Windmeijer, F., & Martin, R. M. Article 1505.
(2013). Issues in the reporting and conduct of instrumen- Dijkstra, T. K., & Henseler, J. (2015a). Consistent and asymp-
tal variable studies: A systematic review. Epidemiology, totically normal PLS estimators for linear structural
24(3), 363–369. equations. Computational Statistics & Data Analysis, 81,
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausi- 10–23.
bility of multivariate normality assumption when multiply Dijkstra, T. K., & Henseler, J. (2015b). Consistent partial least
imputing non-Gaussian continuous outcomes: A simula- squares path modeling. MIS Quarterly, 39(2), 297–316.
tion assessment. Journal of Statistical Computation and Ding, P., Vanderweele, T. J., & Robins, J. M. (2017). Instru-
Simulation, 78(1), 69–84. mental variables as bias amplifiers with general outcome
Dempster A. P., Laird, N. M., & Rubin, D. B. (1977). Maxi- and confounding. Biometrika, 104(2), 291–302.
mum Likelihood from incomplete data via the EM Algo- DiStefano, C., & Dombrowski, S. C. (2006). Investigating
rithm. Journal of the Royal Statistical Society: Series B the theoretical structure of the Stanford-Binet—Fifth Edi-
(Methodological), 39(1), 1–38. tion. Journal of Psychoeducational Assessment, 24(2),
Deng, L., Yang, M., & Marcoulides, K. M. (2018). Structural 123–136.
equation modeling with many variables: A systematic Dolan, C. V., & Molenaar, P. C. M. (1991). A comparison of
review of issues and developments. Frontiers in Psychol- four methods of calculating standard errors of maximum-
ogy, 9, Article 580. likelihood estimators in the analysis of covariance struc-
Depaoli, S. (2021). Bayesian structural equation modeling. ture. British Journal of Mathematical and Statistical Psy-
Guilford Press. chology, 44(2), 359–368.
Depaoli, S., Kaplan, D., & Winter, S. D. (2023). Foundations Dong, Y., & Dumas, D. (2020). Are personality measures
and extensions of Bayesian structural equation modeling. valid for different populations? A systematic review of
In R. H. Hoyle (Ed.), Handbook of structural equation measurement invariance across cultures, gender, and
modeling (2nd ed., pp. 701–721). Guilford Press. age. Personality and Individual Differences, 160, Article
Derogatis, L., Rickels, K., & Rock, A. (1976). The SCL-90 109956.
and the MMPI: A step in the validation of a new self-report Dong, Y., & Peng, C.-Y. J. (2013). Principled missing data
scale. British Journal of Psychiatry, 128(3), 280–289. methods for researchers. SpringerPlus, 2(1), Article 222.
Desai, R. J., Mahesri, M., Abdia, Y., Barberio, J., Tong, A., Drasgow, F., & Hulin, C. L. (1990). Item response theory.
Zhang, D., Mavros, P., Kim, S. C., & Franklin, J. M. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of
(2018). Association of osteoporosis medication use after industrial and organizational psychology (pp. 577–636).
hip fracture with prevention of subsequent nonvertebral Consulting Psychologists Press.
fractures. JAMA Network Open, 1(3), Article e180826. Duncan, S. C., & Duncan, T. E. (1996). A multivariate latent
Deshon, R. P. (2004). Measures are not invariant across growth curve analysis of adolescent substance use. Struc-
groups without error variance homogeneity. Psychology tural Equation Modeling, 3(4), 323–347.
Science, 46(1), 137–149. Duncan, T. E., & Duncan, S. C. (2009). The ABC’s of LGM:
Devlieger, I., Mayer, A., & Rosseel, Y. (2016). Hypothesis An introductory guide to latent variable growth curve
testing using factor score regression: A comparison of four modeling. Social and Personality Psychology Compass,
methods. Educational and Psychological Measurement, 3(6), 979–991.
76, 741–770. Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2006). An
Diamantopoulos, A., Sarstedt, M., Fuchs, C., Wilczynski, introduction to latent variable growth curve modeling:
P., & Kaiser, S. (2012). Guidelines for choosing between Concepts, issues, and application (2nd ed.). Routledge.

RefsKline5E.indd 447 3/22/2023 4:43:42 PM


448 References

Dunn, G., Everitt, B., & Pickles, A. (2020). Modelling covari- Eisenhauer, J. G. (2003). Regression through the origin.
ances and latent variables using EQS. Chapman & Hall/ Teaching Statistics, 25(3), 76–80.
CRC. Elwert, F. (2013). Graphical causal models. In S. L. Morgan
Dunn, K. J., & McCray, G. (2020). The place of the bifactor (Ed.), Handbook of causal analysis for social research
model in confirmatory factor analysis investigations into (pp. 245–273). Springer.
construct dimensionality in language testing. Frontiers in Elwert, F., & Winship, C. (2014). Endogenous selection bias:
Psychology, 11, Article 1357. The problem of conditioning on a collider variable. Annual
Dunn, W. M., III. (2005) A quick proof that the least squares Review of Sociology, 40(1), 31–53.
formulas give a local minimum. College Mathematics Enders, C. K. (2005). An SAS macro for implementing the
Journal, 36(1), 64–65. modified Bollen–Stine bootstrap for missing data: Imple-
Dwivedi, A. K., Mallawaarachchi, I., & Alvarado, L. A. menting the bootstrap using existing structural equation
(2017). Analysis of small sample size studies using non- modeling software. Structural Equation Modeling, 12(4),
parametric bootstrap test with pooled resampling method. 620–641.
Statistics in Medicine, 36(14), 2187–2205. Enders, C. K. (2010). Applied missing data analysis. Guilford
Dziak, J. J., Coffman, D. L., Lanza, S. T., Li, R., & Jermiin, L. Press.
S. (2020). Sensitivity and specificity of information crite- Enders, C. K. (2011). Missing not at random models for latent
ria. Briefings in Bioinformatics, 21(2), 553–565. growth curve analyses. Psychological Methods, 16(1),
Edwards, J. R. (2009). Seven deadly myths of testing modera- 1–16.
tion in organizational research. In C. E. Lance & R. J. Van- Enders, C. K. (2013). Analyzing structural equation mod-
denberg (Eds.), Statistical and methodological myths and els with missing data. In G. R. Hancock & R. O. Muel-
urban legends: Doctrine, verity and fable in the organiza- ler (Eds.), Structural equation modeling: A second course
tional and social sciences (pp. 143–164). Taylor & Francis. (2nd ed., pp. 493–520). Information Age Publishing.
Edwards, J. R. (2011). The fallacy of formative measurement. Enders, C. K. (2023). Fitting structural equation models with
Organizational Research Methods, 14(2), 370–388. missing data. In R. H. Hoyle (Ed.), Handbook of structural
Edwards, J. R., & Lambert, L. S. (2007). Methods for inte- equation modeling (2nd ed., pp. 223–240). Guilford Press.
grating moderation and mediation: A general analytical Enders, C. K., & Bandalos, D. L. (2001). The relative perfor-
framework using moderated path analysis. Psychological mance of full information maximum likelihood estimation
Methods, 12(1), 1–22. for missing data in structural equation models. Structural
Edwards, M. C., Wirth, R. J., Houts, C. R., & Xi, N. (2012). Equation Modeling, 8(3), 430–457.
Categorical data in the structural equation modeling Epskamp, S. (2022). semPlot: Path diagrams and visual anal-
framework. In R. Hoyle (Ed.), Handbook of structural ysis of various SEM packages’ Output. (R package 1.1.6).
equation modeling (pp. 195–208). Guilford Press. https://CRAN.R-project.org/package=semPlot
Efron, B. (1987). Better bootstrap confidence intervals. Jour- Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern
nal of the American Statistical Association, 82(397), robust statistical methods: An easy way to maximize the
171–184. accuracy and power of your research. American Psycholo-
Eich, E. (2014). Business not as usual. Psychological Science, gist, 63(7), 591–601.
25(1), 3–6. Ernst, A. F., & Albers, C. J. (2017). Regression assumptions
Eid, M., Koch, T., & Geiser, C. (2023). Multitrait–multi- in clinical psychology research practice—A systematic
method models. In R. H. Hoyle (Ed.), Handbook of struc- review of common misconceptions. PeerJ, 5, Article 3323.
tural equation modeling (2nd ed., pp. 349–366). Guilford Esposito Vinzi, V., Trinchera, L., & Amato, S. (2010). PLS
Press. path modeling: From foundations to recent developments
Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. and open issues for model assessment and improvement.
(2003). Separating trait effects from trait-specific method In V. E. Vinzi, W. W. Chin, J. Henseler, & H. Wang (Eds.),
effects in multitrait-multimethod models: A multiple-indi- Handbook of partial least squares (pp. 47–82). Springer-
cator CT-C(M–1) model. Psychological Methods, 8(1), Verlag.
38–60. Fabrigar, L. R., & Wegener, D. T. (2012). Exploratory factor
Eid, M., Nussbeck, F. W., Geiser, C., Cole, D. A., Gollwitzer, analysis. Oxford University Press.
M., & Lischetzke, T. (2008). Structural equation model- Fairchild, A. J., & MacKinnon, D. P. (2009). A general model
ing of multitrait–multimethod data: Different models for for testing mediation and moderation effects. Prevention
different types of methods. Psychological Methods, 13(3), Science, 10(1), 87–99.
230–253. Falk, R. F., & Miller, N. B. (1992). A primer for soft modeling.
Einstein, A. (1916/1997). Max Ernst (A. Engel, Trans.). In A. University of Akron Press.
J. Kox, M. J. Klein, & R. Schulman (Eds.), The collected Falke, A., Schröder, N., & Endres, H. (2020). A first fit index
papers of Albert Einstein (Vol. 6, pp. 141–145). Princeton on estimation accuracy in structural equation models
University Press. (Original work published 1916) Journal of Business Economics, 90(2), 277–302,

RefsKline5E.indd 448 3/22/2023 4:43:42 PM


References 449

Fan, W., & Hancock, G. R. (2006). Impact of post hoc mea- Fox, J., Nie, Z., & Byrnes, J. (2022). sem: Structural equation
surement model overspecification on structural parameter models (R package 3.1-15). https://CRAN.R-project.org/
integrity. Educational and Psychological Measurement, package=sem
66(5), 748–764. Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indexes to mis- item-response theory analysis of self-report measures of
specified structural or measurement model components: adult attachment. Journal of Personality and Social Psy-
Rationale of the two-index strategy revisited. Structural chology, 78(2), 350–365.
Equation Modeling, 12(3), 343–367. French, D. P., & Sutton, S. (2010). Reactivity of measurement
Fan, Y., Chen, J., Shirkey, G., John, R., Wu, S. R., Park, H., in health psychology: How much of a problem is it? What
& Shao, C. (2016). Applications of structural equation can be done about it? British Journal of Health Psychol-
modeling (SEM) in ecological studies: An updated review. ogy, 15(3), 453–468.
Ecological Processes, 5(1), Article 19. Fritz, M. S., Kenny, D. A., & MacKinnon, D. P. (2016). The
Feng, Y., & Hancock, G. R. (2023). Power analysis within a combined effects of measurement error and omitting
structural equation modeling framework. In R. H. Hoyle confounders in the single-mediator model. Multivariate
(Ed.), Handbook of structural equation modeling (2nd ed., Behavioral Research, 51(5), 681–697.
pp. 163–183). Guilford Press. Gagne, P., & Hancock, G. R. (2006). Measurement model
Fewell, Z., Smith, G. D., & Sterne, J. A. C. (2007). The impact quality, sample size, and solution propriety in confirma-
of residual and unmeasured confounding in epidemiologic tory factor models. Multivariate Behavioral Research,
studies: A simulation study. American Journal of Epide- 41(1), 65–83.
miology, 166(6), 646–655. Galimard, J.-E., Chevret, S., Protopopescu, C., & Resche-
Fiedler, K., Schott, M., & Meiser, T. (2011). What mediation Rigon, M. (2016). A multiple imputation approach for
analysis can (not) do. Journal of Experimental Social Psy- MNAR mechanisms compatible with Heckman’s model.
chology, 47(6), 1231–1236. Statistics in Medicine, 35(17), 2907–2920.
Finch, W. H., & Bolin, J. E. (2017). Multilevel modeling using Gana, K., & Broc, G. (2019). Structural equation modeling
Mplus. CRC Press. with lavaan. ISTE/Wiley.
Finch, W. H., & French, B. F. (2008). Using exploratory fac- Garn, A. C., & Simonton, K. L. (2022). Motivation beliefs,
tor analysis for locating invariant referents in factor invari- emotions, leisure time physical activity, and sedentary
ance studies. Journal of Modern Applied Statistical Meth- behavior in university students: A full longitudinal model
ods, 7(1), 223–233. of mediation. Psychology of Sport and Exercise, 58, Arti-
Finkel, S. E. (1995). Causal analysis with panel data. Sage. cle 102077.
Finney, S. J., & DiStefano, C. (2013). Nonnormal and cate- Geiser, C. (2021). Longitudinal structural equation model-
gorical data in structural equation modeling. In G. R. Han- ing with Mplus: A latent state-trait perspective. Guilford
cock & R. O. Mueller (Eds.), Structural equation model- Press.
ing: A second course (2nd ed., pp. 439–492). IAP. Geiser, C. (2023). Structural equation modeling with the
Fisher, F., Bollen, K., Gates, K., & Rönkkö, M. (2021). Mplus and lavaan programs. In R. H. Hoyle (Ed.), Hand-
MIIVsem: Model implied instrumental variable (MIIV) book of structural equation modeling (2nd ed., pp. 241–
estimation of structural equation models (R package 258). Guilford Press.
0.5.8). https://CRAN.R-project.org/package=MIIVsem Geiser, C., Bishop, J., & Lockhart, G. (2015). Collapsing fac-
Fisher, R. A. (1954). Statistical methods for research workers tors in multitrait-multimethod models: Examining conse-
(12th ed.). Oliver & Boyd. quences of a mismatch between measurement design and
Flora, D. B. (2020). Your coefficient alpha is probably wrong, model. Frontiers in Psychology, 6, Article 946.
but which coefficient omega is right? A tutorial on using R Gelman, A., & Loken, E. (2014). The statistical crisis in sci-
to obtain better reliability estimates. Advances in Methods ence. American Scientist, 102(6), 460–465.
and Practices in Psychological Science, 3(4) 484–501. Gerbing, D. (2022). lessR: Less code, more results (R package
Flora, D. B., & Curran, P. J. (2004). An empirical evalua- 4.1.9). https://CRAN.R-project.org/package=lessR
tion of alternative methods of estimation for confirmatory Gerbing, D. W., & Anderson, J. C. (1987). Improper solutions
factor analysis with ordinal data. Psychological Methods, in the analysis of covariance structures: Their interpret-
9(4), 466–491. ability and a comparison of alternate respecifications. Psy-
Flora, D. B., & Flake, J. K. (2017). The purpose and practice chometrika, 52(1), 99–111.
of exploratory and confirmatory factor analysis in psycho- Gerbing, D. W., & Anderson, J. C. (1993). Monte Carlo evalu-
logical research: Decisions for scale development and vali- ations of fit in structural equation models. In K. A. Bollen
dation. Canadian Journal of Behavioural Science, 49(2), & J. S. Long (Eds.), Testing structural equation models
78–88. (pp. 40–65). Sage.
Fox, J. (2020). Regression diagnostics: An introduction (2nd Geyer, C. J. (2011). Introduction to Markov Chain Monte
ed.). Sage. Carlo. In S. Brooks, A. Gelman, G. L. Jones, & X.-L.

RefsKline5E.indd 449 3/22/2023 4:43:42 PM


450 References

Meng (Eds.), Handbook of Markov Chain Monte Carlo ables to FIML-based structural equation models. Struc-
(pp. 3–48). CRC Press. tural Equation Modeling, 10(1), 80–100.
Ghisletta, P., & McArdle, J. J. (2012). Teacher’s corner: Latent Graham, J. W. (2012). Missing data: Analysis and design.
curve models and latent change score models estimated in Springer.
R. Structural Equation Modeling, 19(4), 651–682. Graham, J. W., & Coffman, D. L. (2012). Structural equation
Gigerenzer, G., & Murray, D. (1987). Cognition as intuitive modeling with missing data. In R. H. Hoyle (Ed.), Hand-
statistics. Erlbaum. book of structural equation modeling (pp. 277–295). Guil-
Gignac, G. E. (2008). Higher-order models versus direct hier- ford Press.
archical models: g as superordinate or breadth factor? Psy- Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007).
chology Science Quarterly, 50, 21–43. How many imputations are really needed? Some practi-
Glymour, C., Scheines, R., Spirtes, P., & Kelly, K. (1987). cal clarifications of multiple imputation theory. Prevention
Discovering causal structure. Academic Press. Science, 8(3), 206–213.
Glymour, M. M. (2006). Using causal diagrams to understand Green, S. B., Thompson, M. S., & Poirier, J. (2001). An
common problems in social epidemiology. In M. Oakes adjusted Bonferroni method for elimination of parameters
& J. Kaufman (Eds), Methods in social epidemiology in specification addition searches. Structural Equation
(pp. 387–422). Jossey–Bass. Modeling, 8(1), 18–39.
Golden, R. M. (2003). Discrepancy risk model selection test Greene, W. H. (2012). Economic analysis (7th ed.). Prentice
theory for comparing possibly misspecified or nonnested Hall.
models. Psychometrika, 68(2), 229–249. Gregorich, S. E. (2006). Do self-report instruments allow
Gomer, B., Jiang, G., & Yuan, K.-H. (2019). New effect size meaningful comparisons across diverse population
measures for structural equation modeling. Structural groups? Testing measurement invariance using the confir-
Equation Modeling, 26(3), 371–389. matory factor analysis framework. Medical Care, 44(11,
Gonzalez, O., Valente, M. J., Cheong, J., MacKinnon, D. P. Suppl. 3), S78–S94.
(2023). Mediation/indirect effects in structural equation Greiff, S., & Heene, M. (2017). Why psychological assess-
modeling. In R. H. Hoyle (Ed.), Handbook of structural ment needs to start worrying about model fit. European
equation modeling (2nd ed., pp. 409–426). Guilford Journal of Psychological Assessment, 33(5), 313–317.
Press. Grewal, R., Cote, J. A., & Baumgartner, H. (2004) Multicol-
Gonzalez, R., & Griffin, D. (2001). Testing parameters in linearity and measurement error in structural equation
structural equation modeling: Every “one” matters. Psy- models: Implications for theory testing. Marketing Sci-
chological Methods, 6(3), 258–269. ence, 23(4), 519–529.
Goodboy, A. K., & Kline, R. B. (2017). Statistical and prac- Grice, J. W. (2001). Computing and evaluating factor scores.
tical concerns with published communication research Psychological Methods, 6(4), 430–450.
featuring structural equation modeling. Communication Griffith, G. J., Morris, T. T., Tudball, M. J., Herbert, A., Man-
Research Reports, 34(1) 1–10. cano, G., Pike, L., Sharp, G. C., Sterne, J., Palmer, T. M.,
Gottfredson, N. C., Bauer, D. J., & Baldwin, S. A. (2014). Davey Smith, G., Tilling, K., Zuccolo, L., Davies, N. M.,
Modeling change in the presence of nonrandomly miss- & Hemani, G. (2020). Collider bias undermines our under-
ing data: Evaluating a shared parameter mixture model. standing of COVID-19 disease risk and severity. Nature
Structural Equation Modeling, 21(2), 196–209. Communications, 11(1), Article 5749.
Grace, J. B., & Bollen, K. A. (2005). Interpreting the results Grimm, K. J., & McArdle, J. J. (2023). Latent curve modeling
from multiple regression and structural equation mod- of longitudinal growth data. In R. H. Hoyle (Ed.), Hand-
els. Bulletin of the Ecological Society of America, 86(4), book of structural equation modeling (2nd ed., pp. 556–
283–295. 575). Guilford Press.
Grace, J. B., & Bollen, K. A. (2008). Representing general Grimm, K. J., Ram, N., & Estabrook, R. (2017). Growth
theoretical concepts in structural equation models: The modeling: Structural equation and multilevel modeling
role of composite variables. Environmental and Ecologi- approaches. Guilford Press.
cal Statistics, 15(2), 191–213. Grömping, U. (2015). Variable importance in regression mod-
Grace, J. B., Schoolmaster, D. R., Guntenspergen, G. R., Lit- els. WIREs Computational Statistics, 7, 137–152.
tle, A. M., Mitchell, B. R., & Miller, K. M. (2012). Guide- Gudergan, S. P., Ringle, C. M., Wende, S., & Will, A. (2008).
lines for a graph-theoretic implementation of structural Confirmatory tetrad analysis in PLS path modeling. Jour-
equation modeling. Ecosphere, 3(8), Article 73. nal of Business Research, 61(12), 1238–1249.
Graham, J. M., Guthrie, A. C., & Thompson, B. (2003). Guliyev, H. (2020). Determining the spatial effects of
Consequences of not interpreting structure coefficients in COVID-19 using the spatial panel data model. Spatial Sta-
published CFA research: A reminder. Structural Equation tistics, 38, Article 100443.
Modeling, 10(1), 142–153. Guo, Y., Lin, S., Guo, J., Lu, Z., & Shangguan, C. (2021).
Graham, J. W. (2003). Adding missing-data-relevant vari- Cross-cultural measurement invariance of divergent

RefsKline5E.indd 450 3/22/2023 4:43:42 PM


References 451

thinking measures. Thinking Skills and Creativity, 41, Article 100852.
Guttman, L. (1955). The determinacy of factor score matrices with implications for five other basic problems of common-factor theory. British Journal of Statistical Psychology, 8(2), 65–81.
Hair, J. F. (2021). Reflections on SEM: An introspective, idiosyncratic journey to composite-based structural equation modeling. SIGMIS Database, 52(SI), 101–113.
Hair, J. F., Sarstedt, M., Pieper, T. M., & Ringle, C. M. (2012). The use of partial least squares structural equation modeling in strategic management research: A review of past practices and recommendations for future applications. Long Range Planning, 45(5), 320–340.
Hair, J. F., Jr., Howard, M. C., & Nitzl, C. (2020). Assessing measurement model quality in PLS-SEM using confirmatory composite analysis. Journal of Business Research, 109, 101–110.
Hair, J. F., Jr., Hult, G. T. M., Ringle, C. M., & Sarstedt, M. (2022). A primer on partial least squares structural equation modeling (PLS-SEM) (3rd ed.). Sage.
Hair, J. F., Jr., Risher, J. J., Sarstedt, M., & Ringle, C. M. (2019). When to use and how to report the results of PLS-SEM. European Business Review, 31(1), 2–24.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7(1), 1–17.
Hammack-Brown, B., Fulmore, J. A., Keiffer, G. L., & Nimon, K. (2022). Finding invariance when noninvariance is found: An illustrative example of conducting partial measurement invariance testing with the automation of the factor-ratio test and list-and-delete procedure. Human Resource Development Quarterly, 33(2), 179–203.
Hancock, G. R., & Freeman, M. J. (2001). Power and sample size for the root mean square error of approximation of not close fit in structural equation modeling. Educational and Psychological Measurement, 61(5), 741–758.
Hancock, G. R., & French, B. F. (2013). Power analysis in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 117–159). IAP.
Hancock, G. R., & Liu, M. (2012). Bootstrapping standard errors and data–model fit statistics in structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 277–295). Guilford Press.
Hancock, G. R., & Mueller, R. O. (2011). The reliability paradox in assessing structural relations within covariance structure models. Educational and Psychological Measurement, 71(2), 306–324.
Hardt, J., Herke, M., & Leonhart, R. (2012). Auxiliary variables in multiple imputation in regression with missing X: A warning against including too many in small sample research. BMC Medical Research Methodology, 12, Article 184.
Harring, J. R., Strazzeri, M. M., & Blozis, S. A. (2021). Piecewise latent growth models: Beyond modeling linear-linear processes. Behavior Research Methods, 53(2), 593–608.
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46(6), 1251–1271.
Hayduk, L. A. (1996). LISREL issues, debates and strategies. Johns Hopkins University Press.
Hayduk, L., Cummings, G. G., Stratkotter, R., & Nimmo, M. (2003). Pearl's d-separation: One more step into causal thinking. Structural Equation Modeling, 10(2), 289–311.
Hayduk, L. A. (2006). Blocked-error-R2: A conceptually improved definition of the proportion of explained variance in models containing loops or correlated residuals. Quality & Quantity, 40(4), 629–649.
Hayduk, L. A. (2014). Shame for disrespecting evidence: The personal consequences of insufficient respect for structural equation model testing. BMC Medical Research Methodology, 14(1), Article 124.
Hayduk, L. A. (2016). Improving measurement-invariance assessments: Correcting entrenched testing deficiencies. BMC Medical Research Methodology, 16(1), Article 130.
Hayduk, L. A., Cummings, G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing! testing! one, two, three—Testing the theory in structural equation models! Personality and Individual Differences, 42(5), 841–850.
Hayduk, L. A., & Glaser, D. N. (2000). Jiving the four-step, waltzing around factor analysis, and other serious fun. Structural Equation Modeling, 7(1), 1–35.
Hayduk, L. A., & Littvay, L. (2012). Should researchers use single indicators, best indicators, or multiple indicators in structural equation models? BMC Medical Research Methodology, 12(1), Article 159.
Hayduk, L. A., Pazderka-Robinson, H., Cummings, G. G., Boadu, K., Verbeek, E. L., & Perks, T. A. (2007). The weird world, and equally weird measurement models: Reactive indicators and the validity revolution. Structural Equation Modeling, 14(2), 280–310.
Hayduk, L. A., Pazderka-Robinson, H., Cummings, G. C., Levers, M.-J. D., & Beres, M. A. (2005). Structural equation model testing and the quality of natural killer cell activity measurements. BMC Medical Research Methodology, 5(1), Article 1.
Hayes, A. F. (2022). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach (3rd ed.). Guilford Press.
Hayes, A. F., & Rockwood, N. J. (2020). Conditional process analysis: Concepts, computation, and advances in modeling of the contingencies of mechanisms. American Behavioral Scientist, 64(1), 19–54.
Heath, M. T. (2018). Scientific computing: An introductory survey (Rev. 2nd ed.). SIAM.
Heck, R. H., & Thomas, S. L. (2015). An introduction to multilevel modeling techniques: MLM and SEM approaches using Mplus. Routledge.
Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner,
M. (2011). Masking misfit in confirmatory factor analysis by increasing unique variances: A cautionary note on the usefulness of cutoff values of fit indices. Psychological Methods, 16(3), 319–336.
Heene, M., Hilbert, S., Freudenthaler, H. H., & Bühner, M. (2012). Sensitivity of SEM fit indexes with respect to violations of uncorrelated errors. Structural Equation Modeling, 19(1), 36–50.
Hejazi, N. S., Díaz, I., & Rudolph, K. (2022). medoutcon: Efficient natural and interventional causal mediation analysis (R package version 0.1.6). https://github.com/nhejazi/medoutcon
Hejazi, N. S., Rudolph, K. E., Van Der Laan, M. J., & Díaz, I. (2022). Nonparametric causal mediation analysis for stochastic interventional (in)direct effects. Biostatistics. Advance online publication.
Henley, A. B., Shook, C. L., & Peterson, M. (2006). The presence of equivalent models in strategic management research using structural equation modeling: Assessing and addressing the problem. Organizational Research Methods, 9(4), 516–535.
Henningsen, A., & Hamann, J. D. (2022). systemfit: Estimating systems of simultaneous equations (R package 1.1-28). https://CRAN.R-project.org/package=systemfit
Henseler, J. (2010). On the convergence of the partial least squares path modeling algorithm. Computational Statistics, 25(1), 107–120.
Henseler, J. (2017). Bridging design and behavioral research with variance-based structural equation modeling. Journal of Advertising, 46(1), 178–192.
Henseler, J. (2021). Composite-based structural equation modeling: Analyzing latent and emergent variables. Guilford Press.
Henseler, J. (2022). ADANCO 2.3 [Computer software]. Composite Modeling. https://www.composite-modeling.com/
Henseler, J., Dijkstra, T. K., Sarstedt, M., Ringle, C. M., Diamantopoulos, A., Straub, D. W., Ketchen, D. J., Hair, J. F., Hult, G. T. M., & Calantone, R. J. (2014). Common beliefs and reality about PLS: Comments on Rönkkö and Evermann (2013). Organizational Research Methods, 17(2), 182–209.
Henseler, J., Ringle, C. M., & Sarstedt, M. (2016). Testing measurement invariance of composites using partial least squares. International Marketing Review, 33(3), 405–431.
Henseler, J., & Schuberth, F. (2020). Using confirmatory composite analysis to assess emergent variables in business research. Journal of Business Research, 120, 147–156.
Hershberger, S. L. (1994). The specification of equivalent models before the collection of data. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis (pp. 68–105). Sage.
Hershberger, S. L., & Marcoulides, G. A. (2013). The problem of equivalent structural models. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 3–39). IAP.
Herzog, W., Boomsma, A., & Reinecke, S. (2007). The model-size effect on traditional and modified tests of covariance structures. Structural Equation Modeling, 14(3), 361–390.
Hinson, V. K., Cubo, E., Comella, C. L., Goetz, C. G., & Leurgans, S. (2005). Rating Scale for Psychogenic Movement Disorders: Scale development and clinimetric testing. Movement Disorders, 20(12), 1592–1597.
Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164.
Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. American Statistician, 55(1), 1–6.
Hollebeek, L. D., Glynn, M. S., & Brodie, R. J. (2014). Consumer brand engagement in social media: Conceptualization, scale development and validation. Journal of Interactive Marketing, 28(2), 149–165.
Howards, P. P., Schisterman, E. F., Poole, C., Kaufman, J. S., & Weinberg, C. R. (2012). "Toward a clearer definition of confounding" revisited with directed acyclic graphs. American Journal of Epidemiology, 176(6), 506–511.
Hoyle, R. H., & Isherwood, J. C. (2013). Reporting results from structural equation modeling analyses in Archives of Scientific Psychology. Archives of Scientific Psychology, 1, 14–22.
Hu, L.-T., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 76–99). Sage.
Hu, L.-T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3(4), 424–453.
Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55.
Hubona, G. S., Schuberth, F., & Henseler, J. (2021). A clarification of confirmatory composite analysis (CCA). International Journal of Information Management, 61, Article 102399.
Huck, S. W. (2016). Statistical misconceptions (Classic ed.). Routledge.
Huisman, S. M. H., Mahfouz, A., Batmanghelich, N. K., Lelieveldt, B. P. F., & Reinders, M. J. T. (2018). A structural equation model for imaging genetics using spatial transcriptomics. Brain Informatics, 5(2), Article 13.
Hulme, C., & Snowling, M. J. (2016). Reading disorders and dyslexia. Current Opinion in Pediatrics, 28(6), 731–735.
Hung, J., O'Neill, R. T., Bauer, P., & Kohne, K. (1997). The behavior of the p-value when the alternative hypothesis is true. Biometrics, 53(1), 11–22.
Hunsley, J., & Meyer, G. J. (2003). The incremental valid-
ity of psychological testing and assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15(4), 446–455.
Hurlbert, S. H., Levine, R. A., & Utts, J. (2019). Coup de grâce for a tough old bull: "Statistically significant" expires. American Statistician, 73(Suppl. 1), 352–357.
Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theory framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349.
Hwang, H., & Takane, Y. (2015). Generalized structured component analysis: A component-based approach to structural equation modeling. Taylor & Francis Group.
Hyman, H. (1955). Survey design and analysis: Principles, cases and procedures. The Free Press.
Iacobucci, D., Saldanha, N., & Deng, X. (2007). A meditation on mediation: Evidence that structural equations models perform better than regressions. Journal of Consumer Psychology, 17(2), 139–153.
Igolkina, A. A., & Meshcheryakov, G. (2020). semopy: A Python package for structural equation modeling. Structural Equation Modeling, 27(6), 952–963.
Igolkina, A. A., & Samsonova, M. G. (2018). SEM: Structural equation modeling in molecular biology. Biophysics, 63(2), 139–148.
International Committee of Medical Journal Editors. (2021). Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals. http://www.icmje.org/recommendations/
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), Article e124.
Jaccard, J., & Jacoby, J. (2020). Theory construction and model-building skills: A practical guide for social scientists (2nd ed.). Guilford Press.
Jackson, D. L. (2003). Revisiting sample size and number of parameter estimates: Some support for the N:q hypothesis. Structural Equation Modeling, 10(1), 128–141.
Jackson, D. L., Gillaspy, J. A., Jr., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14(1), 6–23.
Jacobucci, R. (2016). autoSEM: Specification search for SEM using heuristic optimization (R package 0.1.0). https://github.com/Rjacobucci/autoSEM
Jacobucci, R. (2021). regsem: Regularized structural equation modeling (R package 1.8.0). https://CRAN.R-project.org/package=regsem
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling, 23(4), 555–566.
James, L. R., & Brett, J. M. (1984). Mediators, moderators, and tests for mediation. Journal of Applied Psychology, 69(2), 307–321.
James, W. (1917). The varieties of religious experience: A study in human nature. Longmans, Green, and Co.
Jarvis, C. B., MacKenzie, S. B., & Podsakoff, P. M. (2003). A critical review of construct indicators and measurement model misspecification in marketing and consumer research. Journal of Consumer Research, 30(2), 199–218.
JASP Team. (2022). JASP (Version 0.16.1) [Computer software]. https://jasp-stats.org/
Jia, F., & Wu, W. (2019). Evaluating methods for handling missing ordinal data in structural equation modeling. Behavior Research Methods, 51(5), 2337–2355.
Jiang, G., & Yuan, K.-H. (2017). Four new corrected statistics for SEM with small samples and nonnormally distributed data. Structural Equation Modeling, 24(4), 479–494.
Jin, S. (2022). On inconsistency of the overidentification test for the model-implied instrumental variable approach. Structural Equation Modeling. Advance online publication.
Jin, S., Luo, H., & Yang-Wallentin, F. (2016). A simulation study of polychoric instrumental variable estimation in structural equation models. Structural Equation Modeling, 23(5), 680–694.
Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician), 47(1), 183–189.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532.
Johnson, D. R., & Young, R. (2011). Toward best practices in analyzing datasets with missing data: Comparisons and recommendations. Journal of Marriage and Family, 73(5), 926–945.
Jöreskog, K. (1969). A general approach to maximum likelihood confirmatory factor analysis. Psychometrika, 34(2), 183–204.
Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 294–316). Sage.
Jöreskog, K. G. (1999). How large can a standardized coefficient be? https://www.statmodel.com/download/Joreskog.pdf
Jöreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70(351), 631–639.
Jöreskog, K. G., Olsson, U. H., & Wallentin, F. Y. (2016). Multivariate analysis with LISREL. Springer.
Jöreskog, K. G., & Sörbom, D. (1976). LISREL III: Estimation of linear structural equation systems by maximum likelihood methods. National Educational Resources.
Jöreskog, K., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relationships by maximum likelihood and least squares methods. International Education Services.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8 user's reference guide. Scientific Software.
Jöreskog, K. G., & Sörbom, D. (2018). LISREL 10 for Win-
dows [Computer software]. Scientific Software International.
Jöreskog, K. G., & Sörbom, D. (2021). LISREL 11 for Windows [Computer software]. Scientific Software International. https://ssicentral.com/
Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2022). semTools: Useful tools for structural equation modeling (R package 0.5-6). https://CRAN.R-project.org/package=semTools
Jose, P. E. (2013). Doing statistical mediation and moderation. Guilford Press.
Jung, E., & Yoon, M. (2016). Comparisons of three empirical methods for partial factorial invariance: Forward, backward, and factor-ratio tests. Structural Equation Modeling, 23(4), 567–584.
Kaiser, H. F., & Dickman, K. (1962). Sample and population score matrices and sample correlation matrices from an arbitrary population correlation matrix. Psychometrika, 27(2), 179–182.
Kaplan, D. (2009). Structural equation modeling: Foundations and extensions (2nd ed.). Sage.
Kaplan, D., Harik, P., & Hotchkiss, L. (2001). Cross-sectional estimation of dynamic structural equation models in disequilibrium. In R. Cudeck, S. Du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future. A Festschrift in honor of Karl Jöreskog (pp. 315–339). Scientific Software International.
Kaufman, A. S., & Kaufman, N. L. (1983). K-ABC administration and scoring manual. American Guidance Service.
Kaufman, J. S., MacLehose, R. F., & Kaufman, S. (2004). A further critique of the analytic strategy of adjusting for covariates to identify biologic mediation. Epidemiologic Perspectives & Innovations, 1(1), Article 4.
Keith, T. Z. (1985). Questioning the K-ABC: What does it measure? School Psychology Review, 14(1), 9–20.
Kelley, K. (2022). MBESS: The MBESS R package (R package version 4.9.2). https://CRAN.R-project.org/package=MBESS
Kelley, K., & Lai, K. (2011). Accuracy in parameter estimation for the Root Mean Square Error of Approximation: Sample size planning for narrow confidence intervals. Multivariate Behavioral Research, 46(1), 1–32.
Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17(2), 137–152.
Kenny, D. (2011, September 6). Terminology and basics of SEM. http://www.davidakenny.net/cm/basics.htm
Kenny, D. (2018, September 15). Moderator variables: Introduction. http://davidakenny.net/cm/moderation.htm
Kenny, D. A. (1979). Correlation and causation. Wiley.
Kenny, D. A. (2012, September 4). Estimation with instrumental variables. https://davidakenny.net/cm/iv.htm
Kenny, D. A. (2020, June 5). Measuring model fit. https://davidakenny.net/cm/fit.htm
Kenny, D. A. (2021, May 4). Mediation. https://davidakenny.net/cm/mediate.htm#CI
Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112(1), 165–172.
Kenny, D. A., Kashy, D. A., & Bolger, N. (1998). Data analysis in social psychology. In D. Gilbert, S. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (Vol. 1, 4th ed., pp. 233–265). McGraw-Hill.
Kenny, D. A., & McCoach, D. B. (2003). Effect of the number of variables on measures of fit in structural equation modeling. Structural Equation Modeling, 10(3), 333–351.
Kenny, D. A., & Milan, S. (2012). Identification: A nontechnical discussion of a technical issue. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 145–163). Guilford Press.
Kenny, D., Kaniskan, B., & McCoach, D. B. (2015). The performance of RMSEA in models with small degrees of freedom. Sociological Methods & Research, 44(3), 486–507.
Kenward, M. G., & Molenberghs, G. (1998). Likelihood based frequentist inference when data are missing at random. Statistical Science, 13(3), 236–247.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217.
Kim, K. H. (2005). The relation among fit indexes, power, and sample size in structural equation modeling. Structural Equation Modeling, 12(3), 368–390.
Kim, M., Hsu, H.-Y., Kwok, O.-M., & Seo, S. (2018). The optimal starting model to search for the accurate growth trajectory in latent growth models. Frontiers in Psychology, 9, Article 349.
Kim-Spoon, J., Herd, T., Brieant, A., Peviani, K., Deater-Deckard, K., Lauharatanahirun, N., Lee, J., & King-Casas, B. (2021). Maltreatment and brain development: The effects of abuse and neglect on longitudinal trajectories of neural activation during risk processing and cognitive control. Developmental Cognitive Neuroscience, 48, Article 100939.
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd ed.). Guilford Press.
Kline, R. B. (2013a). Beyond significance testing: Statistics reform in the behavioral sciences (2nd ed.). American Psychological Association.
Kline, R. B. (2013b). Exploratory and confirmatory factor analysis. In Y. Petscher, C. Schatschneider, & D. L. Compton (Eds.), Applied quantitative analysis in the social sciences (pp. 171–207). Routledge.
Kline, R. B. (2013c). Reverse arrow dynamics: Feedback loops and formative measurement. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 39–76). Information Age Publishing.
Kline, R. B. (2015). The mediation myth. Basic and Applied Social Psychology, 37(4), 202–213.
Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). Guilford Press.
Kline, R. B. (2020a). Becoming a behavioral science researcher: A guide to producing research that matters (2nd ed.). Guilford Press.
Kline, R. B. (2020b). Post p value education in graduate statistics: Preparing tomorrow's psychology researchers for a postcrisis future. Canadian Psychology, 61(4), 331–341.
Kline, R. B. (2023). Assumptions in structural equation modeling. In R. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 128–144). Guilford Press.
Kline, R. B., Snyder, J., & Castellanos, M. (1996). Lessons from the Kaufman Assessment Battery for Children (K-ABC): Toward a new assessment model. Psychological Assessment, 8(1), 7–17.
Kmetz, J. L. (2019). Correcting corrupt research: Recommendations for the profession to stop misuse of p-values. American Statistician, 73(Suppl. 1), 36–73.
Knight, C. R., & Winship, C. (2013). The causal implications of mechanistic thinking: Identification using directed acyclic graphs (DAGs). In S. L. Morgan (Ed.), Handbook of causal analysis for social research (pp. 275–299). Springer.
Kock, N., & Hadaya, P. (2018). Minimum sample size estimation in PLS-SEM: The inverse square root and gamma-exponential methods. Information Systems Journal, 28(1), 227–261.
Kolenikov, S., & Bollen, K. A. (2012). Testing negative error variances: Is a Heywood case a symptom of misspecification? Sociological Methods & Research, 41(1), 124–167.
Koziol, N. A. (2023). Confirmatory measurement models for dichotomous and ordered polytomous indicators. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 277–295). Guilford Press.
Kraft, P., Zeggini, E., & Ioannidis, J. P. A. (2009). Replication in genome-wide association studies. Statistical Science, 24(4), 561–573.
Kühnel, S. (2001). The didactical power of structural equation modeling. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future. A Festschrift in honor of Karl Jöreskog (pp. 79–96). Scientific Software International.
Kwan, J. L. Y., & Chan, W. (2011). Comparing standardized coefficients in structural equation modeling: A model reparameterization approach. Behavior Research Methods, 43(3), 730–745.
Lachowicz, M. J., Preacher, K. J., & Kelley, K. (2018). A novel measure of effect size for mediation analysis. Psychological Methods, 23(2), 244–261.
Lai, K., Green, S. B., Levy, R., Reichenberg, R. E., Xu, Y., Thompson, M. S., Yel, N., Eggum-Wilkens, N. D., Kunze, K. L., & Iida, M. (2016). Assessing model similarity in structural equation modeling. Structural Equation Modeling, 23(4), 491–506.
Lai, K., Green, S. B., & Levy, R. (2017). Graphical displays for understanding SEM model similarity. Structural Equation Modeling, 24(6), 803–818.
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67–90.
Lance, C. E., Noble, C. L., & Scullen, S. E. (2002). A critique of the correlated trait-correlated method and correlated uniqueness models for multitrait-multimethod data. Psychological Methods, 7(2), 228–244.
Lang, K. M., & Little, T. D. (2018). Principled missing data treatments. Prevention Science, 19(3), 284–294.
Lanza, S. T., & Rhoades, B. L. (2013). Latent class analysis: An alternative perspective on subgroup analysis in prevention and treatment. Prevention Science, 14(2), 157–168.
Latan, H., & Noonan, R. (Eds.). (2017). Partial least squares path modeling: Basic concepts, methodological issues and applications. Springer.
Lawley, D. N. (1943). The application of the maximum likelihood method to factor analysis. British Journal of Psychology, 33(3), 172–175.
Lawton, M. P. (1975). The Philadelphia Geriatric Center Morale Scale: A revision. Journal of Gerontology, 30(1), 85–89.
Lee, H., Cashin, A. G., Lamb, S. E., Hopewell, S., Vansteelandt, S., VanderWeele, T. J., MacKinnon, D. P., Mansell, G., Collins, G. S., Golub, R. M., McAuley, J. H., & the AGReMA group. (2021). A guideline for reporting mediation analyses of randomized trials and observational studies: The AGReMA statement. Journal of the American Medical Association, 326(11), 1045–1056.
Lee, J., & Stankov, L. (2013). Higher-order structure of noncognitive constructs and prediction of PISA 2003 mathematics achievement. Learning and Individual Differences, 26, 119–130.
Lee, J., Tan, C. S., & Chia, K. S. (2009). A practical guide for multivariate analysis of dichotomous outcomes. Annals of the Academy of Medicine, Singapore, 38(8), 714–719.
Lee, S., & Hershberger, S. (1990). A simple rule for generating equivalent models in covariance structure modeling. Multivariate Behavioral Research, 25(3), 313–334.
Lee, S.-Y., Poon, W. Y., & Bentler, P. M. (1995). A two-stage estimation of structural equation models with continuous and polytomous variables. British Journal of Mathematical and Statistical Psychology, 48(2), 339–358.
Lefcheck, J. S. (2016). piecewiseSEM: Piecewise structural equation modelling in R for ecology, evolution, and systematics. Methods in Ecology and Evolution, 7(5), 573–579.
Lefcheck, J. S. (2020). piecewiseSEM: Piecewise structural equation modeling (R package 2.1.2). https://CRAN.R-project.org/package=piecewiseSEM
Lei, M., & Lomax, R. G. (2005). The effect of varying degrees of nonnormality in structural equation modeling. Structural Equation Modeling, 12(1), 1–27.
Leite, W. L., Bandalos, D. L., & Shen, Z. (2023). Simulation methods in structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 110–127). Guilford Press.
Lek, K., Oberski, D., Davidov, E., Cieciuch, J., Seddig, D., & Schmidt, P. (2018). Approximate measurement invariance. In T. P. Johnson, B.-E. Pennell, I. A. Stoop, & B. Dorer (Eds.), Advances in comparative survey methods: Multinational, multiregional, and multicultural contexts (3MC) (pp. 911–932). Wiley.
Levy, R., & Hancock, G. R. (2007). A framework of statistical tests for comparing mean and covariance structure models. Multivariate Behavioral Research, 42(1), 33–66.
Levy, R., & Hancock, G. R. (2011). An extended model comparison framework for covariance and mean structure models, accommodating multiple groups and latent mixtures. Sociological Methods & Research, 40(2), 256–278.
Leys, C., Klein, O., Dominicy, Y., & Ley, C. (2018). Detecting multivariate outliers: Use a robust variant of the Mahalanobis distance. Journal of Experimental Social Psychology, 74, 150–156.
Li, C.-H. (2016). Confirmatory factor analysis with ordinal data: Comparing robust maximum likelihood and diagonally weighted least squares. Behavior Research Methods, 48(3), 936–949.
Li, C.-H. (2021). Statistical estimation of structural equation models with a mixture of continuous and categorical observed variables. Behavior Research Methods, 53(5), 2191–2213.
Li, L., & Bentler, P. (2006). Robust statistical tests for evaluating the hypothesis of close fit of misspecified mean and covariance structural models. Department of Statistics, University of California, Los Angeles. https://escholarship.org/uc/item/4t29r830
Liang, X., & Jacobucci, R. (2020). Regularized structural equation modeling to detect measurement bias: Evaluation of lasso, adaptive lasso, and elastic net. Structural Equation Modeling, 27(5), 722–734.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140), 5–53.
Lin, L.-C., Huang, P.-H., & Weng, L.-J. (2017). Selecting path models in SEM: A comparison of model selection criteria. Structural Equation Modeling, 24(6), 855–869.
Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198–1202.
Little, R. J. A., D'Agostino, R., Cohen, M. L., Dickersin, K., Emerson, S. S., Farrar, J. T., Frangakis, C., Hogan, J. W., Molenberghs, G., Murphy, S. A., Neaton, J. D., Rotnitzky, A., Scharfstein, D., Shih, W. J., Siegel, J. P., & Stern, H. (2012). The prevention and treatment of missing data in clinical trials. New England Journal of Medicine, 367(14), 1355–1360.
Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.
Little, T. D., Card, N. A., Slegers, D. W., & Ledford, E. C. (2007). Representing contextual effects in multiple-group MACS models. In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.), Modeling contextual effects in longitudinal studies (pp. 121–147). Erlbaum.
Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for multivariate measurement and modeling with latent variables: When "good" indicators are bad and "bad" indicators are good. Psychological Methods, 4(2), 192–211.
Little, T. D., Preacher, K. J., Card, N. A., & Selig, J. P. (2007). New developments in latent variable panel analyses of longitudinal data. International Journal of Behavioral Development, 31(4), 357–365.
Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling, 13(1), 59–72.
Liu, M., Harbaugh, A. G., Harring, J. R., & Hancock, G. R. (2017). The effect of extreme response and non-extreme response styles on testing measurement invariance. Frontiers in Psychology, 8, Article 726.
Liu, X. (2016). Methods and applications of longitudinal data analysis. Academic Press.
Liu, Y., Millsap, R. E., West, S. G., Tein, J.-Y., Tanaka, R., & Grimm, K. J. (2017). Testing measurement invariance in longitudinal data with ordered-categorical measures. Psychological Methods, 22(3), 486–506.
Llabre, M. M., Spitzer, S., Siegel, S., Saab, P. G., & Schneiderman, N. (2004). Applying latent growth curve modeling to the investigation of individual differences in cardiovascular recovery from stress. Psychosomatic Medicine, 66(1), 29–41.
Loehlin, J. C., & Beaujean, A. A. (2017). Latent variable models: An introduction to factor, path, and structural equation analysis (5th ed.). Routledge.
Loehlin, J. C., Horn, J. M., & Willerman, L. (1990). Heredity, environment, and personality change: Evidence from the Texas Adoption Project. Journal of Personality, 58, 221–243.
Loh, W. W., Moerkerke, B., Loeys, T., & Vansteelandt, S. (2022). Disentangling indirect effects through multiple mediators without assuming any causal structure among the mediators. Psychological Methods, 27(6), 982–999.
Lohmöller, J.-B. (1984). LVPLS 1.6 program manual: Latent variable path analysis with partial least squares [Computer software]. Universität zu Köln, Zentralarchiv für Empirische Sozialforschung.
Lohmöller, J.-B. (1989). Latent variable path modeling with partial least squares. Physica Heidelberg.
Lubke, G. H., & Campbell, I. (2016). Inference based on the best-fitting model can contribute to the replication crisis: Assessing model selection uncertainty using a bootstrap approach. Structural Equation Modeling, 23(4), 479–490.
Lubke, G. H., Campbell, I., McArtor, D., Miller, P., Luningham, J., & van den Berg, S. M. (2017). Assessing model selection uncertainty using a bootstrap approach: An update. Structural Equation Modeling, 24(2), 230–245.
Lúcio, P. S., Salum, G., Swardfager, W., Mari, J. de J., Pan, P. M., Bressan, R. A., Gadelha, A., Rohde, L. A., & Cogo-Moreira, H. (2017). Testing measurement invariance across groups of children with and without Attention-Deficit/Hyperactivity Disorder: Applications for word recognition and spelling tasks. Frontiers in Psychology, 8, Article 1891.
Luo, S., Song, R., Styner, M., Gilmore, J. H., & Zhu, H. (2019). FSEM: Functional structural equation models for twin functional data. Journal of the American Statistical Association, 114(525), 344–357.
Lynam, D. R., Moffitt, T., & Stouthamer-Loeber, M. (1993). Explaining the relation between IQ and delinquency: Class, race, test motivation, or self-control? Journal of Abnormal Psychology, 102(2), 187–196.
Lynch, K. G., Cary, M., Gallop, R., & Ten Have, T. R. (2008). Causal mediation analyses for randomized trials. Health Services and Outcomes Research Methodology, 8(2), 57–76.
MacCallum, R. C. (1986). Specification searches in covariance structure modeling. Psychological Bulletin, 100(1), 107–120.
MacCallum, R. C. (1995). Model specification: Procedures, strategies, and related issues. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 16–36). Sage.
MacCallum, R. C., & Austin, J. T. (2000). Applications of structural equation modeling in psychological research. Annual Review of Psychology, 51(1), 201–236.
MacCallum, R. C., & Browne, M. W. (1993). The use of causal indicators in covariance structure models: Some practical issues. Psychological Bulletin, 114(3), 533–541.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1(2), 130–149.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114(1), 185–199.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19–40.
MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation procedures in MIS and behavioral research: Integrating new and existing techniques. MIS Quarterly, 35(2), 293–334.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. Erlbaum.
MacKinnon, D. P., Fairchild, A. J., & Fritz, M. S. (2007). Mediation analysis. Annual Review of Psychology, 58, 593–614.
MacKinnon, D. P., Krull, J. L., & Lockwood, C. M. (2000). Equivalence of the mediation, confounding and suppression effect. Prevention Science, 1(4), 173–181.
MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7(1), 83–104.
Madans, J. H., Kleinman, J. C., Cox, C. S., Barbano, H. E., Feldman, J. J., Cohen, B., Finucane, F. F., & Cornoni-Huntley, J. (1986). 10 years after NHANES I—Report of initial followup, 1982–84. Public Health Reports, 101(5), 465–473.
Mai, R., Niemand, T., & Kraus, S. (2021). A tailored-fit model evaluation strategy for better decisions about structural equation models. Technological Forecasting & Social Change, 173, Article 121142.
Mai, Y., Xu, Z., Zhang, Z., & Yuan, K. (2022). An open source WYSIWYG web application for drawing path diagrams of structural equation models. https://semdiag.psychstat.org/
Majewska, J. (2015). Identification of multivariate outliers – problems and challenges of visualization methods. Studia Ekonomiczne, 247, 69–83.
Malhotra, N. K., Agarwal, J., & Shainesh, G. (2018). Does country or culture matter in global marketing? An empirical investigation of service quality and satisfaction model with moderators in three countries. In J. Agarwal & T. Wu (Eds.), Emerging issues in global marketing (pp. 61–91). Springer.
Manly, C. A., & Wells, R. S. (2015). Reporting the use of multiple imputation for missing data in higher education research. Research in Higher Education, 56(4), 397–409.
Maraun, M. D., & Halpin, P. F. (2008). Manifest and latent variates. Measurement, 6(1–2), 113–117.
Marchetti, G. M., Drton, M., & Sadeghi, K. (2020). ggm: Graphical Markov models with mixed graphs (R package 2.5). https://CRAN.R-project.org/package=ggm
Marcoulides, G. A., & Ing, M. (2012). Automated structural equation modeling strategies. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 690–704). Guilford Press.
Marcoulides, K. M., & Falk, C. F. (2018). Model specification searches in structural equation modeling with R. Structural Equation Modeling, 25(3), 484–491.
Marcoulides, K. M., & Yuan, K.-H. (2017). New ways to evaluate goodness of fit: A note on using equivalence testing to assess structural equation models. Structural Equation Modeling, 24(1), 148–153.
Marcoulides, K. M., Yuan, K.-H., & Deng, L. (2023). Structural equation modeling with small samples and many variables. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 525–542). Guilford Press.
Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika, 57(3), 519–530.
Markland, D. (2007). The golden rule is that there are no golden rules: A commentary on Paul Barrett's recommendations for reporting model fit in structural equation modelling. Personality and Individual Differences, 42(5), 851–858.
Markus, K. A. (2018). Three conceptual impediments to developing scale theory for formative scales. Methodology, 14(4), 156–163.
Marsh, H. W., & Bailey, M. (1991). Confirmatory factor analysis of multitrait-multimethod data: A comparison of alternative models. Applied Psychological Measurement, 15(1), 47–70.
Marsh, H. W., & Balla, J. (1994). Goodness of fit in confirmatory factor analysis: The effects of sample size and model parsimony. Quality and Quantity, 28(2), 185–217.
Marsh, H. W., Balla, J. R., & Hau, K.-T. (1996). An evaluation of incremental fit indices: A clarification of mathematical and empirical properties. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling (pp. 315–353). Erlbaum.
Marsh, H. W., & Grayson, D. (1995). Latent variable models of multitrait-multimethod data. In R. H. Hoyle (Ed.), Structural equation modeling (pp. 177–198). Sage.
Marsh, H. W., & Hau, K.-T. (1999). Confirmatory factor analysis: Strategies for small sample sizes. In R. H. Hoyle (Ed.), Statistical strategies for small sample research (pp. 252–284). Sage.
Marsh, H. W., Morin, A. J. S., Parker, P. D., & Kaur, G. (2014). Exploratory structural equation modeling: An integration of the best features of exploratory and confirmatory factor analysis. Annual Review of Clinical Psychology, 10(1), 85–110.
Mastrotheodoros, S., Kornienko, O., Umaña-Taylor, A., & Motti-Stefanidi, F. (2021). Developmental interplay between ethnic, national, and personal identity in immigrant adolescents. Journal of Youth and Adolescence, 50(6), 1126–1139.
Mathworks, Inc. (2022). MATLAB (Version 9.12.0.1884302 (R2022a)) [Computer software]. https://www.mathworks.com/
Matsunaga, M. (2008). Item parceling in structural equation modeling: A primer. Communication Methods and Measures, 2(4), 260–293.
Mauro, R. (1990). Understanding L.O.V.E. (left out variables error): A method for estimating the effects of omitted variables. Psychological Bulletin, 108(2), 314–329.
Maxwell, S. E., & Cole, D. A. (2007). Bias in cross-sectional analyses of longitudinal mediation. Psychological Methods, 12(1), 23–44.
Maxwell, S. E., Cole, D. A., & Mitchell, M. A. (2011). Bias in cross-sectional analyses of longitudinal mediation: Partial and complete mediation under an autoregressive model. Multivariate Behavioral Research, 46(5), 816–841.
Maydeu-Olivares, A. (2017a). Assessing the size of model misfit in structural equation models. Psychometrika, 82(3), 533–558.
Maydeu-Olivares, A. (2017b). Maximum likelihood estimation of structural equation models for continuous data: Standard errors and goodness of fit. Structural Equation Modeling, 24(3), 383–394.
Maydeu-Olivares, A., & Shi, D. (2017). Effect sizes of model misfit in structural equation models. Methodology, 13(Suppl.), 23–30.
Maydeu-Olivares, A., Shi, D., & Rosseel, Y. (2018). Assessing fit in structural equation models: A Monte-Carlo evaluation of RMSEA versus SRMR confidence intervals and tests of close fit. Structural Equation Modeling, 25(3), 389–402.
McArdle, J. J. (2009). Latent variable modeling of differences and changes with longitudinal data. Annual Review of Psychology, 60(1), 577–605.
McArdle, J. J., & Epstein, D. (1987). Latent growth curves within developmental structural equation models. Child Development, 58(1), 110–133.
McArdle, J. J., & McDonald, R. P. (1984). Some algebraic properties of the Reticular Action Model for moment structures. British Journal of Mathematical and Statistical Psychology, 37(2), 234–251.
McCoach, D. B., Black, A. C., & O'Connell, A. A. (2007). Errors of inference in structural equation modeling. Psychology in the Schools, 44(5), 461–470.
McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7(1), 64–82.
McIntosh, C. N. (2012). Improving the evaluation of model fit in confirmatory factor analysis: A commentary on Gundy, C. M., Fayers, P. M., Groenvold, M., Petersen, M. Aa., Scott, N. W., Sprangers, M. A. J., Velikov, G., & Aaronson, N. K. (2011), Comparing higher-order models for the EORTC QLQ-C30, Quality of Life Research, doi:10.1007/s11136-011-0082-6. Quality of Life Research, 21, 1619–1621.
McNeish, D. (2016). On using Bayesian methods to address small sample problems. Structural Equation Modeling, 23(5), 750–773.
McNeish, D. (2020). Relaxing the proportionality assumption in latent basis models for nonlinear growth. Structural Equation Modeling, 27(5), 817–824.
McNeish, D., & Wolf, M. G. (2020). Dynamic fit index cutoffs for confirmatory factor analysis models [Unpublished manuscript]. PsyArXiv Preprints. https://psyarxiv.com/v8yru/
McShane, B. B., & Gal, D. (2016). Blinding us to the obvious? The effect of statistical training on the evaluation of evidence. Management Science, 62(6), 1707–1718.
Meade, A. W., & Bauer, D. J. (2007). Power and precision in confirmatory factor analytic tests of measurement invariance. Structural Equation Modeling, 14(4), 611–635.
Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93(3), 568–592.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7(4), 361–388.
Mehta, P. D., & Neale, M. C. (2005). People are variables too: Multilevel structural equation modeling. Psychological Methods, 10(3), 259–284.
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13(2), 127–143.
Meredith, W., & Tisak, J. (1990). Latent curve analysis. Psychometrika, 55(1), 107–122.
Merkle, E. C., You, D., & Preacher, K. J. (2016). Testing nonnested structural equation models. Psychological Methods, 21(2), 151–163.
Meshcheryakov, G., Igolkina, A. A., & Samsonova, M. G. (2021). semopy 2: A structural equation modeling package with random effects in Python. arXiv. https://arxiv.org/abs/2106.01140
Michell, J. (2013). Constructs, inferences, and mental measurement. New Ideas in Psychology, 31(1), 13–21.
Mikics, D. (Ed.). (2012). The annotated Emerson. Belknap Press.
Miles, J., & Shevlin, M. (2007). A time and a place for incremental fit indices. Personality and Individual Differences, 42(5), 869–874.
Millsap, R. E. (2007). Structural equation modeling made difficult. Personality and Individual Differences, 42(5), 875–881.
Millsap, R. E. (2011). Statistical approaches to measurement invariance. Routledge.
Millsap, R. E., & Olivera-Aguilar, M. (2012). Investigating measurement invariance using confirmatory factor analysis. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 380–392). Guilford Press.
Mohan, K., & Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116(534), 1023–1037.
Molenaar, D., Dolan, C. V., Wicherts, J. M., & van der Maas, H. L. J. (2010). Modeling differentiation of cognitive abilities within the higher-order factor model using moderated factor analysis. Intelligence, 38(6), 611–624.
Molina, K. M., Alegría, M., & Mahalingam, R. (2013). A multiple-group path analysis of the role of everyday discrimination on self-rated physical health among Latina/os in the USA. Annals of Behavioral Medicine, 45(1), 33–45.
Mooijaart, A., & Satorra, A. (2009). On insensitivity of the chi-square model test to nonlinear misspecification in structural equation models. Psychometrika, 74(3), 443–455.
Moon, K.-W. (2019). R package processR (R package 0.2.7). https://rpubs.com/cardiomoon/468602
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.
Morikawa, K., Kim, J. K., & Kano, Y. (2017). Semiparametric maximum likelihood estimation with data missing not at random. Canadian Journal of Statistics, 45(4), 393–409.
Morin, A. J. S. (2023). Exploratory structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 503–524). Guilford Press.
Morrison, T. G., Morrison, M. A., & McCutcheon, J. M. (2017). Best practice recommendations for using structural equation modelling in psychological research. Psychology, 8, 1326–1341.
Moshagen, M. (2012). The model size effect in SEM: Inflated goodness-of-fit statistics are due to the size of the covariance matrix. Structural Equation Modeling, 19(1), 86–98.
Moss, T. P., Lawson, V., White, P., & The Appearance Research Collaboration. (2015). Identification of the underlying factor structure of the Derriford Appearance Scale 24. PeerJ, 3, Article e1070.
Mowbray, F. I., Fox-Wasylyshyn, S. M., & El-Masri, M. M. (2019). Univariate outliers: A conceptual overview for the nurse researcher. Canadian Journal of Nursing Research, 51(1), 31–37.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE, 7(2), Article e32734.
Mueller, R. O., & Hancock, G. R. (2008). Best practices in structural equation modeling. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 488–508). Sage.
Mulaik, S. A. (2009a). Foundations of factor analysis (2nd ed.). Chapman & Hall/CRC.
Mulaik, S. A. (2009b). Linear causal modeling with structural equations. CRC Press.
Mulaik, S. A., & Millsap, R. E. (2000). Doing the four-step right. Structural Equation Modeling, 7(1), 36–73.
Munafò, M. R., Tilling, K., Taylor, A. E., Evans, D. M., & Smith, G. D. (2018). Collider scope: When selection bias can substantially influence observed associations. International Journal of Epidemiology, 47(1), 226–235.
Murphy, S. A., Chung, I.-J., & Johnson, L. C. (2002). Patterns of mental distress following the violent death of a child and predictors of change over time. Research in Nursing & Health, 25(6), 425–437.
Muthén, B. (2001). Second-generation structural equation modeling with a combination of categorical and continuous latent variables: New opportunities for latent class/latent growth modeling. In L. M. Collins & A. Sayer (Eds.), New methods for the analysis of change (pp. 291–322). American Psychological Association.
Muthén, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49(1), 115–132.
Muthén, B. O. (1991). Analysis of longitudinal data using latent variable models with varying parameters. In L. M. Collins & J. L. Horn (Eds.), Best methods for analysis of change: Recent advances, unanswered questions, future directions (pp. 1–17). American Psychological Association.
Muthén, B., & Asparouhov, T. (2002). Latent variable analy-
sis with categorical outcomes: Multiple-group and growth modeling in Mplus. https://www.statmodel.com/download/webnotes/CatMGLong.pdf
Muthén, B., & Asparouhov, T. (2018). Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociological Methods & Research, 47(4), 637–664.
Muthén, B., Asparouhov, T., Hunter, A. M., & Leuchter, A. F. (2011). Growth modeling with nonignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16(1), 17–33.
Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user's guide (8th ed.). Muthén & Muthén.
Myers, N. D., Ahn, S., & Jin, Y. (2011). Sample size and power estimates for a confirmatory factor analytic model in exercise and sport: A Monte Carlo approach. Research Quarterly for Exercise and Sport, 82(3), 412–423.
Naimi, A. I., Kaufman, J. S., & MacLehose, R. F. (2014). Mediation misgivings: Ambiguous clinical and public health interpretations of natural direct and indirect effects. International Journal of Epidemiology, 43(5), 1656–1661.
National Institutes of Health. (2018, November). Resources for preparing your application. U.S. Department of Health & Human Services, National Institutes of Health. https://grants.nih.gov/policy/reproducibility/resources.htm
Neale, M. C., Hunter, M. D., Pritikin, J. N., Zahery, M., Brick, T. R., Kirkpatrick, R. M., Estabrook, R., Bates, T. C., Maes, H. H., & Boker, S. M. (2016). OpenMx 2.0: Extended structural equation and statistical modeling. Psychometrika, 81(2), 535–549.
Nestler, S. (2013). A Monte Carlo study comparing PIV, ULS and DWLS in the estimation of dichotomous confirmatory factor analysis. British Journal of Mathematical and Statistical Psychology, 66(1), 127–143.
Nevitt, J., & Hancock, G. R. (2000). Improving the root mean square error of approximation for nonnormal conditions in structural equation modeling. Journal of Experimental Education, 68(3), 251–268.
Nevitt, J., & Hancock, G. R. (2001). Performance of bootstrapping approaches to model test statistics and parameter standard error estimation in structural equation modeling. Structural Equation Modeling, 8(3), 353–377.
Newsom, J. T. (2015). Longitudinal structural equation modeling: A comprehensive introduction. Routledge.
Nezlek, J. B. (2008). An introduction to multilevel modeling for social and personality psychology. Social and Personality Psychology Compass, 2(2), 842–860.
Nguyen, T. Q., Schmid, I., & Stuart, E. A. (2021). Clarifying causal mediation analysis for the applied researcher: Defining effects based on what we want to learn. Psychological Methods, 26(2), 255–271.
Niemand, T., & Mai, R. (2018). Flexible cutoff values for fit indices in the evaluation of structural equation models. Journal of the Academy of Marketing Science, 46(6), 1148–1172.
Nilsson, A., Bonander, C., Strömberg, U., & Björk, J. (2021). A directed acyclic graph for interactions. International Journal of Epidemiology, 50(2), 613–619.
Nimon, K., & Reio, T. G., Jr. (2011). Measurement invariance: A foundational principle for quantitative theory building. Human Resource Development Review, 10(2), 198–214.
Nitzl, C., & Chin, W. W. (2017). The case of partial least squares (PLS) path modeling in managerial accounting research. Journal of Management Control, 28(2), 137–156.
Noonan, R., & Wold, H. (1982). PLS path modeling with indirectly observed variables. In K. G. Jöreskog & H. Wold (Eds.), Systems under indirect observation: Causality, structure, prediction, Part II. North-Holland.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. PNAS, 115(11), 2600–2606.
Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226.
Nunkoo, R., Ramkissoon, H., & Gursoy, D. (2013). Use of structural equation modeling in tourism research: Past, present, and future. Journal of Travel Research, 52(6), 759–771.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
Nye, C. D., & Drasgow, F. (2011). Assessing goodness of fit: Simple rules of thumb simply do not work. Organizational Research Methods, 14(3), 548–570.
O'Boyle, E. H., Jr., & Williams, L. J. (2011). Decomposing model fit: Measurement vs. theory in organizational research using latent variables. Journal of Applied Psychology, 96(1), 1–12.
O'Brien, R. M. (1994). Identification of simple measurement models with multiple latent variables and correlated errors. Sociological Methodology, 24, 137–170.
O'Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41(5), 673–690.
O'Laughlin, K. D., Martin, M. J., & Ferrer, E. (2018). Cross-sectional analysis of longitudinal mediation processes. Multivariate Behavioral Research, 53(3), 375–402.
O'Rourke, N., & Hatcher, L. (2013). A step-by-step approach to using SAS for factor analysis and structural equation modeling (2nd ed.). SAS Institute Inc.
Oakes, M. (1986). Statistical inference. Wiley.
Oberski, D. L., & Satorra, A. (2013). Measurement error models with uncertainty about the error variance. Structural Equation Modeling, 20(3), 409–428.
Ockey, G. J., & Choi, I. (2015). Structural equation modeling reporting practices for language assessment. Language Assessment Quarterly, 12(3), 305–319.
Ogden, C. L., Fryar, C. D., Carroll, M. D., & Flegal, K. M. (2004). Mean body weight, height, and body mass index, United States 1960–2002. Advance Data, 2004(347), 1–17.
Olaru, G., Schroeders, U., Hartung, J., & Wilhelm, O. (2019). A tutorial on novel item and person sampling procedures for personality research. European Journal of Personality, 33(3), 400–419.
Oldenburg, R. (2020). Structural equation modeling: Comparing two approaches. Mathematica Journal, 22, 1–17.
Olsson, U. H., Foss, T., & Breivik, E. (2004). Two equivalent discrepancy functions for maximum likelihood estimation: Do their test statistics follow a non-central chi-square distribution under model misspecification? Sociological Methods and Research, 32(4), 453–500.
Olsson, U. H., Foss, T., Troye, S. V., & Howell, R. D. (2000). The performance of ML, GLS, and WLS estimation in structural equation modeling under conditions of misspecification and non-normality. Structural Equation Modeling, 7(4), 557–595.
Ondé, D., & Alvarado, J. M. (2020). Reconsidering the conditions for conducting confirmatory factor analysis. Spanish Journal of Psychology, 23, Article E55.
Oppong, F. B., & Agbedra, S. Y. (2016). Assessing univariate and multivariate normality: A guide for non-statisticians. Mathematical Theory and Modeling, 6(2), 26–33.
Osborne, J. W. (2013). Best practices in data screening. Sage.
Osborne, J. W., & Fitzpatrick, D. C. (2012). Replication analysis in exploratory factor analysis: What it is and why it makes your analysis better. Practical Assessment, Research & Evaluation, 17, Article 15.
Osborne, J., & Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research & Evaluation, 8(2). http://pareonline.net/getvn.asp?v=8&n=2
Ou, L., Chow, S.-M., Ji, L., & Molenaar, P. C. M. (2017). (Re)evaluating the implications of the autoregressive latent trajectory model through likelihood ratio tests of its initial conditions. Multivariate Behavioral Research, 52(2), 178–199.
Panwar, M. S., Yadav, C. P., Singh, H., Jawa, T. M., & Sayed-Ahmed, N. (2022). Latent growth curve modeling for COVID-19 cases in presence of time-variant covariate. Computational Intelligence and Neuroscience, 2022, Article 3538866.
Patrician, P. A. (2002). Multiple imputation for missing data. Research in Nursing & Health, 25(1), 76–84.
Pavlov, G., Shi, D., & Maydeu-Olivares, A. (2020). Chi-square difference tests for comparing nested models: An evaluation with non-normal data. Structural Equation Modeling, 27(6), 908–917.
Paxton, P., Hipp, J. R., & Marquart-Pyatt, S. (2011). Nonrecursive models: Endogeneity, reciprocal relationships, and feedback loops. Sage.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688.
Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge University Press.
Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press.
Pearl, J. (2012). The causal foundations of structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 68–91). Guilford Press.
Pearl, J. (2014). Interpretation and identification of causal mediation. Psychological Methods, 19(4), 459–481.
Pearl, J. (2023). The causal foundations of structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 49–75). Guilford Press.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Erlbaum.
Pek, J., & Hoyle, R. H. (2016). On the (in)validity of tests of simple mediation: Threats and solutions. Social and Personality Psychology Compass, 10(3), 150–163.
Peng, H.-L., Hsueh, H.-W., Chang, Y.-H., & Li, R.-H. (2021). The mediation and suppression effect of demoralization in breast cancer patients after primary therapy: A structural equation model. Journal of Nursing Research, 29(2), Article e144.
Perry, J. (2021). Trust in public institutions: Trends and implications for economic security (Policy brief 108). United Nations Department of Economic and Social Affairs. https://www.un.org/development/desa/dspd/2021/07/trust-public-institutions/
Peters, C. L. O., & Enders, C. (2002). A primer for the estimation of structural equation models in the presence of missing data. Journal of Targeting, Measurement and Analysis for Marketing, 11(1), 81–95.
Petersen, M. L., Sinisi, S. E., & van der Laan, M. J. (2006). Estimation of direct causal effects. Epidemiology, 17(3), 276–284.
Pett, M. A., Lackey, N. R., & Sullivan, J. J. (2003). Making sense of factor analysis: The use of factor analysis for instrument development. Sage.
Petter, S. (2018). “Haters gonna hate”: PLS and information systems research. SIGMIS Database, 49(2), 10–13.
Phahladira, L., Asmal, L., Kilian, S., Chiliza, B., Scheffler, F., Luckhoff, H. K., du Plessis, S., & Emsley, R. (2019). Changes in insight over the first 24 months of treatment in schizophrenia spectrum disorders. Schizophrenia Research, 206, 394–399.
Pietrzak, R. H., Olver, J., Norman, T., Piskulic, D., Maruff, P., & Snyder, P. J. (2009). A comparison of the CogState Schizophrenia Battery and the measurement and treatment research to improve cognition in schizophrenia (MATRICS) battery in assessing cognitive impairment in chronic schizophrenia. Journal of Clinical and Experimental Neuropsychology, 31(7), 848–859.
Pilgrim, C. C., Schulenberg, J. E., O’Malley, P. M., Bachman, J. G., & Johnston, L. D. (2006). Mediators and moderators of parental involvement on substance use: A national study of adolescents. Prevention Science, 7(1), 75–89.
Pinter, J. (1996). Continuous global optimization software: A brief review. Optima, 52, 1–8.
Pituch, K. A., & Stevens, J. P. (2016). Applied multivariate statistics for the social sciences: Analyses with SAS and IBM’s SPSS (6th ed.). Routledge.
Podsakoff, P. M., MacKenzie, S. B., Lee, J.-Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879–903.
Pohl, S., & Steyer, R. (2010). Modeling common traits and method effects in multitrait-multimethod analysis. Multivariate Behavioral Research, 45(1), 45–72.
Porte, G. (Ed.). (2012). Replication research in applied linguistics. Cambridge University Press.
Preacher, K. J. (2006). Quantifying parsimony in structural equation modeling. Multivariate Behavioral Research, 41(3), 227–259.
Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4), 717–731.
Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, Instruments, & Computers, 40(3), 879–891.
Preacher, K. J., & Kelley, K. (2011). Effect size measures for mediation models: Quantitative strategies for communicating indirect effects. Psychological Methods, 16(2), 93–115.
Preacher, K. J., & Merkle, E. C. (2012). The problem of model selection uncertainty in structural equation modeling. Psychological Methods, 17(1), 1–14.
Preacher, K. J., Rucker, D. D., & Hayes, A. F. (2007). Addressing moderated mediation hypotheses: Theory, methods, and prescriptions. Multivariate Behavioral Research, 42(1), 185–227.
Preacher, K. J., Wichman, A. L., MacCallum, R. C., & Briggs, N. E. (2008). Latent growth curve modeling. Sage.
Preacher, K. J., & Yaremych, H. E. (2023). Model selection in structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 206–222). Guilford Press.
Preacher, K. J., Zhang, G., Kim, C., & Mels, G. (2013). Choosing the optimal number of factors in exploratory factor analysis: A model selection perspective. Multivariate Behavioral Research, 48(1), 28–56.
Pritikin, J. N., Brick, T. R., & Neale, M. C. (2018). Multivariate normal maximum likelihood with both ordinal and continuous variables, and data missing at random. Behavior Research Methods, 50(2), 490–500.
Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90.
Qiu, W. (2021). powerMediation: Power/sample size calculation for mediation analysis (R package 0.3.4). https://CRAN.R-project.org/package=powerMediation
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Raborn, A., & Leite, W. (2020). ShortForm: Automatic short form creation (R package 0.4.6). https://CRAN.R-project.org/package=ShortForm
Rademaker, M. (2022, November 24). Postestimation: Assessing a model. https://cran.r-project.org/web/packages/cSEM/vignettes/Using-assess.html#df
Rademaker, M. E., & Schuberth, F. (2022). cSEM: Composite-based structural equation modeling (R package 0.5.0). https://CRAN.R-project.org/package=cSEM
Rademaker, M. E., Schuberth, F., & Dijkstra, T. K. (2019). Measurement error correlation within blocks of indicators in consistent partial least squares: Issues and remedies. Internet Research, 29(3), 448–463.
Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3), 385–401.
Raffard, S., Trouillet, R., Capdevielle, D., Gely-Nargeot, M. C., Bayard, S., Laroi, F., & Boulenger, J. P. (2010). French adaptation and validation of the scale to assess unawareness of mental disorder. Canadian Journal of Psychiatry, 55(8), 523–531.
Raftery, A. E. (1993). Bayesian model selection in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 163–180). Sage.
Ray, S., Danks, N. P., & Calero Valdez, A. (2022). seminr: Building and estimating structural equation models (R package 2.3.2). https://CRAN.R-project.org/package=seminr
Raykov, T. (2001). Approximate confidence interval for difference in fit of structural equation models. Structural Equation Modeling, 8(3), 458–469.
Raykov, T. (2004). Behavioral scale reliability and measurement invariance evaluation using latent variable modeling. Behavior Therapy, 35(2), 299–331.
Raykov, T., & Marcoulides, G. A. (2001). Can there be infinitely many models equivalent to a given covariance structure? Structural Equation Modeling, 8(1), 142–149.
Raykov, T., & Marcoulides, G. A. (2006). A first course in structural equation modeling (2nd ed.). Erlbaum.
Raykov, T., & Penev, S. (1999). On structural equation model equivalence. Multivariate Behavioral Research, 34(2), 199–244.
Raykov, T., & Penev, S. (2001). The problem of equivalent structural equation models: An individual perspective. In G. A. Marcoulides & R. E. Schumaker (Eds.), New developments in structural equation modeling (pp. 297–312). Erlbaum.
Rebueno, M. C. D. R., Tiongco, D. D. D., & Macindo, J. R. B. (2017). A structural equation model on the attributes of a skills enhancement program affecting clinical competence of pre-graduate nursing students. Nurse Education Today, 49, 180–186.
Reinartz, W. J., Haenlein, M., & Henseler, J. (2009). An empirical comparison of the efficacy of covariance-based and variance-based SEM. International Journal of Research in Marketing, 26(4), 332–344.
Reise, S. P., Mansolf, M., & Haviland, M. G. (2023). Bifactor measurement models. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 329–348). Guilford Press.
Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92(6), 544–559.
Rensvold, R. B., & Cheung, G. W. (1999). Identification of influential cases in structural equation models using the jackknife method. Organizational Research Methods, 2(3), 293–308.
Revelle, W. (2022). psych: Procedures for psychological, psychometric, and personality research (R package 2.2.5). https://CRAN.R-project.org/package=psych
Reynolds, C. R., & Suzuki, L. A. (2013). Bias in psychological assessment: An empirical review and recommendations. In J. R. Graham, J. A. Naglieri, & I. B. Weiner (Eds.), Handbook of psychology: Assessment psychology (pp. 82–113). Wiley.
Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354–373.
Rhemtulla, M., van Bork, R., & Borsboom, D. (2020). Worse than measurement error: Consequences of inappropriate latent variable measurement models. Psychological Methods, 25(1), 30–45.
Richardson, H. A., Simmering, M. J., & Sturman, M. C. (2009). A tale of three perspectives: Examining post hoc statistical techniques for detection and correction of common method variance. Organizational Research Methods, 12(4), 762–800.
Richiardi, L., Bellocco, R., & Zugna, D. (2013). Mediation analysis in epidemiology: Methods, interpretation and bias. International Journal of Epidemiology, 42(1), 1511–1519.
Richter, N. F., Sinkovics, R. R., Ringle, C. M., & Schlägel, C. (2016). A critical look at the use of SEM in international business research. International Marketing Review, 33(3), 376–404.
Riddles, M. K., Kim, J. K., & Im, J. (2016). A propensity-score-adjustment method for nonignorable nonresponse. Journal of Survey Statistics and Methodology, 4(2), 215–245.
Rigdon, E. E. (1995). A necessary and sufficient identification rule for structural models estimated in practice. Multivariate Behavioral Research, 30(3), 359–383.
Rigdon, E. E. (2012). Rethinking partial least squares path modeling: In praise of simple methods. Long Range Planning, 45(5–6), 341–358.
Rigdon, E. E. (2013). Partial least squares path modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 81–116). Information Age Publishing.
Rigdon, E. E. (2016). Choosing PLS path modeling as analytical method in European management research: A realist perspective. European Management Journal, 34(6), 598–605.
Rigdon, E. E., Becker, J.-M., & Sarstedt, M. (2019). Factor indeterminacy as metrological uncertainty: Implications for advancing psychological measurement. Multivariate Behavioral Research, 54(3), 429–443.
Rigdon, E. E., Preacher, K. J., Lee, N., Howell, R. D., Franke, G. R., & Borsboom, D. (2011). Avoiding measurement dogma: A response to Rossiter. European Journal of Marketing, 45(11–12), 1589–1600.
Rigdon, E. E., Sarstedt, M., & Ringle, C. M. (2017). On comparing results from CB-SEM and PLS-SEM. Journal of Research and Management, 39(3), 4–16.
Rindskopf, D. (1984). Structural equation models: Empirical identification, Heywood cases, and related problems. Sociological Methods & Research, 13(1), 109–119.
Ringle, C. M., Wende, S., & Becker, J.-M. (2022). SmartPLS (Version 4) [Computer software]. SmartPLS GmbH. https://www.smartpls.com
Rioux, C., Stickley, Z. L., Odejimi, O. A., & Little, T. D. (2020). Item parcels as indicators: Why, when, and how to use them in small sample research. In R. van de Schoot & M. Miočević (Eds.), Small sample size solutions: A guide for applied researchers (pp. 203–214). Routledge.
Robbins, B. G. (2012). Institutional quality and generalized trust: A nonrecursive causal model. Social Indicators Research, 107(2), 235–258.
Rogosa, D. R. (1988). Ballad of the casual modeler [Song]. https://web.stanford.edu/class/ed260/ballad.mp3
Roid, G. H. (2003). Stanford-Binet Intelligence Scales, Fifth Edition. Riverside Publishing.
Romney, D. M., Jenkins, C. D., & Bynner, J. M. (1992). A structural analysis of health-related quality of life dimensions. Human Relations, 45(2), 165–176.
Rönkkö, M., & Cho, E. (2022). An updated guideline for assessing discriminant validity. Organizational Research Methods, 25(1), 6–47.
Rönkkö, M., & Evermann, J. (2013). A critical examination of common beliefs about partial least squares path modeling. Organizational Research Methods, 16(3), 425–448.
Roos, J. M., & Bauldry, S. (2022). Confirmatory factor analysis. Sage.
Ropovik, I. (2015). A cautionary note on testing latent variable models. Frontiers in Psychology, 6, Article 1715.
Rosseel, Y. (2020). Small sample solutions for structural equation modeling. In R. van de Schoot & M. Miočević (Eds.), Small sample size solutions: A guide for applied researchers (pp. 203–214). Routledge.
Rosseel, Y., & Loh, W. W. (2021). A structural-after-measurement (SAM) approach to SEM. https://osf.io/pekbm/
Rosseel, Y., Jorgensen, T. D., & Rockwood, N. (2023). lavaan: Latent variable analysis (R package 0.6-13). https://CRAN.R-project.org/package=lavaan
Rossiter, J. R. (2011). Marketing measurement revolution: The C-OAR-SE method and why it must replace psychometrics. European Journal of Marketing, 45(11–12), 1561–1588.
Roth, D. L., Wiebe, D. J., Fillingim, R. B., & Shay, K. A. (1989). Life events, fitness, hardiness, and health: A simultaneous analysis of proposed stress-resistance effects. Journal of Personality and Social Psychology, 57(1), 136–142.
Rothman, K. J., & Greenland, S. (2018). Planning study size based on precision rather than power. Epidemiology, 29(5), 599–603.
Rousseeuw, P. J., & Hubert, M. (2018). Anomaly detection by robust statistics. WIREs Data Mining and Knowledge Discovery, 8(2), Article e1236.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322–331.
Rubin, D. B. (2009). Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups? Statistics in Medicine, 28(9), 1420–1423.
Rubin, D. B., & Little, R. J. A. (2020). Statistical analysis with missing data (3rd ed.). Wiley.
Rucker, D. D., McShane, B. B., & Preacher, K. J. (2015). A researcher’s guide to regression, discretization, and median splits of continuous variables. Journal of Consumer Psychology, 25(4), 666–678.
Rucker, D. D., Preacher, K. J., Tormala, Z. L., & Petty, R. E. (2011). Mediation analysis in social psychology: Current practices and new recommendations. Social and Personality Psychology Compass, 5(6), 359–371.
Rust, R. T., Lee, C., & Valente, E. (1995). Comparing covariance structure models: A general methodology. International Journal of Research in Marketing, 12(4), 279–291.
Rutkowski, L., & Svetina, D. (2014). Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educational and Psychological Measurement, 74(1), 31–57.
Ryder, A. G., Yang, J., Zhu, X., Yao, S., Yi, J., Heine, S. J., & Bagby, R. M. (2008). The cultural shaping of depression: Somatic symptoms in China, psychological symptoms in North America? Journal of Abnormal Psychology, 117(2), 300–313.
Ryu, E. (2015). Multiple-group analysis approach to testing group difference in indirect effects. Multivariate Behavioral Research, 47(2), 484–493.
Ryu, E., & Cheong, J. (2017). Comparing indirect effects in different groups in single-group and multi-group structural equation models. Frontiers in Psychology, 8, Article 747.
Sabatelli, R. M., & Bartle-Haring, S. (2003). Family-of-origin experiences and adjustment in married couples. Journal of Marriage and Family, 65(1), 159–169.
Sagan, C. (1996). The demon-haunted world: Science as a candle in the dark. Random House.
Salzberger, T., Sarstedt, M., & Diamantopoulos, A. (2016). Measurement in the social sciences: Where C-OAR-SE delivers and where it does not. European Journal of Marketing, 50(11), 1942–1952.
Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica, 26(3), 393–415.
Saris, W. E., & Satorra, A. (1993). Power evaluations in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 181–204). Sage.
Sarstedt, M., Hair, J. F., Ringle, C. M., Thiele, K. O., & Gudergan, S. P. (2016). Estimation issues with PLS and CBSEM: Where the bias lies! Journal of Business Research, 69(10), 3998–4010.
SAS Institute Inc. (2022). SAS/STAT user’s guide: The CALIS procedure.
Sass, D. A., Schmitt, T. A., & Marsh, H. W. (2014). Evaluating model fit with ordered categorical data within a measurement invariance framework: A comparison of estimators. Structural Equation Modeling, 21(2), 167–180.
Satorra, A., & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance structure analysis. In ASA 1988 Proceedings of the Business and Economic Statistics Section (pp. 308–313). American Statistical Association.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors on covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis (pp. 399–419). Sage.
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66(4), 507–514.
Satorra, A., & Bentler, P. M. (2010). Ensuring positiveness of the scaled chi-square test statistic. Psychometrika, 75(2), 243–248.
Satorra, A., & Saris, W. E. (1985). Power of the likelihood ratio test in covariance structure analysis. Psychometrika, 50(1), 83–90.
Satterthwaite, F. E. (1941). Synthesis of variance. Psychometrika, 6(5), 309–316.
Sauvé, G., Kline, R. B., Shah, J. L., Joober, R., Malla, A., Brodeur, M. B., & Lepage, M. (2019). Cognitive capacity similarly predicts insight into symptoms in first- and multiple-episode psychosis. Schizophrenia Research, 206, 236–243.
Savalei, V. (2010). Expected versus observed information in SEM with incomplete normal and nonnormal data. Psychological Methods, 15(4), 352–367.
Savalei, V. (2014). Understanding robust corrections in structural equation modeling. Structural Equation Modeling, 21(1), 149–160.
Savalei, V. (2018). A comparison of several approaches for controlling measurement error in small samples. Psychological Methods, 24(3), 352–370.
Savalei, V. (2021). Improving fit indices in structural equation modeling with categorical data. Multivariate Behavioral Research, 56(3), 390–407.
Savalei, V., & Rhemtulla, M. (2012). On obtaining estimates of the fraction of missing information from full information maximum likelihood. Structural Equation Modeling, 19(3), 477–494.
Savalei, V., & Rhemtulla, M. (2013). The performance of robust test statistics with categorical data. British Journal of Mathematical and Statistical Psychology, 66(2), 201–223.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8(1), 3–15.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
Schaffner, K. F. (1969). Correspondence rules. Philosophy of Science, 36(3), 280–290.
Schamberger, T., Schuberth, F., Henseler, J., & Dijkstra, T. K. (2020). Robust partial least squares path modeling. Behaviormetrika, 47, 307–334.
Schmid, J., & Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1), 53–61.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8(4), 350–353.
Schmitt, N., & Bedeian, A. G. (1982). A comparison of LISREL and two-stage least squares analysis of a hypothesized life–job satisfaction reciprocal relationship. Journal of Applied Psychology, 67(6), 806–817.
Schmittmann, V. D., Cramer, A. O. J., Waldorp, L. J., Epskamp, S., Kievit, R. A., & Borsboom, D. (2013). Deconstructing the construct: A network perspective on psychological phenomena. New Ideas in Psychology, 31(1), 43–53.
Schneeweiss, S., Setoguchi, S., Brookhart, A., Dormuth, C., & Wang, P. S. (2007). Risk of death associated with the use of conventional versus atypical antipsychotic drugs among elderly patients. Canadian Medical Association Journal, 176(5), 627–632.
Schneider, E. B. (2020). Collider bias in economic history research. Explorations in Economic History, 78, Article 101356.
Schreiber, J. B. (2008). Core reporting practices in structural equation modeling. Research in Social and Administrative Pharmacy, 4(2), 83–97.
Schreiber, J. B. (2017). Update to core reporting practices in structural equation modeling. Research in Social and Administrative Pharmacy, 13(3), 634–643.
Schuberth, F. (2021). The Henseler–Ogasawara specification of composites in structural equation modeling: A tutorial. Psychological Methods. Advance online publication.
Schuberth, F., Henseler, J., & Dijkstra, T. K. (2018). Confirmatory composite analysis. Frontiers in Psychology, 9, Article 2541.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Schweizer, K., Troche, S. J., & DiStefano, C. (2019). Scaling the variance of a latent variable while assuring constancy of the model. Frontiers in Psychology, 10, Article 887.
Seixas, A. A., Vallon, J., Barnes-Grant, A., Butler, M., Langford, A. T., Grandner, M. A., Schneeberger, A. R., Hutchinson, J., Zizi, F., & Jean-Louis, G. (2018). Mediating effects of body mass index, physical activity, and emotional distress on the relationship between short sleep and cardiovascular disease. Medicine, 97(37), Article e11939.
Selig, J. P., & Preacher, K. J. (2009). Mediation models for longitudinal data in developmental research. Research in Human Development, 6(2–3), 144–164.
Shah, R., & Goldstein, S. M. (2006). Use of structural equation modeling in operations management research: Looking back and forward. Journal of Operations Management, 24(2), 148–169.
Sharma, S., Mukherjee, S., Kumar, A., & Dillon, W. R. (2005). A simulation study to investigate the use of cutoff values for assessing model fit in covariance structure models. Journal of Business Research, 58(7), 935–943.
Sheather, S. J. (2009). A modern approach to regression with R. Springer.
Shen, B.-J., & Takeuchi, D. T. (2001). A structural model of acculturation and mental health status among Chinese Americans. American Journal of Community Psychology, 29(3), 387–418.
Shi, D., Lee, T., & Maydeu-Olivares, A. (2019). Understanding the model size effect on SEM fit indices. Educational and Psychological Measurement, 79(2), 310–334.
Shi, D., Maydeu-Olivares, A., & Rosseel, Y. (2020). Assessing fit in ordinal factor analysis models: SRMR vs. RMSEA. Structural Equation Modeling, 27(1), 1–15.
Shipley, B. (2000). A new inferential test for path models based on directed acyclic graphs. Structural Equation Modeling, 7(2), 206–218.
Shipley, B. (2003). Testing recursive path models with correlated errors using d-separation. Structural Equation Modeling, 10(2), 214–221.
Shipley, B. (2009). Confirmatory path analysis in a generalized multilevel context. Ecology, 90(2), 363–368.
Shipley, B. (2017). CauseAndCorrelation: Functions for path analysis and SEM (R package 0.1). http://github.com/BillShipley/CauseAndCorrelation
Shipley, B., & Douma, J. C. (2020). Generalized AIC and chi-squared statistics for path models consistent with directed acyclic graphs. Ecology, 101(3), Article e02960.
Shipley, B., & Douma, J. C. (2021). Testing piecewise structural equations models in the presence of latent variables and including correlated errors. Structural Equation Modeling, 28(4), 582–589.
Shrier, I., & Platt, R. W. (2008). Reducing bias through directed acyclic graphs. BMC Medical Research Methodology, 8(1), Article 70.
Silvia, E. S. M., & MacCallum, R. C. (1988). Some factors affecting the success of specification searches in covariance structure modeling. Multivariate Behavioral Research, 23(3), 297–326.
Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (2019). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–566.
Smid, S. C., & Rosseel, Y. (2020). SEM with small samples: Two-step modeling and factor score regression versus Bayesian estimation with informative priors. In R. van de Schoot & M. Miočević (Eds.), Small sample size solutions: A guide for applied researchers (pp. 239–254). Routledge.
Sobel, M. E. (1982). Asymptotic intervals for indirect effects in structural equations models. Sociological Methodology, 13, 290–312.
Sörbom, D. (2001). Karl Jöreskog and LISREL: A personal story. In R. Cudeck, S. Du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future. A Festschrift in honor of Karl Jöreskog (pp. 1–10). Scientific Software International.
Spearman, C. (1904). General intelligence, objectively determined and measured. American Journal of Psychology, 15(2), 201–293.
Spirtes, P. (1995). Directed cyclic graphical representations of feedback models. In P. Besnard & S. Hanks (Eds.), Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 491–498). Morgan Kaufmann.
Spirtes, P., Scheines, R., Ramsey, J., & Glymour, C. (2022). TETRAD (Version 7.1.0) [Computer software]. https://github.com/cmu-phil/tetrad
Spruill, J., & Beck, B. (1986). Relationship between the WAIS-R and Wide Range Achievement Test-Revised. Educational and Psychological Measurement, 46(4), 1037–1040.
Stanovich, K. E. (1986). Matthew effects in reading: Some consequences of individual differences in the acquisition of literacy. Reading Research Quarterly, 21(4), 360–407.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91(6), 1292–1306.
StataCorp LLC (1985–2021). Stata structural equation modeling: Release 17. Stata Press.
Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78–107.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25(2), 173–180.
Steiger, J. H. (2000). Point estimation, hypothesis testing, and interval estimation: Some comments and a reply to Hayduk and Glaser. Structural Equation Modeling, 7(2), 164–182.
Steiger, J. H. (2001). Driving fast in reverse: The relationship between software development, theory, and education in structural equation modeling. Journal of the American Statistical Association, 96(453), 331–338.
Steiger, J. H. (2002). When constraints interact: A caution about reference variables, identification constraints, and scale dependencies in structural equation modeling. Psychological Methods, 7(2), 210–227.
Steiger, J. H. (2007). Understanding the limitations of global fit assessment in structural equation modeling. Personality and Individual Differences, 42(5), 893–898.
Steiger, J. H., & Schönemann, P. H. (1978). A history of factor indeterminacy. In S. Shye (Ed.), Theory construction and data analysis in the behavioral sciences (pp. 136–178). Jossey-Bass.
Steinmetz, H. (2013). Analyzing observed composite differences across groups: Is partial measurement invariance enough? Methodology, 9(1), 1–12.
Sterba, S. K. (2014). Fitting nonlinear latent growth curve models with individually varying time points. Structural Equation Modeling, 21(4), 630–647.
Sterba, S. K., & Rights, J. D. (2023). Item parceling in SEM: A researcher degree-of-freedom ripe for opportunistic use. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 296–315). Guilford Press.
Stevens, G., & Featherman, D. L. (1981). A revised socioeconomic index of occupational status. Social Science Research, 10(4), 364–395.
Steyer, R., Geiser, C., & Loßnitzer, C. (2023). Latent state-trait models. In H. Cooper, M. Coutanche, L. McMullen, A. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA Handbook of research methods in psychology (2nd ed., Vol. 3, pp. 297–316). American Psychological Association.
Stoel, R. D., van den Wittenboer, G., & Hox, J. (2004). Including time-invariant covariates in the latent growth curve model. Structural Equation Modeling, 11(2), 155–167.
Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80(1), 99–103.
Stuart, E. A., Schmid, I., Nguyen, T., Sarker, E., Pittman, A., Benke, K., Rudolph, K., Badillo-Goicoechea, E., & Leoutsakos, J.-M. (2021). Assumptions not often assessed or satisfied in published mediation analyses in psychology and psychiatry. Epidemiologic Reviews, 43(1), 48–52.
Stürmer, T., Glynn, R. J., Rothman, K. J., Avorn, J., & Schneeweiss, S. (2007). Adjustments for unmeasured confounders in pharmacoepidemiologic database studies using external information. Medical Care, 45(Suppl. 2), S158–S165.
Sun, J. (2005). Assessing goodness of fit in confirmatory factor analysis. Measurement and Evaluation in Counseling and Development, 37(4), 240–256.
Svetina, D., Rutkowski, L., & Rutkowski, D. (2020). Multiple-group invariance with categorical outcomes using updated guidelines: An illustration using Mplus and the lavaan/semTools packages. Structural Equation Modeling, 27(1), 111–130.
Swanson, S. A., & Hernán, M. A. (2013). How to report instrumental variable analyses (suggestions welcome). Epidemiology, 24(3), 370–374.
Systat Software Inc. (2018). SYSTAT (Version 13.2) [Computer software]. https://systatsoftware.com/
Szucs, D., & Ioannidis, J. P. A. (2017). When null hypothesis significance testing is unsuitable for research: A reassessment. Frontiers in Human Neuroscience, 11, Article 390.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson.
Tabachnick, B. G., & Fidell, L. S. (2019). Using multivariate statistics (7th ed.). Pearson.
Takezawa, K. (2006). Introduction to nonparametric regression. Wiley.
Tan, Q., Zou, J., & Kong, F. (2021). Longitudinal and gender measurement invariance of the gratitude questionnaire in Chinese adolescents. Psychological Reports. Advance online publication.
Tanaka, J. S. (1993). Multifaceted conceptions of fit in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 10–39). Sage.
Tang, N., & Ju, Y. (2018). Statistical inference for nonignorable missing-data problems: A selective review. Statistical Theory and Related Fields, 2(2), 105–133.
Taris, T. W., & Kompier, M. A. J. (2014). Cause and effect: Optimizing the designs of longitudinal studies in occupational health psychology. Work & Stress, 28(1), 1–8.
Tate, C. U. (2015). On the overuse and misuse of mediation analysis: It may be a matter of timing. Basic and Applied Social Psychology, 37(4), 235–246.
Taylor, J. M. (2019). Overview and illustration of Bayesian confirmatory factor analysis with ordinal indicators. Practical Assessment, Research & Evaluation, 24(4), Article 4.
Teitcher, J. E., Bockting, W. O., Bauermeister, J. A., Hoefer, C. J., Miner, M. H., & Klitzman, R. L. (2015). Detecting, preventing, and responding to “fraudsters” in internet research: Ethics and tradeoffs. Journal of Law, Medicine & Ethics, 43(1), 116–133.
Textor, J., van der Zander, B., & Ankan, A. (2021). dagitty: Graphical analysis of structural causal models (R package 0.3-1). https://CRAN.R-project.org/package=dagitty
Textor, J., van der Zander, B., Gilthorpe, M. K., Liśkiewicz, M., & Ellison, G. T. H. (2016). Robust causal inference using directed acyclic graphs: The R package “dagitty.” International Journal of Epidemiology, 45(6), 1887–1894.
Thelwall, M., & Wilson, P. (2016). Does research with statistics have more impact? The citation rank advantage of structural equation modeling. Journal of the Association for Information Science and Technology, 67(5), 1233–1244.
Thoemmes, F. (2015). Reversing arrows in mediation models does not distinguish plausible models. Basic and Applied Social Psychology, 37(4), 226–234.
Thoemmes, F., MacKinnon, D. P., & Reiser, M. R. (2010). Power analysis for complex mediational designs using Monte Carlo methods. Structural Equation Modeling, 17(3), 510–534.
Thoemmes, F., & Rose, N. (2014). A cautious note on auxiliary variables that can increase bias in missing data problems. Multivariate Behavioral Research, 49(5), 443–459.
Thoemmes, F., & Rosseel, Y. (2018). Local fit evaluation of structural equation models using graphical criteria. Psychological Methods, 23(1), 27–41.
Thompson, G. C., Kim, R. S., Aloe, A. M., & Becker, B. J. (2017). Extracting the variance inflation factor and other multicollinearity diagnostics from typical regression results. Basic and Applied Social Psychology, 39(2), 81–90.
Thompson, Y. T., Song, H., Shi, D., & Liu, Z. (2021). It matters: Reference indicator selection in measurement invariance tests. Educational and Psychological Measurement, 81(1), 5–38.
TIBCO Statistica. (2022). TIBCO Statistica user’s guide: Version 14.0.1. https://docs.tibco.com/
Tikka, S. (2022). causaleffect: Deriving expressions of joint interventional distributions and transport formulas in causal models (R package 1.3.15). https://CRAN.R-project.org/package=causaleffect
Tikka, S., & Karvanen, J. (2017). Identifying causal effects with the R package causaleffect. Journal of Statistical Software, 76(1), 1–30.
Tobak, S. (2015). 10 behaviors of real leaders. Entrepreneur. https://www.entrepreneur.com/article/249205
Tomarken, A. J., & Waller, N. G. (2005). Structural equation modeling: Strengths, limitations, and misconceptions. Annual Review of Clinical Psychology, 1(1), 31–65.
Torres, M. (2020). Estimating controlled direct effects through marginal structural models. Political Science Research and Methods, 8(3), 391–408.
Trafimow, D. (2015). Introduction to the special issue on mediation analyses: What if planetary scientists used mediation analysis to infer causation? Basic and Applied Social Psychology, 37(4), 197–201.
Trafimow, D. (2021). The underappreciated effects of unreliability on multiple regression and mediation. Applied Finance and Accounting, 7(2), Article 5292.
Tsai, T.-L., Shau, W.-Y., & Hu, F.-C. (2006). Generalized path analysis and generalized simultaneous equations model for recursive systems with responses of mixed types. Structural Equation Modeling, 13(2), 229–251.
Tu, Y.-K. (2009). Commentary: Is structural equation modelling a step forward for epidemiologists? International Journal of Epidemiology, 38(2), 549–551.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38(1), 1–10.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Tunca, B. (2019). Consumer brand engagement in social media: A pre-registered replication. Journal of Empirical Generalisations in Marketing Science, 19(1), 1–20.
Uddin, M. J., Groenwold, R. H. H., Ali, M. S., de Boer, A., Roes, K. C., Chowdhury, M. A., & Klungel, O. H. (2016a). Methods to control for unmeasured confounding in pharmacoepidemiology: An overview. International Journal of Clinical Pharmacy, 38(3), 714–723.
Uddin, M. J., Groenwold, R. H. H., de Boer, A., Gardarsdottir, H., Martin, E., Candore, G., Belitser, S. V., Hoes, A. W., Roes, K. C. B., & Klungel, O. H. (2016b). Instrumental variables analysis using multiple databases: An example of antidepressant use and risk of hip fracture. Pharmacoepidemiology and Drug Safety, 25(S1), 122–131.
Urbina, S. (2014). Essentials of psychological testing (2nd ed.). Wiley.
Vacha-Haase, T., & Thompson, B. (2011). Score reliability: A retrospective look back at 12 years of reliability generalization. Measurement and Evaluation in Counseling and Development, 44(3), 159–168.
van Buuren, S. (2018). Flexible imputation of missing data (2nd ed.). CRC Press.
van de Schoot, R., Kluytmans, A., Tummers, L., Lugtig, P., Hox, J., & Muthén, B. (2013). Facing off with Scylla and Charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Frontiers in Psychology, 4, Article 770.
van de Schoot, R., & Miočević, M. (Eds.). (2020). Small sample size solutions: A guide for applied researchers. Routledge.
van de Schoot, R., Schmidt, P., & De Beuckelaer, A. (Eds.). (2015). Measurement invariance. Frontiers Media.
van der Zander, B., Textor, J., & Liśkiewicz, M. (2015, July 25–31). Efficiently finding conditional instruments for causal inference. In Q. Yang & M. Wooldridge (Eds.), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence. https://www.ijcai.org/Proceedings/15/Papers/457.pdf
van Ginkel, J. R., Linting, M., Rippe, R. C. A., & van der Voort, A. (2020). Rebutting existing misconceptions about multiple imputation as a method for handling missing data. Journal of Personality Assessment, 102(3), 297–308.
van Prooijen, J.-W., & van der Kloot, W. A. (2001). Confirmatory analysis of exploratively obtained factor structures. Educational and Psychological Measurement, 61(5), 777–792.
Van Ryzin, M. J., & Nowicka, P. (2013). Direct and indirect effects of a family-based intervention in early adolescence on parent–youth relationship quality, late adolescent health, and early adult obesity. Journal of Family Psychology, 27(1), 106–116.
van Veelen, R., Derks, B., & Endedijk, M. D. (2019). Double trouble: How being outnumbered and negatively stereotyped threatens career outcomes of women in STEM. Frontiers in Psychology, 10, Article 150.
van Widenfelt, B. M., Treffers, P. D. A., de Beurs, E., Siebelink, B. M., & Koudijs, E. (2005). Translation and cross-cultural adaptation of assessment instruments used in psychological research with children and families. Clinical Child and Family Psychology Review, 8(2), 135–147.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.
VanderWeele, T. J. (2015). Explanation in causal inference: Methods for mediation and interaction. Oxford University Press.
VanderWeele, T. J. (2019). Principles of confounder selection. European Journal of Epidemiology, 34(3), 211–219.
Vansteelandt, S., & Daniel, R. M. (2017). Interventional effects for mediation analysis with multiple mediators. Epidemiology, 28(2), 258–265.
Verdam, M. G. E., Oort, F. J., & Sprangers, M. A. G. (2017). Structural equation modeling–based effect-size indices were used to evaluate and interpret the impact of response shift effects. Journal of Clinical Epidemiology, 85, 37–44.
Vernon, T., & Eysenck, S. B. G. (Eds.). (2007). Structural equation modeling [Special issue]. Personality and Individual Differences, 42(5).
Viera, A. L. (2011). Interactive LISREL in practice: Getting started with a SIMPLIS approach. Springer.
Vo, T.-T., Superchi, C., Boutron, I., & Vansteelandt, S. (2020). The conduct and reporting of mediation analysis in recently published randomized controlled trials: Results from a methodological systematic review. Journal of Clinical Epidemiology, 117, 78–88.
Voelkle, M. C. (2008). Reconsidering the use of autoregressive latent trajectory (ALT) model. Multivariate Behavioral Research, 43(4), 564–591.
von Oertzen, T., & Brick, T. R. (2014). Efficient Hessian computation using sparse matrix derivatives in RAM notation. Behavior Research Methods, 46(2), 385–395.
von Oertzen, T., Brandmaier, A. M., & Tsang, S. (2015). Structural equation modeling with Ωnyx. Structural Equation Modeling, 22(1), 148–161.
Wagner, J. (2010). The fraction of missing information as a tool for monitoring the quality of survey data. Public Opinion Quarterly, 74(2), 223–243.
Waller, N. G. (2008). Fungible weights in multiple regression. Psychometrika, 73(4), 691–703.
Wang, L., & Finn, A. (2016). Using vanishing tetrad test to examine multifaceted causal directionality. Journal of Marketing Analytics, 4(1), 51–59.
Wang, P., Mao, N., Liu, C., Geng, J., Wei, X., Wang, W., Zeng, P., & Li, B. (2022). Gender differences in the relationships between parental phubbing and adolescents’ depressive symptoms: The mediating role of parent-adolescent communication. Journal of Affective Disorders, 302, 194–203.
Wang, Y., Lu, N., & Miao, H. (2016). Structural identifiability of cyclic graphical models of biological networks with latent variables. BMC Systems Biology, 10(1), Article 41.
Wang, Y. A., & Rhemtulla, M. (2021). Power analysis for parameter estimation in structural equation modeling: A discussion and tutorial. Advances in Methods and Practices in Psychological Science, 4(1), 1–17.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” American Statistician, 73(Suppl. 1), 1–19.
Wen, Z., & Fan, X. (2015). Monotonicity of effect sizes: Questioning kappa-squared as mediation effect size measure. Psychological Methods, 20(2), 193–203.
West, S. G., Taylor, A. B., & Wu, W. (2012). Model fit and model selection in structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 209–231). Guilford Press.
West, S. G., Wu, W., McNeish, D., & Savord, A. (2023). Model fit in structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 184–205). Guilford Press.
Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLoS ONE, 11(3), Article 0152719.
Westland, C. J. (2010). Lower bounds on sample size in structural equation modeling. Electronic Commerce Research and Applications, 9(6), 476–487.
Wetzel, E., Lüdtke, O., Zettler, I., & Böhnke, J. R. (2016). The stability of extreme response style and acquiescence over 8 years. Assessment, 23(3), 279–291.
Whitaker, B. G., & McKinney, J. L. (2007). Assessing the measurement invariance of latent job satisfaction ratings across survey administration modes for respondent subgroups: A MIMIC modeling approach. Behavior Research Methods, 39(3), 502–509.
Wickrama, K. K. A. S., Lee, T. K., O’Neal, C. W., & Lorenz, F. O. (2022). Higher-order growth curves and mixture modeling with Mplus: A practical guide (2nd ed.). Routledge.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9(1), 1–26.
Widaman, K. F., & Olivera-Aguilar, M. (2023). Investigating measurement invariance using confirmatory factor analysis. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 367–384). Guilford Press.
Wilcox, J. B., Howell, R. D., & Breivik, E. (2008). Questions about formative measurement. Journal of Business Research, 61(12), 1219–1228.
Willaby, H. W., Costa, D. S. J., Burns, B. D., MacCann, C., & Roberts, R. D. (2015). Testing complex models with small sample sizes: A historical overview and empirical demonstration of what partial least squares (PLS) can offer differential psychology. Personality and Individual Differences, 84, 73–78.
Willett, J. B., & Sayer, A. G. (1994). Using covariance structure analysis to detect correlates and predictors of individual change over time. Psychological Bulletin, 116(2), 363–381.
Williams, L. J. (2012). Equivalent models: Concepts, problems, alternatives. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 247–260). Guilford Press.
Williams, L. J., & McGonagle, A. K. (2016). Four research designs and a comprehensive analysis strategy for investigating common method variance with self-report measures using latent variables. Journal of Business Psychology, 31(3), 339–359.
Williams, M. N., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research, and Evaluation, 18, Article 11.
Williams, N. (2021). Toolbox for structural equation modelling (SEM) (Version 1.1) [Computer software]. https://www.mathworks.com/matlabcentral/fileexchange/60013-toolbox-for-structural-equation-modelling-sem
Williams, T. H., McIntosh, D. E., Dixon, F., Newton, J. H., & Youman, E. (2010). A confirmatory factor analysis of the Stanford-Binet Intelligence Scales, fifth edition, with a high-achieving sample. Psychology in the Schools, 47(10), 1071–1083.
Wilson, W. J. (1987). The truly disadvantaged: The inner city, the underclass, and public policy. University of Chicago Press.
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58–79.
Wold, H. (1982). Soft modeling: The basic design and some extensions. In K. G. Jöreskog & H. Wold (Eds.), Systems under indirect observations: Part II (pp. 1–54). North-Holland.
Wolf, E. J., Harrington, K. M., Clark, S. L., & Miller, M. W. (2013). Sample size requirements for structural equation models: An evaluation of power, bias, and solution propriety. Educational and Psychological Measurement, 73(6), 913–934.
Wolf, M. G., & McNeish, D. (2020). Dynamic Model Fit (R Shiny application version 1.1.0). https://www.dynamicfit.app/
Wolfle, L. M. (2003). The introduction of path analysis to the social sciences, and some emergent themes: An annotated bibliography. Structural Equation Modeling, 10(1), 1–34.
Wolfram Research, Inc. (2022). Mathematica (Version 13.0) [Computer software]. https://www.wolfram.com/
Wong, C.-S., & Law, K. S. (1999). Testing reciprocal relations by nonrecursive structural equation models using cross-sectional data. Organizational Research Methods, 2(1), 69–87.
Worland, J., Weeks, G. G., Janes, C. L., & Stock, B. D. (1984). Intelligence, classroom behavior, and academic achievement in children at high and low risk for psychopathology: A structural equation analysis. Journal of Abnormal Child Psychology, 12(3), 437–454.
Wothke, W. (1993). Nonpositive definite matrices in structural equation modeling. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 256–293). Sage.
Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5(3), 161–215.
Wu, A. D., Li, Z., & Zumbo, B. D. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research, & Evaluation, 12, Article 3.
Wu, H., & Estabrook, R. (2016). Identification of confirmatory factor analysis models of different levels of invariance for ordered categorical outcomes. Psychometrika, 81(4), 1014–1045.
Wu, W., & Lang, K. M. (2016). Proportionality assumption in latent basis curve models: A cautionary note. Structural Equation Modeling, 23(1), 140–154.
Xia, Y., & Yang, Y. (2019). RMSEA, CFI, and TLI in structural equation modeling with ordered categorical data: The story they tell depends on the estimation methods. Behavior Research Methods, 51(1), 409–428.
Yuan, K.-H. (2005). Fit indices versus test statistics. Multivariate Behavioral Research, 40(1), 115–148.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30(1), 165–200.
Yuan, K.-H., & Chan, W. (2016). Measurement invariance via multigroup SEM: Issues and solutions with chi-square-difference tests. Psychological Methods, 21(3), 405–426.
Yuan, K.-H., Chan, W., Marcoulides, G. A., & Bentler, P. M. (2016). Assessing structural equation models by equivalence testing with adjusted fit indexes. Structural Equation Modeling, 23(3), 319–330.
Yuan, K.-H., Hayashi, K., & Bentler, P. (2007). Normal theory likelihood ratio statistic for mean and covariance structure analysis under alternative hypotheses. Journal of Multivariate Analysis, 98(6), 1262–1282.
Yung, Y. F., Thissen, D., & McLeod, L. D. (1999). On the relationship between the higher-order factor model and the hierarchical factor model. Psychometrika, 64(2), 113–128.
Zhang, M. F., Dawson, J., & Kline, R. B. (2021). Evaluating the use of covariance-based structural equation modelling with reflective measurement in organisational and management research: A review and recommendations for best practice. British Journal of Management, 32(2), 257–272.
Zhang, Z., Mai, Y., & Yang, M. (2023). WebPower: Basic and advanced statistical power analysis (R package 0.8.7). https://CRAN.R-project.org/package=WebPower
Zhang, Z., & Wang, L. (2013). Methods for mediation analysis with missing data. Psychometrika, 78(1), 154–184.
Zhang, Z., & Wang, L. (2022). bmem: Mediation analysis with missing data using bootstrap (R package 2.0). https://CRAN.R-project.org/package=bmem
Zhang, Z., & Yuan, K.-H. (2018). Practical statistical power analysis using WebPower and R. ISDA Press.
Zhao, X., Lynch, J. G., Jr., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of Consumer Research, 37(2), 197–206.
Zheng, X., Yang, J. S., & Harring, J. R. (2022). Latent growth modeling with categorical response data: A methodological investigation of model parameterization, estimation, and missing data. Structural Equation Modeling, 29(2), 182–206.
Ziegler, M., & Hagemann, D. (2015). Testing the unidimensionality of items. European Journal of Psychological Assessment, 31(4), 231–237.
Ziliak, S., & McCloskey, D. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
Zumbo, B. D., Liu, Y., Wu, A. D., Shear, B. R., Olvera Astivia, O. L., & Ark, T. K. (2015). A methodology for Zumbo’s third generation DIF analyses and the ecology of item responding. Language Assessment Quarterly, 12(1), 136–151.
Author Index

Abelson, R. P., 22
Abt, M., 133
Abu-Bader, S. H., 56
Acharya, A., 366
Acock, A. A., 75
Aguinis, H., 116, 162
Aiken, L. S., 19, 20
Akaike, H., 190
Albers, C. J., 20
Allison, P. D., 64, 65, 139
Altman, D. G., 204
Alvarado, J. M., 259
Amador, X. F., 41, 278
Amemiya, Y., 232
Amrhein, A., 22
Ananth, C. V., 350
Andersen, H. K., 72
Anderson, J. C., 170, 264, 265, 268, 283, 309
Anderson, S. F., 175
Andrews, R. M., 365
Angrist, J. D., 83, 84, 99
Antonakis, J., 20, 105, 225, 258, 350, 353
Appelbaum, M., 25, 37, 39, 40, 42, 44, 45, 172, 414, 415, 420
Arbuckle, J. L., 68, 74, 104, 138, 139, 150, 327
Asparouhov, T., 138, 161, 186, 200, 230, 231, 329, 409, 411, 412
Audigier, V., 65
Austin, J. T., 16, 17

Bagozzi, R. P., 12, 218, 239
Bailey, M., 253
Baiocchi, M., 82, 83
Balla, J., 169
Bandalos, D. L., 139, 176
Baron, R., 125, 126, 130
Barrett, P., 16
Bartholomew, D. J., 12
Bartle-Haring, S., 246, 247, 311
Bartlett, M. S., 316
Bauer, D. J., 69, 256, 403
Bauldry, S., 220, 223, 248, 259
Baumgartner, H., 401
Beauducel, A., 170
Beaujean, A. A., 16, 72
Beck, B., 270
Becker, J.-M., 299
Bedeian, A. G., 109
Benitez, J., 414, 415
Bentler, P. M., 12, 74, 138, 142, 161, 164, 165, 168, 169, 170, 171, 174, 182, 184, 186, 200, 225, 247, 267, 269, 283, 327, 346
Beran, R., 286
Berkson, J., 87
Bernstein, I. H., 320
Berry, W. D., 337, 338
Bishop, J., 388
Black, A. C., 424
Blalock, H. M., 11
Block, J., 206
Blozis, S. A., 385
Blum, M. G. B., 350, 366
Blunch, N. J., 74
Boker, S. M., 71
Bolin, J. E., 75
Bollen, K. A., 3, 10, 12, 14, 18, 25, 26, 29, 34, 41, 42, 82, 83, 103, 105, 107, 110, 117, 131, 133, 136, 137, 144, 145, 188, 220, 221, 223, 225, 228, 248, 249, 252, 264, 269, 276, 293, 309, 312, 313, 338, 389
Bonett, D. G., 164, 169
Bono, R., 61
Boomsma, A., 37, 309
Bornstein, M. H., 394, 395, 399, 401, 402, 412
Borsboom, D., 227, 228
Bound, J., 84
Box, G. E. P., 61, 163, 179
Boyle, M. H., 385
Brailean, A., 386
Brandmaier, A. M., 188
Brandt, J. S., 350
Breckler, S. J., 198
Breitsohl, H., 9
Breivik, E., 168
Brett, J. M., 127
Brick, T. R., 100
Brito, C., 110
Broc, G., 72
Brønnick, K., 356, 357
Brosseau-Liard, P. É., 166, 168, 169, 330
Brown, T. A., 230, 231, 243, 244, 251, 252, 255, 259, 276, 410, 411
Browne, M. W., 140, 141, 167, 170, 291, 293
Bryant, F. B., 186
Brydges, C. R., 276
Bryk, A. S., 372
Bullock, J. G., 350, 368, 371
Burt, R. S., 268, 310
Byrne, B. M., 74, 372, 398, 401

Cain, M. K., 60
Calin-Jageman, R., 22, 162
Campbell, D. T., 252
Campbell, I., 194
Carvacho, G., 86
Castanho Silva, B., 18
Chakraborty, S., 317
Chalak, K., 95
Chambers, C. D., 41
Chang, W., 224
Chatterjee, S., 56
Chen, B., 144
Chen, F., 134, 167, 168, 399
Chen, F. F., 257, 258
Chen, J., 345
Chen, J. W., 69
Chen, Q., 130
Chen, Y., 249
Cheng, C., 365
Cheong, J., 212, 213
Cheung, G. W., 159, 399, 402
Chin, W. W., 224
Cho, E., 242, 255, 259
Cho, Y., 385
Choi, I., 415
Choi, J., 408
Chou, C. P., 184, 185, 187, 267
Chou, C.-C., 280
Chou, J.-S., 188
Choudhary, A., 11
Cieciuch, J., 394
Cliff, N., 218
Clifton, A., 227
Coffman, D. L., 135, 139, 159, 199
Cohen, J., 29, 59, 113, 136
Cole, D. A., 104, 244, 315, 356, 357
Cole, S. R., 87
Collier, J. E., 74
Collins, L. M., 139
Combrisson, E., 188
Comeau, J., 385
Comrey, A. L., 240
Cooper, H., 39, 45
Cooper, S. R., 18
Cornoni-Huntley, J., 325
Cortina, J. M., 18, 39
Costa, P., 376
Cox, D. R., 61
Crawford, J. R., 186
Crombie, G., 372
Crosswell, A. D., 294
Crowne, D. P., 258
Cudeck, R., 141, 167, 170, 318
Culkin, J. M., 67
Cumming, G., 22, 162
Curran, P. J., 69, 158, 167, 168, 176, 322, 372, 373, 376, 386, 389
Curran, T., 361, 362

Daly, A., 113
Daniel, R. M., 366, 367, 371
Davidov, E., 411, 412
Davidson, L., 219
Davies, N. M., 84
Davis, W. R., 293
Demirtas, H., 65
Dempster, A. P., 65
Deng, L., 316, 318
Depaoli, S., 18, 172
Derogatis, L., 193, 290
Desai, R. J., 83
Deshon, R. P., 397
Devlieger, I., 310
Diamantopoulos, A., 225, 226, 228
Dickman, K., 72, 120
Didelez, V., 365
Dienes, Z., 192
Diggle, P. D., 155
Dijkstra, T. K., 286, 294, 299
Ding, P., 88
DiStefano, C., 38, 60, 137, 140, 141, 151, 160, 323, 324, 330
Dolan, C. V., 153
Dombrowski, S. C., 38
Dong, Y., 38, 54, 62, 64, 65
Douma, J. C., 118
Drasgow, F., 170, 393
Dumas, D., 38
Duncan, S. C., 372, 382, 384
Duncan, T. E., 372, 382, 384
Dunn, G., 74
Dunn, K. J., 258
Dunn, W. M., III, 34
Dwivedi, A. K., 129
Dziak, J. J., 192, 199
IndxKline5E.indd 472 3/22/2023 3:12:10 PM


Author Index 473

Edwards, J. R., 116, 203, 204, 225, 358, 359 Gill, C. A., 59
Edwards, M. C., 323, 329 Glaser, D. N., 269
Efron, B., 129 Glymour, C., 252
Eich, E., 420 Glymour, M. M., 90
Eid, M., 253, 256 Goldberger A. S., 222
Einstein, A., 219 Golden, R. M., 190
Eisenhauer, J. G., 148 Goldstein, S. M., 16, 17, 39
Elwert, F., 82, 85, 88, 90, 99, 426 Gomer, B., 164
Enders, C. K., 49, 52, 54, 65, 66, 132, 137, 138, 139, 155, 373 Gonzalez, O., 350, 365, 368
Epskamp, S., 70 Gonzalez, R., 235
Epstein, D., 372 Goodboy, A. K., 415
Erceg-Hurn, D. M., 21 Gottfredson, N. C., 155
Ernst, A. F., 20 Gottfredson, R. K., 116
Esposito Vinzi, V., 299 Grace, J. B., 14, 29, 41, 42, 220, 221
Estabrook, R., 390, 409, 410, 412 Graham, J. M., 242
Eysenck, S. B. G., 39, 42 Graham, J. W., 49, 65, 66, 135, 139
Grayson, D., 253
Fabrigar, L. R., 231 Green, D. P., 350, 368, 371
Fairchild, A. J., 358 Green, S. B., 199, 269
Falk, C. F., 188 Greene, W. H., 154
Falk, R. F., 286 Greenland, S., 168
Falke, A., 164 Gregorich, S. E., 395, 396, 397, 398
Fan, W., 267 Greiff, S., 42, 170, 179
Fan, X., 169, 170, 354 Grewal, R., 160
Fan, Y., 339, 345, 415 Grice, J. W., 231
Featherman, D. L., 193 Griffith, G. J., 87
Feng, Y., 176 Grimm, K. J., 372, 388, 390
Fewell, Z., 81 Grömping, U., 26
Fidell, L. S., 49, 56, 119 Gudergan, S. P., 286
Fiedler, K., 108 Guliyev, H., 106
Finch, W. H., 75, 394 Guo, Y., 402, 403, 405, 408
Finkel, S. E., 332 Guttman, L., 231
Finn, A., 252
Finney, S. J., 60, 137, 140, 141, 151, 160, 323, 324, 330 Hadaya, P., 310
Fisher, F., 249, 312 Hagemann, D., 244
Fisher, R. A., 119 Hair, J. F., 13, 17, 18, 222, 225, 239, 285, 288, 289, 299, 414, 415
Fiske, D. W., 252 Haller, H., 21
Fitzpatrick, D. C., 230 Halpin, P. F., 218
Flake, J. K., 230, 231, 239 Hamann, J. D., 71
Flora, D. B., 230, 231, 239, 258, 322 Hammack-Brown, B., 402
Fox, J., 41, 71, 72, 129 Hancock, G. R., 134, 137, 138, 160, 168, 175, 176, 182, 190, 194,
Fraley, R. C., 193 201, 267, 330, 415
Freeman, M. J., 175 Harbaugh, A. G., 396
French, B. F., 176, 394 Hardt, J., 54
French, D. P., 225 Harring, J. R., 388
Fritz, M. S., 107 Hastie, T., 317
Hatcher, L., 75
Gagne, P., 134 Hau, K.T., 246, 309
Gal, D., 21 Hausman, J. A., 84
Galimard, J.E., 66 Hayduk, L. A., 8, 14, 37, 42, 156, 157, 158, 159, 162, 163, 164, 179,
Gana, K., 72 180, 225, 226, 232, 269, 276, 344, 345, 346, 395, 398, 399
Garn, A. C., 358 Hayes, A. F., 128, 129, 203, 350, 358, 360
Geiser, C., 75, 256 Heath, M. T., 328
Gelman, A., 185, 266 Heck, R. H., 75
Gerbing, D. W., 170, 264, 265, 268, 283, 309, 426 Heene, M., 42, 160, 170, 179
Geyer, C. J., 65 Heisey, D. M., 175
Ghisletta, P., 148 Hejazi, N. S., 368
Ghosh, M., 317 Henly, S. J., 318
Giffin, D., 235 Henningsen, A., 71
Gigerenzer, G., 21 Henseler, J., 13, 219, 221, 222, 225, 284, 285, 286, 288, 296, 297,
Gignac, G. E., 257 298, 299, 300, 305, 394

IndxKline5E.indd 473 3/22/2023 3:12:10 PM


474 Author Index

Hernán, M. A., 82, 84, 98 Karvanen, J., 96


Hershberger, S. L., 194, 196, 199, 222, 249 Kashy, D. A., 253
Herzog, W., 160 Kaufman, A. S., 236, 237, 238
Hill, A. P., 362 Kaufman, J. S., 363
Hinson, V. K., 193 Kaufman, N. L., 236, 237, 238
Hipp, J. R., 345 Keesling, J.W., 12
Ho, M. H. R., 37, 104, 112, 268 Keith, T. Z., 237, 245
Hoekstra, R., 22 Kelley, K., 129, 164, 168, 176, 353, 354, 355
Hoenig, J. M., 175 Kenny, D. A., 14, 25, 34, 86, 104, 109, 111, 125, 126, 130, 135,
Hollebeek, L. D., 38 144, 160, 168, 169, 234, 238, 244, 249, 250, 252, 253, 260, 261,
Hong, S., 129, 358 336, 399
Howards, P. P., 88 Kenward, M. G., 153, 155
Hoyle, R. H., 37, 39, 45, 107, 349 Kerr, N. L., 183
Hu, L.-T., 164, 169, 170, 171, 174 Kim, K. H., 175
Huang, W., 225 Kim, M., 379
Hubert, M., 57 Kim-Spoon, J., 374, 379, 392
Hubona, G. S., 288 Kline, R. B., 20, 22, 27, 31, 45, 53, 56, 107, 140, 156, 175, 204,
Huck, S. W., 28, 31 239, 245, 272, 284, 290, 349, 408, 415
Huh, J., 184, 185, 187 Kmetz, J. L., 20
Huisman, S. M. H., 414, 415 Knight, C. R., 86
Hulin, C. L., 393 Kock, N., 310
Hulme, C., 251 Kolenikov, S., 248
Hung, J., 159 Kompier, M. A. J., 332
Hunsley, J., 26 Koziol, N. A., 320, 327, 329
Hurlbert, S. H., 24 Kraft, P., 416
Hwang, H., 285, 288 Krauss, S., 21
Hyman, H., 108 Krueger, A. B., 83, 84, 99
Kühnel, S., 8
Iacobucci, D., 350, 371
Igolkina, A. A., 76, 415 Lachowicz, M. J., 127, 354
Ing, M., 188 Lai, K., 168, 176, 182, 199, 200, 201, 202
Ioannidis, J. P. A., 20, 21 Lambdin, C., 22
Isherwood, J. C., 37, 39 Lambert, L. S., 203, 204, 358
Lance, C. E., 255, 401
Jaccard, J., 112 Lang, K. M., 65, 139, 151, 377, 378
Jackson, D. L., 16, 37 Lanza, S. T., 12, 199
Jacobucci, R., 188, 317 Latan, H., 285, 305
Jacoby, J., 112 Lautenschlager, G. J., 393
James, L. R., 127 Law, K. S., 332, 333
James, W., 414 Lawton, M. P., 193
Jarvis, C. B., 223 Lee, H., 349, 350, 368
Jerbi, K., 188 Lee, H. B., 240
Jermiin, L. S., 199 Lee, J., 80, 255
Jessor, R., 355 Lee, S., 194
Jessor, S. L., 355 Lee, S.-Y., 327
Jia, F., 49 Lefcheck, J. S., 71, 117, 118, 120, 130
Jiang, G., 316 Lei, M., 60
Jin, S., 132 Leiman, J. M., 258
Joanes, D. N., 59 Leite, W. L., 176, 188
John, L. K., 23 Lek, K., 411
John, R., 345 Levy, R., 182, 190, 194, 199, 201
Johnson, D. R., 49 Lewis, C., 169
Jöreskog, K. G., 12, 33, 67, 74, 139, 146, 148, 156, 162, 164, 165, Leys, C., 58
168, 184, 187, 204, 205, 222, 229, 231, 322 Li, C. H., 320, 325
Jorgensen, T. D., 71, 120, 139, 176, 237, 270, 300, 410 Li, L., 168
Jose, P. E., 350 Li, R., 199
Ju, Y., 155 Liang, X., 317
Jung, E., 402 Likert, R., 319
Lin, L. C., 192
Kaiser, H. F., 72, 120 Linting, M., 62
Kaplan, D., 3, 17, 135, 140, 188, 333, 343, 386 Little, R. J. A., 49, 53, 66

IndxKline5E.indd 474 3/22/2023 3:12:10 PM


Author Index 475

Little, T. D., 52, 65, 68, 69, 86, 108, 139, 151, 156, 163, 169, 231, Mehta, P. D., 386
236, 310, 332, 349, 357, 372, 377, 390, 391, 392, 394, 396, 397, Mellenbergh, G. J., 393
400, 401 Meredith, W., 372
Littvay, L., 232, 276 Merkle, E. C., 190, 192, 193, 194, 201
Liu, M., 138, 396 Meshcheryakov, G., 76
Liu, X., 55 Michell, J., 218
Liu, Y., 393 Mikis, D., 173
Llabre, M. M., 379 Milan, S., 34, 252
Lockwood, K. G., 294 Miles, J., 169
Loehlin, J. C., 16, 114 Miller, N. B., 286
Loeys, T., 371 Millsap, R. E., 8, 159, 268, 269, 393, 395, 398, 409, 410, 411
Loh, W. W., 309, 310, 350, 366, 367, 368, 370, 371 Miočević, M., 318
Lohmöller, J.B., 286 Mirosevich, V. M., 21
Loken, E., 185, 266 Mitchell, M. A., 356, 357
Lomax, R. G., 60 Moerkerke, B., 371
Lombardi, C. M., 24 Mohan, K., 52, 53
Lubke, G. H., 194 Molenaar, D., 394
Lúcio, P. S., 411 Molenaar, P. C. M., 153
Luo, S., 18 Molenberghs, G., 153
Lynam, D. R., 205, 206, 208, 209 Molina, K. M., 204
Lynch, J. G., 130 Mooijaart, A., 160
Lynch, K. G., 350 Moon, K.W., 129, 358
Morey, R. D., 22
MacCallum, R. C., 16, 17, 33, 36, 37, 44, 45, 175, 177, 187, 196, Morikawa, K., 155
204, 243, 291, 293, 369 Morin, A. J. S., 231
MacKenzie, S. B., 220, 223 Morrison, T. G., 415
MacKinnon, D. P., 14, 108, 127, 128, 349, 350, 358 Moshagen, M., 160
Madans, J. H., 325 Moss, T. P., 38
Mai, R., 171, 268 Mowbray, F. I., 56, 58
Mai, Y., 70 Mudge, J. F., 162
Majewska, J., 58 Mueller, R. O., 160, 330
Malhotra, N. K., 394 Mueller, R, O., 415
Manly, C. A., 62 Mulaik, S. A., 3, 132, 140, 145, 164, 167, 190, 192, 196, 229, 268,
Maraun, M. D., 218 269
Marchetti, G. M., 96 Munafò, M. R., 87
Marcoulides, G. A., 103, 159, 188, 194, 196, 199, 249 Murphy, S. A., 391
Marcoulides, K. M., 181, 188, 318 Murray, D., 21
Mardia, K. V., 59 Muthén, B., 329, 409, 411, 412
Markland, D., 159 Muthén, B. O., 12, 14, 74, 116, 133, 138, 153, 155, 161, 168, 172,
Markus, K. A., 225 186, 200, 211, 230, 231, 320, 328, 329, 372, 408, 409, 411, 412
Marlowe, D., 258 320, 372
Marquart-Pyatt, S., 345 Muthén, L. K., 74, 116, 133, 138, 153, 155, 161, 168, 172, 186, 211,
Marsh, H. W., 169, 170, 231, 246, 253, 309 328, 408
Mastrotheodoros, S., 410 Myers, N. D., 176
Matsunaga, M., 311
Mauro, R., 30 Naimi, A. I., 365
Maxwell, S. E., 356 Neale, M. C., 72, 328, 386
Maydeu-Olivares, A., 146, 147, 151, 154, 161, 163, 316, 317, 330, Nestler, S., 131
339 Nevitt, J., 137, 168
Mayo-Wilson, E., 45 Newsom, J. T., 233, 235, 332, 372, 386, 389, 390, 391, 392, 394
McArdle, J. J., 75, 100, 103, 148, 149, 372, 388, 389 Nezlek, 12
McCloskey, D., 21, 23 Nezu, A. M., 45
McCoach, D. B., 160, 424 Nguyen, T. Q., 349, 363, 368, 371
McCray, G., 258 Niemand, T., 171
McDonald, R. P., 37, 75, 100, 103, 104, 112, 149, 268 Niemiec, C. P., 362
McGonagle, A. K., 256 Nilsson, A., 85
McIntosh, C. N., 159 Nimon, K., 394
McKinney, J. L., 394 Nitzl, C., 224
McNeish, D., 171, 317, 373, 376, 377, 378 Noonan, R., 285, 298, 305
McShane, B. B., 21 Nosek, B. A., 33
Meade, A. W., 393, 399, 403 Nuijten, M. B., 21

IndxKline5E.indd 475 3/22/2023 3:12:10 PM


476 Author Index

Nunkoo, R., 37 Raffard, S., 280


Nunnally, J. C., 218 Raftery, A. E., 192
Nye, C. D., 170 Ram, N., 390
Rao, S. M., 45
Oakes, M., 21 Raudenbush, S. W., 372
Oberski, D. L., 315 Ray, S., 288
O’Boyle, E. H., 268 Raykov, T., 103, 159, 190, 196, 198, 241, 249, 346
O’Brien, R. M., 56, 244 Rebueno, M. C. D. R., 38
Ockey, G. J., 415 Reinartz, W. J., 225
O’Connell, A. A., 424 Reio, T. G, Jr., 394
Ogden, C. L., 369 Reise, S. P., 257, 258
Olaru, G., 188 Rensvold, R. B., 159, 399, 402
O’Laughlin, K. D., 350 Revelle, W., 120, 231
Oldenburg, R., 76 Reynolds, C. R., 393
Olivera–Aguilar, M., 398 Rhemtulla, M., 65, 139, 161, 176, 177, 220, 222, 223, 228, 319,
Olsson, U. H., 141, 167, 168 320, 324, 330
Ondé, D., 259 Rhoades, B. L., 12
Oppong, F. B., 59 Richardson, H. A., 258
O’Rourke, N., 75 Richiardi, L., 363, 365
Osborne, J. W., 57, 61, 107, 230 Richter, N. F., 37
Ou, L., 389 Riddles, M. K., 155
Rigdon, E. E., 1, 110, 217, 218, 219, 224, 226, 228, 230, 287, 297,
Panwar, M. S., 386 299, 300, 304, 305, 333, 334, 335, 337, 338, 339, 345, 346, 416
Park, H., 345 Rights, J. D., 310
Patrician, P. A., 65 Rindskopf, D., 246
Pavlov, G., 186 Ringle, C. M., 289, 305
Paxton, P., 337, 338, 345 Rioux, C., 310, 311
Pearl, J., 1, 4, 9, 10, 13, 14, 18, 52, 53, 72, 79, 86, 87, 89, 90, 93, Rippe, R. C. A., 62
94, 95, 96, 99, 105, 110, 125, 144, 194, 196, 198, 349, 360, 363, Robbins, B. G., 338, 339, 344
365, 416 Rockwood, N. J., 203, 358, 360
Pedhazur, E. J., 8 Rogosa, D. R., 9
Pek, J., 107, 349 Roid, G. H., 38
Penev, S., 196, 198 Romney, D. M., 190, 192, 193, 194
Peng, C. Y. J., 54, 62, 64, 65 Rönkkö, M., 225, 242, 255, 259
Peng, H. L., 127 Roos, J. M., 259
Perry, J., 339 Ropovik, I., 42, 171, 283
Peters, C. L. O., 139 Rose, N., 54
Petersen, M. L., 364 Rosseel, Y., 71, 103, 119, 132, 147, 153, 157, 186, 236, 237, 270,
Pett, M. A., 119 309, 310, 316, 317, 324, 330, 408
Petter, S., 416 Rossiter, J. R., 226
Phahladira, L., 44 Roth, D. L., 62, 63, 96, 111, 120, 122, 123, 125, 142, 143, 173, 177,
Phillips, L. W., 218 188
Pietrzak, R. H., 41, 278 Rothman, K. J., 168
Pilgrim, C. C., 205 Rousseeuw, P. J., 57
Pinter, J., 133 Rubin, D. B., 13, 14, 49, 54, 66
Pituch, K. A., 58 Rucker, D. D., 127, 204
Platt, R. W., 88 Rust, R. T., 190
Podsakoff, P. M., 253, 258 Rutkowski, D., 412
Pohl, S., 255, 256 Rutkowski, L., 399, 412
Porte, G., 38 Ryder, A. G., 396
Preacher, K. J., 104, 128, 164, 172, 192, 193, 194, 201, 204, 315, Ryu, E., 211, 212, 213
353, 354, 355, 356, 362, 387, 388
Price, B., 56 Sabatelli, R. M., 246, 311
Pritikin, J. N., 329 Salzberger, T., 226
Putnick, D. L., 394, 395, 399, 401, 402, 412 Samsonova, M. G., 415
Sargan, J. D., 249
Qiu, W., 176 Saris, W. E., 175
Sarstedt, M., 224, 305
Raborn, A., 188 Sass, D. A., 399
Rademaker, M. E., 286, 296, 288 Satorra, A., 160, 161, 175, 182, 186, 200, 315
Radloff, L. S., 325 Satterthwaite, F. E., 163

IndxKline5E.indd 476 3/22/2023 3:12:10 PM


Author Index 477

Sauerbrei, W., 204 Sun, J., 164


Sauvé, G., 41, 43, 44, 278, 279, 280, 281, 420 Sutton, S., 225
Savalei, V., 65, 137, 138, 139, 153, 161, 166, 169, 315, 316, 324, Suzuki, L. A., 393
325, 330 Svetina, D., 72, 399, 408, 409, 410, 412
Sayer, A. G., 372 Swain, 316
Schafer, J. L., 49, 66 Swanson, S. A., 82, 84, 98
Schaffner, K. F., 218, 219 Szucs, D., 20
Schamberger, T., 286
Schmelkin, L. P., 8 Tabachnick, B. G., 49, 56, 119
Schmid, I., 371 Takane, Y., 285
Schmid, J., 258 Takeuchi, D. T., 289, 290, 294, 300
Schmitt, N., 30, 109 Takezawa, K., 115
Schmittmann, V. D., 227 Tan, Q., 394
Schneeweiss, S., 83 Tanaka, J. S., 164
Schneider, E. B., 87, 88 Tang, N., 155
Schönemann, P. H., 230, 231 Taris, T. W., 332
Schreiber, J. B., 8, 37, 45, 414, 415 Tate, C. U., 108, 112, 127, 350, 353
Schuberth, F., 13, 222, 285, 286, 288, 300, 301, 316 Taylor, J. M., 328
Schwarz, 192 Teitcher, J. E., 394
Schweizer, K., 236 Teng, G., 320
Selig, J. P., 356 Textor, J., 34, 96, 198
Seixas, 159 Thelwall, M., 17, 38
Shah, R., 16, 17, 39 Thoemmes, F., 54, 119, 176, 196, 352, 353, 366
Shao, C., 345 Thomas, S. L., 75
Sharma, S., 164 Thompson, B., 24
Sheather, S. J., 188 Thompson, G. C., 56
Shen, B. J., 289, 290, 294, 300 Thompson, Y. T., 400
Shevlin, M., 169 Tikka, S., 96
Shi, D., 146, 147, 151, 160, 326, 327, 330 Ting, K.F., 252
Shipley, B., 90, 96, 117, 118, 130 Tisak, J., 372
Shirkey, G., 345 Tobak, S., 220
Shrier, I., 88 Tomarken, A. J., 424
Silvia, E. S. M., 187 Torres, M., 363
Simmons, J., 23 Trafimow, D., 349
Simms, L. J., 320 Tsai, T.L., 110
Simonton, K. L., 358 Tu, Y.K., 424
Sivo, S. A., 169, 170 Tucker, L. R., 169
Smid, S. C., 317 Tukey, J. W., 58
Snowling, M. J., 251 Tunca, B., 38
Sobel, M. E., 125
Sörbom, D., 12, 67, 74, 139, 146, 148, 156, 164, 165, 168, 187, 204, Uddin, M. J., 81, 82, 84
322 Urbina, S., 25, 30
Spearman, C., 11, 252
Spirtes, P., 110, 252 Vacha-Haase, T., 24
Spruill, J., 270 van Bork, R., 228
Srivastava, M. S., 286 van Buuren, S., 64
Stankov, L., 255 van de Schoot, R., 318, 411, 412
Stanovich, K. E., 109 van der Kloot, W. A., 230
Stark, S., 398 Van Der Voort, A., 62
Steenkamp, J.B. E. M., 401 van der Zander, B., 94
Steiger, J. H., 68, 75, 141, 158, 160, 165, 167, 230, 231, 236 van Ginkel, J. R., 62, 64, 66
Steinmetz, H., 401 van Prooijen, J.W., 230
Sterba, S. K., 310, 311, 386 Van Ryzin, M. J., 369
Stevens, G., 193 van Veelen, R., 212
Stevens, J. P., 58 van Widenfelt, B. M., 394
Steyer, R., 255, 256, 389 Vandenberg, R. J., 401
Stine, R. A., 137 VanderWeele, T. J., 81, 89, 99, 350, 364
Stoel, R. D., 372, 386 Vansteelandt, S., 366, 367, 371
Streiner, D. L., 30 Verdam, M. G. E., 38
Stuart, E. A., 350, 371 Vernon, T., 39, 42
Stürmer, T., 82, 83 Viera, A. L., 74

IndxKline5E.indd 477 3/22/2023 3:12:10 PM


478 Author Index

Vo, T. T., 368 Wittman, W., 170


Voelkle, M. C., 389 Wold, H., 12, 285, 286, 298
von Oertzen, T., 73, 100, 133 Wolf, E. J., 16
Wolf, M. G., 171
Wagner, J., 65 Wolfle, L. M., 18
Waller, N. G., 133, 424 Wong, C.S., 332, 333
Wang, L., 71, 129, 252, 333 Worland, J., 270
Wang, P., 204 Wothke, W., 50, 51, 139, 246, 309
Wang, Y. A., 176, 177 Wright, S., 1, 10, 11, 144, 343
Wasserstein, R. L., 24, 159 Wu, A. D., 397
Waters, E., 107 Wu, E. J. C., 74, 327
Webster, G. D., 227 Wu, H., 409, 410, 412
Wegener, D. T., 231 Wu, S. R., 345
Welch, W. J., 133 Wu, W., 49, 377, 378
Wells, R. S., 62
Wen, Z., 354 Xia, Y., 170, 326
West, S. G., 164, 169, 171, 172
Westfall, J., 26, 315 Yalcin, I., 232
Westland, C. J., 16 Yang, J.G., 188
Wetzel, E., 396 Yang, M., 318
Whitaker, B. G., 394 Yang, Y., 170, 326
White, H., 96 Yaremych, H. E., 194
White, W., 219 Yarkoni, T., 26, 315
Wickrama, K. K. A. S., 75 Yi, Y., 12, 218, 239
Widaman, K. F., 258, 412 Yoon, M., 402
Wilcox, J. B., 223, 225 Young, R., 49
Wiley, D., 12 Yuan, K.H., 161, 168, 169, 170, 176, 180, 181, 316, 399
Willaby, H. W., 310 Yung, Y. F., 258
Willett, J. B., 372
Williams, L. J., 196, 256, 268 Zhang, J., 69
Williams, M. N., 20, 25, 26, 104, 107, 277 Zhang, M. F., 17, 41, 172, 415
Williams, M. N., 278 Zhang, Z., 71, 129, 176
Williams, N., 76 Zhao, X., 127, 128, 130
Williams, T. H., 255 Zheng, X., 374
Wilson, P., 17, 38 Ziegler, M., 244
Wilson, W. J., 206 Ziliak, S., 21, 23
Winship, C., 85, 86, 88, 99, 426 Zou, H., 317
Wirth, R. J., 329 Zumbo, B. D., 411

IndxKline5E.indd 478 3/22/2023 3:12:10 PM


Subject Index

Note. Page numbers followed by an f or t indicate a figure or table.

A (Asymmetric) matrix, 100 reporting results and, 423


A priori (prospective) power, 174–177, 177t thresholds for, 170–172
Absolute fit indexes, 164 types of, 164–165
Accelerated longitudinal designs, 388–389 Approximate measurement invariance, 411
Accept–support test, 159–160 Approximate zero constraints, 411
Accuracy in parameter estimation (AIPE), 168 Arbitrary distribution function (ADF) method, 140–141
Acquiescence response style (ARS), 397 Arbitrary GLS, 327–328
ADANCO (Advanced Composite Modeling) program, 288–289. Arcs, 79
See also Software for SEM analyses Aroian test, 125, 128
Adaptive quadrature, 328 Ascertainment bias. See Collider bias
Adjacent variables, 79 Assumptions
Adjusted goodness-of-fit (AGIF) index, 164 common factor models and, 220
Adjusted test statistics, 316–317 diagrams for, 103–105, 103f
AGReMA (A Guideline for Reporting Mediation Analyses) Long- measurement errors in manifest-variable path models and,
Form Checklist, 368, 371 315–316
Akaike Information Criterion (AIC), 190–194, 192t, 193t, 328–329 mediation and, 126–128, 126f
Alignment method, 411–412 reporting results and, 420
Alternative models, 33, 37, 198–199. See also Specification Asymptotic covariance matrix, 133
Amos (Analysis of Moment Structures) program, 74, 138, 139, Augmented moment matrix (AMM), 148
150. See also Software for SEM analyses Automatic modification, 187, 188
Analysis (estimation) step of MI, 65–66. See also Multiple Autoregression latent trajectory (ALT) model, 389
imputation (MI) Autoregressive errors, 105
Analysis model, 65–66 Autoregressive models, 106
Analysis of variance (ANOVA), 10, 20, 392 Autoregressive paths, 332, 332f
Analysis plan, 416 Auxiliary theory, 2, 219
Ancestors in a DAG, 80 Auxiliary variables, 54, 419
Annotated syntax, data, and output files, 68, 69. See also Software Available case methods, 54
for SEM analyses Average variance extracted (AVE), 239–240
Answer-scale validity, 226
Ant colony optimization methods, 188 Back-door criterion, 93, 94f
Approximate fit indexes Back-door path, 81, 81f
history of, 164 Backward MI method, 402
measurement invariance and, 408–409 Backward search. See Model trimming
models with continuous and ordinal indicators, 325 Badness-of-fit statistic, 157, 165–166
overview, 163, 173–174, 174t, 178 Baseline confounders, 365n

479

IndxKline5E.indd 479 3/22/2023 3:12:10 PM


480 Subject Index

Baseline model, 168 item response theory and, 329


Basic CFA models, 232–234, 233f, 236–242, 237f, 238t, 240t, measurement invariance and, 409–410
243t. See also Confirmatory factor analysis (CFA) measurement model and diagram, 323, 323f
Basic growth models, 374–381, 374f, 375f, 379f, 380t, 381t methods to scale latent response variables, 323–324
Basic latent growth models, 372–374 models with continuous and ordinal indicators, 325
Basis coefficients, 376 other estimation strategies for, 327–329
Basis growth models, 375f, 376–377, 379–381, 380t, 381t overview, 319, 329–330
Basis set, 90–92, 91t, 92t polychoric correlations, 321–322, 322f
Bayes factor (BF), 192 Categorical data, 319–320
Bayes Information Criterion (BIC), 192–194, 193t, 328–329 Categorical exogenous variables, 113, 113t
Bayesian estimation Categorical indicators, 321–322
basic growth models with no covariates and, 378 Causal directionality, 103–104
measurement invariance and, 412 Causal indicators, 220–221, 225
SEM and, 18 Causal inference methods, 9, 14
small samples and, 317 Causal loops
software for SEM analyses and, 74 assumptions of, 333
Bentler Comparative Fit Index (CFI). See Comparative fit index identification requirements, 333–336, 334f, 335f
(CFI) overview, 331–333, 332f, 345
Bentler–Raykov R2, 346 specification and, 418
Bentler–Weeks representational system, 74 Causal mediation analysis, 360–368, 362f. See also Mediation
Berkson’s paradox, 87 Causal modeling, 10
Best fitting probability distribution (PFPD), 201 Causal pathways, 349
Best single indicator, 232 Causal–formative measurement, 219, 219f, 220–221, 223–224
Best-fitting proper indices (BFPI), 299 Causally-linked observed variables, 227
Best-subsets regression, 188 Centrality coefficients, 227
Between-imputation variance, 65 Centroids, 58
Bias, 42, 224, 393 Chain, 84–85, 85f
Bias-corrected bootstrap, 129 Chains of relations, 349
Biasing path, 81, 81f Change point, 388
Bidirectional edge, 80 Children in a DAG, 80
Bifactor models, 255–258, 257f. See also Nested models Chi-square
Bivariate ordinary least squares (OLS) regression, 80 categorical data and, 319–320
Block classification method, 333–334, 337 composite models and, 300–301
Block recursive models, 335f, 336, 337–338, 347–348 conditional indirect effects over groups, 211
Blocked-error model, 344–345, 345t fit indexes and, 163, 164
Blocked-error R2 (beR2), 344–345, 345t multiple-group path analysis and, 207–208, 207t
Bollen–Stine bootstrap, 137–138 nonrecursive models and, 339
Bootstrapping. See also Significance testing overview, 156–161, 184–186, 198–199
analyzing nonnormal data and, 137–138 reporting results and, 420, 422–423
common factor model in a small sample and, 312 scaled chi-squares, 161, 163
composite models and, 285 strengths of, 162
conditional indirect effects over groups, 211–212 structural regression models and, 272
indirect effects and, 128–129 Close-fit hypothesis, 180
overview, 19, 26 Close-yet-failing models, 180
software for SEM analyses and, 74 Clustering methods, 227
Bow pattern, 110 C-OAR-SE method, 226–227
Bow-free pattern, 110 Code variables, 113, 113t, 114f
Box plots, 58 Coefficient omega, 241
Box-and-whisker plots. See Box plots Collider, 85, 87–88
Box–Cox transformations, 61 Collider bias, 87–88, 89f
Bright-line rule, 24 Collinearity, extreme, 56
Building, 183–184, 188–190, 189t, 398 Combination rule, 170
Burn-in period, 65 Common factor model
CFA and EFA and, 229
CALIS (Covariance Analysis of Linear Structural Equations) overview, 220, 309–311
procedure, 75 in a small sample, 311–315, 311f, 312t, 313t, 314t, 315t
Casewise ML, 138–140, 155 Common factors, 229
Categorical confirmatory factor analysis. See also Confirmatory Common metric completely standardized solution, 204
factor analysis (CFA); Continuous/categorical variable Common metric standardized solution, 204
methodology (CCVM) Common variance, 219–220, 229
estimators, 324 Comparative fit index (CFI)
example demonstrating, 325–327, 326t, 327t, 328t common factor model in a small sample and, 312

IndxKline5E.indd 480 3/22/2023 3:12:10 PM


Subject Index 481

composite models and, 300–301 indicator selection and, 231–232


measurement invariance and, 398–399 item response theory and, 329
model comparison and, 202 measurement invariance and, 38, 394, 397, 399–401, 408–410,
models with continuous and ordinal indicators, 325 411, 412
multiple-group path analysis and, 207t models with continuous and ordinal indicators, 325
overview, 165–166, 168–169, 173–174, 174t multitrait–multimethod (MTMM) data and, 252–255, 254f, 256
reporting results and, 423 overview, 4, 15, 229, 258–259
thresholds for, 170–172 respecification of CFA models, 243–246, 245t
T-size indexes and, 181 sample size and, 419
two-factor CFA model and, 239 scaling factors in, 234–236, 235f
Competitive mediation, 127 second-order and bifactor models of, 255–258, 257f
Complementary mediation, 127 software for SEM analyses and, 75
Complete mediation, 127 structural regression models and, 265–268, 266f, 273, 280, 281
Completely overlapping, 194 Confirmatory path analysis. See Piecewise SEM
Composite indicators, 221–222, 225 Confirmatory tetrad analysis (CTA), 252, 286
Composite latent construct, 220–221 Confounding
Composite measurement, 219, 219f, 221–222, 223–224, 228 causal mediation analysis and, 365–366
Composite SEM. See also Structural equation modeling (SEM) confounder bias, 81f, 82–84
alternative composite model, 294–297, 295f covariate selection and, 88
computer tools and, 288–289 full SR models and, 267–268
example demonstrating, 289–294, 289f, 290t, 291t, 292f, 295f overview, 80–81, 81f, 82
Henseler–Ogasawara (HO) specification and, 301–304, 302f, in parametric models, 105–106, 106f
303t, 304t Congeneric indicators, 241, 251–252
overview, 1–2, 11, 12–13, 284–285, 304–305 Consistent mediation, 127
partial least squares path modeling (PLS-PM) and, 297–301, Consistent PLS (PLSc), 286
298t Constrained baseline approach, 398
reporting results and, 416–417 Constrained estimation, 141, 249–251, 250f, 251f
sample size and, 12, 15–16 Constrained parameter, 102
terminology, 285–288 Constraint interaction, 236n
Composite–formative models, 284–285 Construct bias, 393
Computer software. See Software for SEM analyses Constructs, 218
Concept proxy framework, 1–2, 218–219, 218f Continuous indicators, 325
Conditional causality, 107 Continuous variables
Conditional effects, 204, 209–212, 210t interactive effects of, 115–116, 115f, 116f
Conditional growth models, 382, 383f reporting results and, 420, 421
Conditional independencies, 84–88, 85f, 90–92, 91t, 92t SEM and, 10
Conditional instrument, 95–96, 95f Continuous/categorical variable methodology (CCVM)
Conditional linear effects, 115 estimators, 324
Conditional multivariate normality, 136 example demonstrating, 325–327, 326t, 327t, 328t
Conditional process analysis, 358–360, 359f latent response variables and thresholds and, 321, 322f
Conditioning set, 84 measurement model analyzed in, 323, 323f
Confidence intervals, 129, 272, 298t, 423 models with continuous and ordinal indicators, 325
Configural invariance, 395. See also Measurement invariance (MI) overview, 320
Confirmation bias, 17 Contracted chains, 80–81, 81f, 103–106, 103f, 106f
Confirmatory composite analysis (CCA) Controlled direct effect (CDE), 363, 364–365. See also Direct
formative measurement and, 225 effects
Henseler–Ogasawara (HO) specification and, 304 Convergence, 61, 133, 246–248, 247t, 420
overview, 4, 11, 287, 288 Corrected normal theory method, 138
Confirmatory factor analysis (CFA). See also Categorical Correlated causes, 106–109, 107f
confirmatory factor analysis; Factor analysis Correlated error, 80, 260–261, 260t, 262f
basic CFA models, 232–234, 233f Correlated measurement error, 26, 276, 277–278, 277f
compared to exploratory factor analysis (EFA), 229–231 Correlated traits–correlated methods (CTCM) model, 253, 254f
composite models and, 300–301 Correlated traits–correlated methods minus one (CTC(M-1))
covariance structure analysis and, 11 models, 256
dynamic thresholds and, 171–172 Correlated-uniqueness (CU) model, 244, 253, 254f, 255
equality constraints and, 251–252 Correlation matrix, 47, 47t, 141–142, 300–301
equivalent CFA models, 249–251, 250f, 251f Correlation residuals, 145–147, 146t, 314–315, 315t, 341t
estimation problems, 246–249, 247t Correlation root mean residual (CRMR), 316
example demonstrating, 236–242, 237f, 238t, 240t, 243t Correlation weights, 299
formative measurement and, 225 Correlations, 47t, 141–142
identification rules for correlated errors or multiple loadings, Correspondence rules, 2, 219
260–261, 260t, 262f Counterfactuals, 13, 360–368, 362f

IndxKline5E.indd 481 3/22/2023 3:12:10 PM


482 Subject Index

Counting rule, 104 Determinants, 48


Covariance equivalence, 194 Diagonal weighted least squares (DWLS), 324, 325, 327t, 328t,
Covariance matrix, 160, 300–301, 420 329
Covariance matrix nesting, 200, 200f Dichotomania, 20, 23
Covariance residuals, 145–147, 146t Dichotomous outcomes, 117
Covariance structure analysis, 11. See also Traditional SEM Differential additive response bias, 397
Covariances, 10, 43, 47t Differential functioning, 394
Covariate selection, 81–82, 88–89, 89f Differential item functioning, 394
Coverage, 177 Diggle–Kenward selection modeling, 155
Credibility crisis, 20 Dimensional invariance, 395n
Cross-domain change, 386, 387f Direct causal effects, 122–123, 123t
Cross-group equality constraint, 102 Direct effect growth curve model, 386, 387f
Cross-lag direct effects, 332 Direct effects
Cross-lag panel designs, 356–358, 357f causal mediation analysis and, 360, 363–368, 369–370
Cross-loadings, 229 cross-lag panel designs for mediation and, 356
Cross-products information, 154 example demonstrating, 96–98, 96f, 97t, 98t
Cross-sectional designs nonrecursive models and, 343
mediation analysis in, 350–353, 351f, 352f single-door criterion and, 94
reporting results and, 423 starting values and, 134
specification and, 418 Direct feedback loop
Cross-validation, 416 identification requirements, 333–336, 334f, 335f
Cross-world independence, 365 nonrecursive models and, 341f, 343
C-rules. See Correspondence rules overview, 331, 332f
CTA-PLS, 286 Directed acyclic graph (DAG)
Curve-of-factors latent growth model, 386–388, 387f causal loops and, 333
Curvilinear effects, 113–115, 114f, 117, 420 d-separation criterion and, 90–92, 91f, 92t
example demonstrating, 96–98, 96f, 97t, 98t
Data analysis. See also Software for SEM analyses graph vocabulary and, 79–80
inputting data, 46–47, 47t graphical identification criteria and, 92–96, 94f, 95f
missing data and, 49 mediation analysis and, 368, 371
overview, 34, 37 nonparametric SEM and, 11
reporting standards and, 40t overview, 79
Data loss mechanism, 49. See also Missing data piecewise SEM and, 118
Data matrix Directed cyclic graph (DCG), 11, 79, 333
model comparison and, 202 Directed edge, 79
overview, 63t Directed path, 80
positive definiteness and, 48–49, 50–51 Directionality, 103–104
reporting results and, 420 Disacquiescence response style (DRS), 397
Data preparation Disconfirmability, principle of. See Principle of
distributions, 58–61, 60t disconfirmability
extreme collinearity and, 56 Discrete-time survival indicators, 155
handling incomplete data and, 54–56 Disjunctive cause criterion, 89, 89f
inputting data, 46–47, 47t Distinguishable models, 201
missing data and, 49, 52–54 Distributions, 58–61, 60t, 420
outliers, 56–58 Disturbance, 103, 123, 123t
overview, 46, 61, 62 Disturbance covariances, 134
positive definiteness and, 48–49, 50–51 Disturbance variances, 134
relative variances, 62, 63t Do-calculus, 96
reporting results and, 419–420 Dominant indicators, 296
Data screening, 50, 61 Donor cases, 64
Default ML, 135–138. See also Maximum likelihood (ML) Donor set, 64
estimation Drawing editor, 68, 69–70. See also Software for SEM analyses
Definition variable approach, 386 d-separation (d-sep) test, 118–119, 120–122, 121t, 130
Degrees of freedom d-separation criterion
composite models and, 296–297 equivalent models and, 197–198
full SR models and, 268 graphical identification criteria and, 92–93
model degrees of freedom, 102–103 overview, 89–92, 91f, 92t
reporting results and, 42, 44, 422–423 reporting results and, 420
Delta scaling (parameterization), 323–324 d-separation equivalence, 194
Derived concept, 218 Dynamic (flexible) thresholds, 171–172
Descendants in a DAG, 80 Dynamic Model Fit, 171

IndxKline5E.indd 482 3/22/2023 3:12:10 PM


Subject Index 483

Edges. See Arcs Exogeneity, 82


EFA in CFA framework (E/CFA), 231 Exogenous regressor, 80
Effect decomposition, 143–144, 144t, 343 Exogenous variables in SEM
Effect indicators, 219–220, 219f advanced topics in parametric models and, 113–116, 113t, 114f,
Effect sizes 115f, 116f
fit indexes and, 163 assumptions and, 104
for indirect effects, 353–356, 355t default ML and, 136
measurement invariance and, 401 fixed-X option and, 136
overview, 38 overview, 33
Effects coding identification (ECI) constraints, 235f, 236 RAM graphical symbolism and, 101–102
Effects coding method, 235f, 236, 400–401 Expectation–maximization (EM) algorithm, 65
Eigenvectors, 48 Expected information matrix, 153–154
80/20 rule of data analysis, 61 Expected parameter change, 188, 243
Elastic net method, 317 Explaining aware effect, 87
Emergent variables, 222 Exploratory bifactor models, 258
Empirical concepts, 218 Exploratory factor analysis (EFA). See also Factor analysis
Empirical growth records, 373 compared to confirmatory factor analysis (CFA), 229–231
Empirical overidentification, 246 indicator selection and, 231–232
Empirical respecification, 184. See also Respecification of models measurement invariance and, 394
Empirical underidentification, 135, 246 overview, 229
Empirical weights, 299 software for SEM analyses and, 74
Endogeneity, 81, 105–106 Exploratory structural equation modeling (ESEM), 231
Endogenous regressor, 81 Exploratory tetrad analysis (ETA), 252
Endogenous selection bias. See Collider bias Extended instruments, 96
Endogenous variables, 33, 104 Extra dependent variable (DV) method, 139
Enhancing the Quality and Transparency of Health Research Extreme collinearity, 56
(EQUATOR) network, 39 Extreme response style (ERS), 396
Epistemic correlations. See Correspondence rules
EQS program, 74, 186. See also Software for SEM analyses F (Filter) matrix, 100
Equal-fit hypothesis, 184, 201 Factor analysis, 15–16, 231–232. See also Confirmatory factor
Equality constraint analysis (CFA); Exploratory factor analysis (EFA)
common factor models and, 309 Factor covariances, 234–235, 246, 400–401
confirmatory factor analysis (CFA) and, 251–252 Factor indeterminacy, 230
measurement invariance and, 405–408, 406t, 407t Factor score indeterminacy, 230, 231
overview, 102 Factor score regression (FSR), 310
Equilibrium, 333, 343 Factor variances, 235, 235f, 401
Equivalence testing, 180–181 Factorial (factor) weighting, 298–299
Equivalent models Factor-ratio test, 402
confirmatory factor analysis (CFA) and, 249–251, 250f, 251f Faithfulness assumption, 90
coping with, 196–198 Fallacy of the transposed conditional. See Inverse probability
overview, 17, 194–196, 195f errors
reporting results and, 44, 416 Falsification tests, 84
Error covariances, 246, 260–261 Feedback loops, 331–333, 332f
Error of estimation, 167 Fences, 58. See also Box plots
Error propagation, 131–132 Filter myth, 22
Error variance homogeneity. See Strict invariance First- and second-stage conditional process model, 360
Errors of approximation, 167 First-order factors, 255, 257
E-step (expectation), 65 First-stage conditional process model, 358–360, 359f, 361–362, 362f
Estimation accuracy fit index (EAFI), 163 Fisher information matrix. See Information matrices
Estimation methods Fit indexes
alternative estimators for continuous outcomes, 140–141 approach to fit evaluation and, 172–173
for categorical data, 319–320 CFI overview, 168–169
confirmatory factor analysis (CFA) and, 246–249, 247t example demonstrating, 173–174, 174t
relative variances and, 61 measurement invariance and, 398–399
reporting results and, 40t, 42, 44, 420–422 multiple-group path analysis and, 207–208
sample size and, 15 overview, 163–166, 166f
Exact-fit hypothesis, 157, 175 reporting results and, 420, 423
Exact-fit test, 172, 294 RMSEA overview, 166–168, 166f
Excess kurtosis, 59 thresholds for, 170–172
Exclusion restriction, 82 Fitted correlations. See Predicted correlations
Excrescent variables, 301 Fitted covariances. See Predicted covariances

IndxKline5E.indd 483 3/22/2023 3:12:10 PM


484 Subject Index

Fitted means. See Predicted means chi-square and, 159


Fitted residuals. See Covariance residuals composite models and, 285
Fitting propensity, 201–202 example demonstrating, 173–174, 174t
Fixed parameter, 101–102 full SR models and, 268
Fixed thresholds, 170–171 measurement invariance and, 404–405, 404t
Fixed weights, 299 model comparison and, 202
Fixed-X option, 136 nonrecursive models and, 339
Fork, 84–85, 85f overview, 41, 177–178
Formative measurement, 224–225, 227–228, 299–300 reporting results and, 44, 415–416, 420
Forward CI method, 402 structural regression models and, 272t
Four-step modeling, 268–269, 272 thresholds for, 170–172
Fourth moment about the mean, 59 Global SAM, 310. See also Structural-after-measurement (SAM)
Fourth standardized moment, 59 approach
Fraction of missing information (FMI), 65 Goodman test, 125, 128
Free baseline approach, 398 Goodness-of-fit index (GFI), 164
Free parameter, 101–102, 244, 246 Goodness-of-fit statistics, 163, 225
Front-door path. See Directed path Graph vocabulary, 79–80
Full information maximum likelihood (FIML) Graphical editors, 69–70. See also Software for SEM analyses
basic latent growth models and, 373 Graphical identification criteria
composite models and, 285 causal loops and, 333–336, 334f, 335f
for incomplete data, 138–140 example demonstrating, 96–98, 96f, 97t, 98t
missing data and, 55 overview, 92–96, 94f, 95f, 98–99
overview, 2, 65 Graphical symbolism
software for SEM analyses and, 72, 74–75 basic CFA models and, 232–233, 233f
Full LISREL model, 263. See also Structural regression (SR) basic growth models with no covariates and, 374–375, 375f
models diagrams for contracted chains and, 103–105, 103f
Full longitudinal design, 357–358, 357t full SR models and, 263–265, 264f, 270, 270f
Full SR models. See also Structural regression (SR) models mediation analysis and, 368, 371
example demonstrating, 269–273, 270f, 271t, 272t, 274t RAM graphical symbolism, 100–103, 111
other modeling strategies for, 268–269 Group differences. See Multiple-group path analysis
overview, 263–265, 264f, 281 Group-mean substitution, 55
two-step modeling and, 265–268, 266f Growth predictor model, 373, 382–385, 383f, 384t
Full-information maximum likelihood (FIML) estimator, 328, 329
Full-information methods. See Global estimation Half longitudinal design, 356–357, 357t
Fully conditional specification, 64, 65 HARKing, 183–184
Fully weighted least squares (WLS) estimation, 140 Hausman specification test, 84
Fundamental confidence fallacy, 22 Hawthorne effect, 225
Fungible weights method in regression, 133 Henseler–Ogasawara (HO) specification, 285, 301–304, 302f,
303t, 304t
General (nonmonotone) missing data, 64. See also Missing data Heywood cases, 134–135, 246–248, 247t, 249
General linear model (GLM), 11 Hierarchical CFA model, 255–258, 257f
Generalized least squares (GLS), 140, 327–328 Hierarchical linear modeling, 373
General-specific models, 255–258, 257f Hierarchically related models, 182–183, 198–199, 200–202, 200f
Genetic data, 18, 188 High-dimension data, 366
Global estimation High-dimension mediation, 366
alternative estimators for continuous outcomes, 140–141 Hinges, 58. See also Box plots
analyzing nonnormal data and, 137–138 Holistic construal, 218
default ML, 135–137 Homogeneity assumption, 213
error propagation, 131–132 Homogeneous indicators, 231–232
estimation problems in CFA and, 248–249 Homophily bias. See Collider bias
example demonstrating, 142–147, 142t, 143t, 144t, 146t Hyman–Tate Criterion, 108
fitting models to correlation matrices, 141–142 Hypothesis testing, 7, 15, 16, 410
maximum likelihood (ML) estimation, 132–135
overview, 117–118, 131–132, 142, 151, 177–178 Identification
reporting results and, 422–423 basic CFA models and, 234
SEM computer tools and, 150–151 causal loops and, 333–336, 334f, 335f
Global fit. See also Chi-square; Model fit; Root mean square error causal–formative measurement and, 221
of approximation (RMSEA); Standardized root mean square composite models and, 297
residual (SRMR) confirmatory factor analysis (CFA) and, 244, 246
approach to fit evaluation and, 172–173 estimation problems in CFA and, 248
basic growth models with no covariates and, 378 measurement invariance and, 410

IndxKline5E.indd 484 3/22/2023 3:12:11 PM


Subject Index 485

overview, 2, 34, 35–36 Iterative estimation methods, 61, 133, 309


reporting results and, 418–419 Iterative steps, 298t
Identification rules, 260–261, 260t, 262f
Idiosyncratic error, 80 Johnson–Neyman technique, 362
Ill-scaled covariance matrix, 62 Joint interventional indirect effect (IIEjo), 367
Imaging data, 18 Journal article reporting standards for quantitative studies (JARS-
Impact, 393 Quant), 25, 39, 40t. See also Reporting results
Implied conditional independence, 84 Journal article reporting standards (JARS), 39, 40t. See also
Imputation model, 64 Reporting results
Imputation step of MI, 64–65. See also Multiple imputation (MI) Just-determined equations. See Identification; Just-identified
Inadmissible solutions, 134–135, 246–248, 247t, 309 equations
Incomplete data. See Missing data Just-identified equations, 35, 36, 110–111. See also Identification
Inconsistent mediation, 127 JWK (Jöreskog, Keesling, Wiley) model, 12
Incremental (relative, comparative) fit indexes, 164–165
Incremental validity, 26 Kaiser–Dickman algorithm, 120
Independence (null) model, 168–169 Kappa-squared, 354
Index of moderated mediation, 360 Kelley–Lai precision method, 176, 177
Indicator selection, 231–232, 234–235 Knot, 388
Indirect effects Kolmogorov–Smirnov (K-S) test, 59
causal mediation analysis and, 360, 363–368, 369–370 Kurtosis, 59–61, 60t, 420
conditional indirect effects over groups, 211–212
cross-lag panel designs for mediation and, 356 L → M block (latent to manifest), 220, 284
effect sizes for, 353–356, 355t Lagrange Multiplier (LM), 187
estimates of, 128t Lasso (least absolute shrinkage and selection operator) method,
example demonstrating, 125–129, 126f, 128t 317
the four steps and, 126–128, 126f Last observation carried forward (LOCF), 55
multiple-group path analysis and, 210–212, 210t Latent basis curve model. See Latent basis growth model
overview, 106–109, 107f Latent basis growth model, 375f, 376–377
reporting results and, 423 Latent class analysis, 12
Indirect feedback loop, 331, 335f, 336 Latent difference score models, 389
Indistinguishable models, 201 Latent growth curve models, 372, 389, 391–392
Individually varying time points, 385 Latent growth factors, 373
Inequality constraint, 102 Latent growth modeling
Information matrices, 133–134, 153–154, 248 basic growth models with no covariates and, 374–379, 374f,
Information-theoretic fit indexes. See Predictive fit indexes 375f, 379f
Informative priors, 317 basic latent growth models, 372–374
Initialization, 297, 298t example analyses of basic growth models, 379–381, 380t,
Inner estimation, 297–298, 298t 381t
Inner model, 295f, 296, 297–298 example for a growth predictor model with time-invariant
Instrumental variables, 81f, 82–84, 94–96, 95f covariates, 382–385, 383f, 384t
Interaction misspecification, 160–161 extensions of, 385–389, 387f, 388f
Interactive effects, 115–116, 115f, 116f overview, 371, 372, 385, 389
Intercept, 148–150, 208–209, 375–376, 382, 384, 389, 391–392, Latent response formulation (LRF), 320
396 Latent response variables, 321, 322f, 323–324
Interchangeable methods (raters), 256 Latent state-trait models, 389
Interpretation, 423 Latent variable longitudinal curve model. See Curve-of-factors
Interpretational confounding, 267–268 latent growth model
Intervening (intermediate) variables, 33 Latent variable models, 12, 15. See also Structural regression (SR)
Interventional direct effect (IDE), 366–368, 369–370. See also models
Direct effects Latent variable path analysis with partial least squares (LVPLS),
Interventional indirect effect (IIE), 366–368, 369–370. See also 286
Indirect effects Latent variable path models, 15, 263
Invariance, 395–398, 412. See also Measurement invariance (MI) Latent variables, 24
Inverse probability errors, 21–22 Latent-to-observed (L2O) transformations, 249
Inverted fork, 85, 85f lavaan. See also Software for SEM analyses
I-step (imputation), 64–65 casewise ML and, 139
Item response theory (IRT) comparing nonnested models and, 192–193, 192t
categorical confirmatory factor analysis and, 329 correlation residuals and, 147
measurement invariance and, 411, 412 information matrices and, 153–154
software for SEM analyses and, 72, 74, 75 mean structures and, 148
training in, 20 nonrecursive models and, 339–340, 340t

IndxKline5E.indd 485 3/22/2023 3:12:11 PM


486 Subject Index

lavaan (continued) Longitudinal independence model, 169


regularized SEM and, 317 Longitudinal measurement invariance, 393. See also
robust ML and, 161, 163 Measurement invariance (MI)
structural regression models and, 273, 280 Lower diagonal form, 47, 47t
Law of diffusion of idiocy, 22
Learning SEM M → C block (manifest to composite), 221, 284
assessing your knowledge, 27–31 M → L block (manifest to latent), 220, 284
obstacles to, 20–23 MacCallum–RMSEA method. See also Root mean square error of
overview, 26–27 approximation (RMSEA)
preparing for, 7–9, 19 basic growth models and, 380
significance testing and, 23–24 common factor model in a small sample and, 312
training in graduate programs and, 19–20 example demonstrating, 176–177, 177t
Lee–Hershberger replacing rules, 194 full SR models and, 269, 270–271
Leptokurtic distributions, 59 nonrecursive models and, 339
Level and shape model. See Latent basis growth model overview, 175, 176
Likelihood fallacy, 23 structural regression models and, 280–281
Likelihood ratio test, 157, 211 MACS (mean and covariance structure) models, 394, 399–400
Likert items, 319 Mahalonbis distance, 58
Limited-information methods. See Local estimation Manifest variable path models
Linear and quadratic (polynomial) growth models, 378–379, 379f, controlling measurement error in, 315–316
391–392 example demonstrating, 96–98, 96f, 97t, 98t
Linear effects, 10 full SR models and, 263–265, 264f
Linear growth models, 378–379, 379f overview, 14–15
Linear–linear growth process, 388–389, 388f partial SR models with single indicators, 275–276
Link function, 117 Marginal model, 367
Links, 79 Marker variable method. See Reference variable method
LISREL (Linear Structural Relations) program. See also Software Markov assumption, 90
for SEM analyses Markov Chain, 64–65
casewise ML and, 139 Markov Chain Monte Carlo (MCMC) approach, 64–65, 328
fit indexes and, 163 Masking, 57
indirect effects and, 129 Matrix input, 47, 47t, 48–49, 50–51
JWK model and, 12 Maximum likelihood (ML) estimation
modification procedure and, 187 basis growth models and, 381, 381t
multiple-group path analysis and, 204 casewise ML, 138–140
overview, 67, 74 categorical data and, 319–320, 329
scaled chi-square difference tests and, 186 common factor model in a small sample and, 314–315, 314t, 315t
standardized residuals and, 146 default ML, 135–137
symbols and notation and, 4 estimation problems in CFA and, 248–249
Listwise deletion, 54 Henseler–Ogasawara (HO) specification and, 301–304, 302f,
Little MCAR Test, 59 303t, 304t
Local estimation measurement invariance and, 405–408, 406t, 407t
common factor models and, 309 multiple-group path analysis and, 210, 210t, 211t
estimation problems in CFA and, 248–249 nonrecursive models and, 342t
example demonstrating, 120–129, 121t, 122t, 123t, 126f, 128t overview, 132–135, 151
overview, 117–118, 130 partial SR models with single indicators, 274–275, 275t
piecewise SEM and, 120 robust ML, 138
Local fit. See also Model fit scaled chi-squares and, 161, 163
approach to fit evaluation and, 172 structural regression models and, 273, 274t, 281, 282t
overview, 41, 42 two-factor CFA model and, 239–240, 240t
reporting results and, 44, 415–416, 420 Maximum Wishart likelihood (MWL), 135–137
Local SAM, 310. See also Structural-after-measurement (SAM) McArdle–McDonald RAM system. See Reticular action model
approach (RAM)
Local Type I error fallacy, 21 Mean residual, 150
Locally independent effect indicators, 220 Mean structures, 147–150, 149f
Loess procedure, 115 Mean substitution, 55
Logistic regression for dichotomous outcomes, 20 Mean-adjusted chi-square, 161
Longitudinal data Means, 10, 401
basic latent growth models and, 373 Measured confounding bias, 82, 88
causal loops and, 331–333, 332f Measurement. See also Measurement error; Psychometrics
cross-lag panel designs for mediation and, 356–358, 357t categorical data and, 323, 323f
latent growth modeling and, 386, 389 estimation problems in CFA and, 246–249, 247t

IndxKline5E.indd 486 3/22/2023 3:12:11 PM


Subject Index 487

measure selection, 34 software for SEM analyses and, 75


overview, 2, 7, 8, 14–15, 24–25, 26 steps of multiple imputation (MI) and, 64–66
reporting results and, 419 Missing completely at random (MCAR). See also Missing data
sample size and, 15–16 categorical data and, 329
selecting a measurement model and, 223–224 diagnosing, 53–54
training in, 23 handling incomplete data and, 54–56, 139–140
unequal measurement intervals and options for defining the information matrices and, 153–154
intercept and, 391–392 overview, 49, 52, 53f
Measurement error. See also Measurement steps of multiple imputation (MI) and, 64–66
causal–formative measurement and, 221 Missing data
composite models and, 287 basic latent growth models and, 373
formative measurement and, 225 casewise ML and, 155
in manifest-variable path models, 315–316 categorical data and, 329
overview, 104 diagnosing, 53–54
regression analysis and, 25–26 handling, 54–56, 138–140
Measurement invariance (MI) information matrices and, 153–154
analysis decisions, 398–401 overview, 49, 52–54, 53f
confirmatory factor analysis (CFA) and, 408–410 reporting results and, 419
example demonstrating, 402–408, 403f, 404f, 406t, 407t software for SEM analyses and, 74–75
levels of invariance, 395–398 steps of multiple imputation (MI) and, 64–66
other statistical approaches to, 410–412 Missing not at random (MNAR). See also Missing data
overview, 393–394, 408–409, 412 basic latent growth models and, 373
partial measurement invariance, 401–402, 403t casewise ML and, 140, 155
replication and, 38 diagnosing, 53–54
Median absolute deviation (MAD), 57 overview, 52, 53f
Mediation software for SEM analyses and, 75
based on nonparametric models and counterfactuals, 360–368, steps of multiple imputation (MI) and, 64–66
362f Missingness graphs, 52
conditional process analysis, 358–360, 359f Missingness mechanism, 52
cross-lag panel designs for, 356–358, 357f Misspecification, 131–132
in cross-sectional designs, 350–353, 351f, 352f Mixture modeling, 12, 74
effect sizes for indirect effects and, 353–356, 355t Model building, 183–184, 188–190, 189t, 398
the four steps, 125, 126–128, 126f Model chi-square, 156–161, 162, 163. See also Chi-square
mediation myth, 349 Model degrees of freedom, 102–103. See also Degrees of
mediation ratio, 354 freedom
overview, 86, 349–350, 371 Model fit. See also Identification
reporting results and, 368–370, 423 approach to fit evaluation and, 172–173
timing criterion and, 107–109 model comparison and, 202
Mesokurtic distributions, 59 model fit indexing, 156, 163–166
Method variance, 230 multiple-group path analysis and, 207, 207t
Methodology, 40t, 44 overview, 37, 415–416
Metric invariance. See Weak invariance reporting standards and, 39, 40t, 41, 44
M-graphs, 52 Model parameters
Midpoint response style (MRS), 396 basic CFA models and, 233f, 234
MIIVsem package, 312–315, 313f, 314t. See also Software for composite models and, 296–297
SEM analyses overview, 14–15, 34
MIMIC (multiple indicators and multiple causes) modeling RAM graphical symbolism and, 101–102
composite models and, 293, 294 Model size effect, 160
with covariates, 410 Model trimming, 183–184, 398
other statistical approaches to, 410–411 Model-implied instrumental variables (MIIV), 312, 313–315,
overview, 222–223, 222f 313f
Minimally sufficient (adjustment) set, 93, 94f Model-implied instrumental variables using two-stage least
Minimum covariance determinant (MCD) method, 286 squares (MIIV-2SLS), 248–249, 309, 312
Minimum fit chi-square, 157 Modeling school, 163
Missing at random (MAR). See also Missing data Models. See also Specification
basic latent growth models and, 373 model hacking, 23
categorical data and, 329 overview, 8, 14–15
diagnosing, 53–54 reporting standards and, 39, 42
handling incomplete data and, 55–56, 139–140 selection uncertainty, 193–194
information matrices and, 153–154 specification of, 32–33
overview, 52, 53f testing, 156

IndxKline5E.indd 487 3/22/2023 3:12:11 PM


488 Subject Index

Moderated mediation, 204 Naming fallacy, 218, 423


Moderated multiple regression (MMR), 115–116 Natural direct effect (NDE), 363, 364–365. See also Direct effects
Moderated path analysis, 203. See also Multiple-group path analysis Natural indirect effect (NIE), 363, 364–365. See also Indirect
Modification index (MI), 187–190, 189t effects
Modified Bollen–Stine bootstrap, 138 Nearly equivalent models, 198, 416
Modularity, 86 Negative bias, 104
Moment equivalence, 194 Negative excess kurtosis, 59
Moment matrix nesting, 200 Negative skew, 59, 61
Monotone missing data, 64. See also Missing data Negligible randomness assumption, 90
Monte Carlo simulations Neo-Fisherian significance assessments (NFSA), 24
conditional indirect effects over groups, 211–212 Nested models
model comparison and, 202 building and trimming, 183–184
model size effect and, 160 chi-square difference test and, 184–186
overview, 175–176, 177, 178f conditional indirect effects over groups, 211–212
software for SEM analyses and, 72, 74, 75 confirmatory factor analysis (CFA) and, 255–258, 257f
Mplus. See also Software for SEM analyses overview, 182–183, 198–199, 200–202, 200f
categorical data and, 328–329 Nesting and equivalence testing (NET), 200, 200f
indirect effects and, 129 Network models, 227
measurement invariance and, 408 Neutral cycle, 339
overview, 2, 75 Neyman–Rubin model. See Potential outcomes model (POM)
scaled chi-square difference tests and, 186 Nil hypothesis, 157
traditional SEM and, 12 Nodes, 79
M-step (maximization), 65 Nominal variables, 10
Multi-agent estimation algorithm, 133 Nominalist fallacy, 218
Multicollinearity, 336, 419. See also Extreme collinearity Nonadjacent variables, 79
Multidimensional measurement, 232 Noncentrality fit indexes, 165
Multinormality. See Multivariate normality Noncontinuous variables, 10
Multiphase longitudinal designs, 332 Nonignorable missing data. See Missing data; Missing not at
Multiple imputation (MI) random (MNAR)
casewise ML and, 139–140 Nonlinear constraint, 102
distributions and, 59 Nonlinear curve fitting, 376
missing data and, 55–56, 138–140 Nonnegative least squares (NNLS), 299
R packages and, 72 Nonnested models, 190–194, 191f, 192t, 193t
software for SEM analyses and, 74–75 Nonnormal data, 137–138, 153–154
steps of, 64–66 Non-normed fit index (NNFI), 169
Multiple loadings, 260–261, 260t, 262f Nonoverlapping models, 201
Multiple optima, 133 Nonparametric bootstrapping, 285
Multiple regression (MR), 20, 25–26. See also Regression analysis Nonparametric causal models
Multiple-group path analysis correlated causes and, 107
conditional indirect effects over groups, 211–212 covariate selection and, 88–89, 89f
example demonstrating, 205–211, 205t, 206f, 207t, 208t, 209t, example demonstrating, 96–98, 96f, 97t, 98t
210t, 211t overview, 79, 98–99
overview, 203–205, 212 Nonparametric models, 360–368, 362f
Multiple-indicator measurement Nonparametric regression, 114–115
alternative measurement models and approaches, 225–227 Nonparametric SEM, 1, 4, 11, 13–14, 98–99, 416–417. See also
causal–formative measurement and, 220–221 Structural equation modeling (SEM); Structurally causal
composite measurement and, 221–222 model (SCM)
formative measurement and, 224–225 Nonpositive definite (NPD) data, 48–49, 50–51
mixed-model measurement and, 222–223, 222f Nonrecursive models
overview, 14, 15, 217–219, 218f, 227–228 blocked-error model and, 344–345, 345t
reflective measurement and, 219–220, 219f causal loops and, 334
reporting results and, 417 evaluation of the rank condition, 347–348
selecting a measurement model and, 223–224 example demonstrating, 338–344, 340t, 341f, 342t
Multiple-indicators and multiple-causes (MIMIC) model. See identification requirements, 336
MIMIC (multiple indicators and multiple causes) modeling mediation analysis in cross-sectional designs and, 350, 351f
Multitrait–multimethod (MTMM) study, 252–255, 254f, 256 order condition and rank condition and, 337–338
Multivariate growth model, 386, 387f overview, 109–111, 109f, 345
Multivariate imputation by chained equations (MICE). See Fully respecification of, 336–337, 337f
conditional specification Nonsingular matrix, 48
Multivariate normality, 58–59, 136–137 Normal distributions, 117
Multivariate outlier, 58. See also Outliers Normal theory method, 136

IndxKline5E.indd 488 3/22/2023 3:12:11 PM


Subject Index 489

Normalized noncentrality parameter, 167 with correlated causes or indirect effects, 106–109, 107f
Normalized residuals, 145–147, 146t diagrams for contracted chains and, 103–105, 103f
Normalizing transformations (normalization), 60–61, 60t example demonstrating, 111, 111f
Normed chi-square. See Chi-square overview, 100, 111
Normed chi-square (NC), 161 recursive, nonrecursive, and partially recursive models and,
Notable-null hypothesis, 157 109–111, 109f
Not-close-fit hypothesis, 175, 180 symbolism for, 100–103
Null hypothesis rejection, 16, 118, 211, 312 Parametric models
Numerical integration, 328 advanced topics in, 113–116, 113t, 114f, 115f, 116f
confounding in, 105–106, 106f
Observational equivalence, 194 with estimates, 123–125
Observational sector, 218 overview, 14–15
Observed correlation, 118 Parceling, 310–311
Observed information matrix, 153–154 Parents in a DAG, 80
Occasion-specific congenerity, 387f, 388 Pareto principle, 61
Odds-against-chance fallacy, 21 Parsimony principle, 103, 418
One-sided confidence, 167 Parsimony-adjusted indexes, 164
One-step modeling, 265 Partial correlations, 120–122, 121t
Order condition, 337–338 Partial least squares path modeling (PLS-PM). See also
Ordered-categorical variables, 319. See also Ordinal variables Composite SEM
Ordinal indicators, 325 composite models and, 285, 294–296, 295f, 297–301, 298t,
Ordinal variables, 10, 421 304
Ordinary least squares (OLS). See also Ordinary least squares overview, 297–300, 298t
(OLS) regression terminology, 285–288
composite models and, 288, 303–304, 304t Partial measurement invariance, 401–402, 403t, 412
Henseler–Ogasawara (HO) specification and, 303–304, 304t Partial mediation, 127
mediation analysis and, 355–356, 355t Partial SR models. See also Structural regression (SR) models
Ordinary least squares (OLS) regression. See also Ordinary least example demonstrating, 278–281, 279f, 282t
squares (OLS) overview, 263, 281
composite models and, 285 with single indicators, 274–278, 275f, 277f
local estimation and, 117 Partial-information methods. See Local estimation
maximum likelihood (ML) estimation and, 132 Partially overlapping models, 200f, 201
mediation analysis and, 353 Partially recursive models, 109–111, 109f
overview, 80, 83 Partially reduced form model, 293, 294
piecewise SEM and, 122–123, 123t Path-specific indirect effects, 366
Outcome (dependent) variables. See Endogenous variables Pattern invariance. See Weak invariance
Outcome model, 367 Pattern matching, 55
Outer estimation, 297–298, 298t Pattern-mixture modeling method, 155
Outer model, 295f, 296, 297–298 Pearson correlations, 50, 118, 240, 242
Outliers, 56–58, 59–61 Pedagogical approach, 3, 14–15
Overall discrepancy, 167 Penalized likelihood estimation, 317
Overall effect, 368 Permanent illusion. See Inverse probability errors
Overcontrol (overadjustment) bias, 85, 88, 89f Person-level fit, 159
Overdetermined equations. See Identification; Overidentification Piecewise latent growth model, 387f, 388
Overidentification, 36, 110–111, 246. See also Identification Piecewise SEM, 14, 118–129, 121t, 122t, 123t, 126f, 128t, 130
Platykurtic distributions, 59
p hacking, 23 PLS Mode A, 299–300
p values, 21–22, 24, 26, 159, 422–423 PLS Mode B, 299–300
Pairwise deletion, 54–55 PLS Mode BNNLS, 299
Pairwise estimation of Pearson correlations, 321 Polychoric correlations, 321–322, 322f, 420
Panel model, 332, 332f Polynomial growth model. See Linear and quadratic (polynomial)
Parallel growth process model, 386, 387f growth models
Parallel indicators, 251–252 Polynomial regression, 113–114
Parallel latent growth curve model, 386, 387f Polytomous items, 321, 322f
Parallel mediation model, 350, 352–353, 352f Pooling (combination) step of MI, 66. See also Multiple
Parameter estimates, 51, 137–138, 209–211, 210t imputation (MI)
Parameter nesting, 182–183 Poor-fit hypothesis, 175, 180
Parameterization, 399–401 Population discrepancy, 167
Parametric causal models Population-corrected robust RMSEA, 168
advanced topics in, 113–116, 113t, 114f, 115f, 116f Positive definite (PD) data, 48–49, 50–51
confounding in, 105–106, 106f Positive excess kurtosis, 59
Positive semidefinite, 300 Recursive models
Positive skew, 59, 61 identification requirements, 335f
Positivity assumptions, 363 indirect effects in, 128t
Posttreatment (treatment-dependent) confounders, 365 mean structures and, 149–150, 149f
Potential outcomes model (POM), 13–14 mediation analysis in cross-sectional designs and, 350, 351f
Power. See also Statistical power Monte Carlo simulations and, 178f
chi-square and, 159–160 overview, 109–111, 109f
example demonstrating, 176–177, 177t structural regression models and, 273
overview, 174–177, 177t Recursive structural models, 331, 333
power analysis, 44 Reduced-form R2, 346
power terms, 113–114 Reference group method, 235, 235f, 400
reporting results and, 423 Reference method, 256
sample size and, 16 Reference variable method, 233, 233f, 235f, 400
selecting a measurement model and, 224 Referent loading identification approach. See Reference variable
Precision, 22, 168, 174–177, 177t method
Predicted correlations, 144–145 Reflective measurement
Predicted covariances, 144–145 overview, 219–220, 219f, 228
Predicted means, 150 partial least squares path modeling (PLS-PM) and, 299–300
Prediction, 13 reporting results and, 416–417
Predictive fit indexes, 165, 190 selecting a measurement model and, 223–224
Predictive mean matching method, 64 Regions of significance, 362
Predictor growth model, 382–385, 383f, 384t Regression analysis
Pretest sensitization, 225 assessing your knowledge, 27–31
Principle of disconfirmability, 36 compared to SEM, 10
Principles of SEM, 4 overview, 2, 7, 19, 25–26, 27–31
PROCESS, 129, 358. See also Software for SEM analyses reporting standards and, 41, 44
Product term XW, 115–116, 115f traditional SEM and, 11–12, 18
Propagation of measurement error, 25 Regression diagnostics, 20
Propagation of specification error, 118 Regression method, 64
Proportionality, 102, 377–378 Regression residuals, 41
Pseudo-groups, 204 Regression substitution, 55
P-step (posterior), 64–65 Regression through the origin (RTO) method, 148
Psych package, 120–122, 121t Regression weights, 299
Psychometrics. See also Measurement Regularization methods, 317
assessing your knowledge, 27–31 Regularized SEM, 317
overview, 2, 19, 24–25, 26 Reject–support test, 159–160
reporting results and, 419 Relative variances, 60t, 62, 63t
training in, 23 Relativity of reliability, 25
Psychometrics Primer, 274, 316 Relevance, 82
Reliability, 15, 315–316
Quantile–quantile (Q-Q) plots, 146 Reliability coefficients, 24–25
Reliability induction, 24–25
R packages. See also Software for SEM analyses Reliability paradox, 160
analyzing nonnormal data and, 138 Replacement variate, 218
comparing nonnested models and, 192–193, 192t Replacing rule, 196, 197–198
composite models and, 288 Replication of results, 17, 22, 38, 416
conditional process analysis and, 358 Reporting results
indirect effects and, 129 approach to fit evaluation and, 172–173
measurement invariance and, 410 best practices in, 415–416
mediation analysis and, 354–355, 355t estimation, 420–422
overview, 2, 71–73, 71t example demonstrating, 41, 43–44, 43f
piecewise SEM and, 120–122, 121t family relations, 416–417
regularized SEM and, 317 identification, 418–419
structural regression models and, 270–272, 271t interpretation, 423
R2 statistic, 344–345, 345t, 423 measures, 419
Random hot-deck imputation (RHDI), 55 for mediation studies, 368–370
Random intercept-only model, 375–376 overview, 37
Random seed, 66 sample and data, 419–420
Rank condition, 338, 347–348 specification and respecification, 417–418, 422
Raw residuals. See Covariance residuals standards for, 7, 39–41, 40t, 42, 414, 415t
Reactive measurement, 225–226 tabulation, 422–423
Residual confounding, 82 Satorra–Saris method, 175, 176
Residual invariance. See Strict invariance Saturated correlates method, 139
Respecification of models. See also Models; Specification Saturated equations. See Identification; Just-identified equations
confirmatory factor analysis (CFA) and, 243–246, 245t Scalar invariance. See Strong invariance
empirical versus theoretical respecification, 184 Scaled (corrected) model test statistic, 138
nonrecursive models that are not identified, 336–337, 337f Scaled chi-squares, 161, 163, 185, 186. See also Chi-square
overview, 33, 37 Scaling, 103, 104, 135, 234–236, 235f, 399–401, 418
reporting results and, 40t, 42, 422 Scaling latent response variables, 323–324
Response styles, 396 Schmid–Leiman (SL) transformation, 258
Restricted measurement models, 229 Score test, 187
Reticular action model (RAM) Second-order CFA model, 255–258, 257f
basic CFA models and, 232–233, 233f Second-order factor, 255
basic growth models with no covariates and, 374–375, 375f Second-order growth curve model. See Curve-of-factors latent
diagrams for contracted chains and, 103–105, 103f growth model
example demonstrating, 111, 111f Second-stage conditional process model, 360
full SR models and, 263–265, 264f, 270, 270f Seed, 129
mean structures and, 149–150, 149f Seemingly unrelated regressions (SUR), 72
overview, 100–103, 111 SEM trees, 188
Retrospective (post hoc, observed) power analysis, 174–175 SEMNET, 9
Reversed indicator rule, 251 Sensitivity analysis, 49
Rho-A, 286 Sequential (serial) mediation model, 352, 352f
Ridge method, 51, 317 Sequential ignorability, 363, 365
Robust DWLS, 324, 329 Sequential regression multivariate imputation. See Fully
Robust ML, 138, 150, 161, 163, 185 conditional specification
Robust PLS, 286 Sign indeterminacy, 296
Robust PLSc method, 286 Sign rule, 297
Robust standard error estimates, 138 Significance testing. See also Statistical significance
Robust WLS, 328 assessing your knowledge, 27–31
Root mean square error of approximation of the path component assumptions of, 21
(RMSEA-P), 268 based on the RMSEA, 180–181
Root mean square error of approximation (RMSEA). See also confirmatory factor analysis (CFA) and, 234
MacCallum–RMSEA method distributions and, 59
measurement invariance and, 398–399 inaccuracies, errors, and hacking in, 21–23
model comparison and, 202 indirect effects and, 129
models with continuous and ordinal indicators, 325 measurement invariance and, 397, 408–409
multiple-group path analysis and, 207, 207t mediation analysis and, 353
overview, 165–168, 166f, 173–174, 174t overview, 2, 19, 23–24
power analysis and, 175–177, 177t power analysis and, 175–176
reporting results and, 423 reporting standards and, 42
significance testing and, 180–181 sample size and, 16
structural regression models and, 272 training in, 20
thresholds for, 170–172 Significance Testing Primer, 22
T-size indexes and, 181 Significosis, 20, 23
two-factor CFA model and, 239 Simple linear regressions, 116
Root mean square residual (RMR), 169 Simultaneity, 34, 81. See also Global estimation
Rotational indeterminacy, 230, 231 Single-door criterion, 94
Single-equation methods. See Local estimation
S (Symmetric) matrix, 100 Single-factor CFA model, 238, 238t, 280. See also Confirmatory
Sample means. See Centroids factor analysis (CFA)
Sample size. See also Small samples Single-factor measurement model, 44
approach to fit evaluation and, 172 Single-group approach, 212
chi-square and, 158 Single-imputation methods, 54
overview, 15–16, 415 Single-indicator measurement, 14, 417
replication and, 38 Singular matrix, 48
reporting results and, 44, 419–420 Skew normal distribution, 138
traditional SEM and, 12 Skew t distributions, 138
Sample-corrected robust RMSEA, 168 Skewness, 59–61, 60t, 138, 420
Sampling, 21, 50, 419–420 Skew-SEM, 138
Sandwich standard errors, 138 Slippery slope of nonsignificance. See Zero fallacy
Sargan overidentification test, 249 Small samples. See also Sample size
Satorra–Bentler scaled chi-square, 161, 163 adjusted test statistics for, 316–317
Small samples (continued) Standardized average individual case residuals, 198
Bayesian methods and regularized SEM and, 317 Standardized mean residuals, 208t, 209, 209t
common factor models in, 311–315, 311f, 312t, 313t, 314t, 315t Standardized residuals, 145–147, 146t
measurement errors in manifest-variable path models and, Standardized root mean square residual (SRMR)
315–316 basic growth models and, 380
overview, 2, 309, 318 common factor model in a small sample and, 312
parceling and, 310–311 composite models and, 288, 300
reporting results and, 419 formative measurement and, 225
SmartPLS program, 289. See also Software for SEM analyses measurement invariance and, 399
Sobel approximate standard error, 125, 128 model comparison and, 202
Social network analysis, 227 multiple-group path analysis and, 207t
Soft modeling, 12–13. See also Composite SEM overview, 165–166, 169–170, 173–174, 174t
Software for SEM analyses. See also individual software reporting results and, 423
packages/programs structural regression models and, 272
analyzing nonnormal data and, 137–138 thresholds for, 170–172
availability of, 17 two-factor CFA model and, 239
casewise ML and, 139, 155 Standards for reporting. See Reporting results
categorical data and, 327–329 Starting values, 133, 134, 247
causal loops and, 333–334 Stationarity, 333
commercial tools, 73–76 Statistical beauty, 17
commercial versus free tools, 70–71, 73t Statistical power, 16, 224. See also Power
common factor model in a small sample and, 312–315, 313f, Statistical precision, 16
314t, 315t Statistical significance, 20, 23, 24, 423. See also Significance testing
composite models and, 13, 76, 285, 288–289, 300–301 Steiger–Lind Root Mean Square Error of Approximation
diagrams for contracted chains and, 103–104 (RMSEA). See Root mean square error of approximation
distributions and, 58–59 (RMSEA)
fit indexes and, 165, 166 Steps in SEM, 32–38
fixed-X option and, 136 Stochastic regression imputation, 55
global estimation and, 150–151 Strict invariance, 397–398, 412
Henseler–Ogasawara (HO) specification and, 301, 302f Strictly confirmatory application, 33. See also Specification
human-computer interaction and, 68, 69–70 Strong invariance, 397, 402–408, 403f, 404f, 406t, 407t
indirect effects and, 128–129 Structural causal model (SCM). See Nonparametric SEM
information matrices and, 153–154 Structural equation modeling (SEM). See also Composite SEM;
inputting data, 46–47, 47t Nonparametric SEM; Traditional SEM
JWK model and, 12 best practices in, 414–424, 415t
local estimation and, 118 identification and, 34
mean structures and, 148 overview, 9, 16–17, 18, 414, 424
measurement errors in manifest-variable path models and, 316 Structural regression (SR) models
measurement invariance and, 408–409, 410 basic latent growth models and, 373
multiple-group path analysis and, 204, 206–207, 207t example demonstrating, 269–273, 270f, 271t, 272t, 274t,
nonparametric path model and, 14, 96–98, 97t, 98t 278–281, 279f, 282t
nonpositive definite (NPD) data matrices and, 50–51 full SR models, 263–265, 264f
overview, 2, 4, 8–9, 34, 37, 67–68, 76 models with continuous and ordinal indicators, 325
partial least squares path modeling (PLS-PM) and, 300–301 other modeling strategies for, 268–269
piecewise SEM and, 118, 120–122, 121t overview, 4, 263, 281, 283
power analysis and, 175, 176 partial SR models with single indicators, 274–278, 275f, 277f
reporting results and, 420–422 two-step modeling and, 265–268, 266f
sample size and, 16, 316–317 Structural-after-measurement (SAM) approach, 309–310
scaled chi-square difference tests and, 186 Structurally causal model (SCM), 1, 4, 9–10
standardized residuals and, 145–146 Structurally different methods (raters), 256
using, 68, 70 Structure coefficient, 242
Spatial autoregression, 105–106 Sufficient (adjustment) set, 92–93
Specific variance, 219–220, 229–230 Synthesis model, 295f, 296
Specification. See also Models; Respecification of models System matrix, 347–348
overview, 2, 32–33
reporting results and, 40t, 42, 417–418 Tabu list, 188
specification searches, 183 Tabu search, 188
Spurious mediation, 108 Tabulation, 422–423
Squares and cross products matrix (SSCP), 148 Tailored tests, 329
Stability index, 343 Tailored-fit evaluation strategy, 268
Standard errors, 24, 133–134, 137, 298t, 312 Tau-equivalent indicators, 251–252, 315–316
Standard multiple regression (identity link), 117 Tau-equivalent reliability, 241
Temporal precedence, 86 Unanalyzed association, 101
Test bias, 393 Unconditional causality, 107
Test statistics, 316–317, 341t Unconditional growth models, 382
Testable implications, 91f, 92 Unconditional linear effects, 115
Tetrachoric correlation. See Polychoric correlations Unconditional multivariate normality, 136
Theoretical concept, 218 Underdetermined equations. See Identification;
Theoretical respecification, 184. See also Respecification of models Underidentification
Theta scaling (parameterization), 323–324 Underidentification, 35, 246. See also Identification
Third central moment, 59 Undirected path, 80
Third standardized moment, 59 Unidimensional measurement, 232, 311, 315–316
Three-indicator rule, 234 Union basis set, 90–92, 91t, 92t, 118
Three-stage least squares (3SLS), 83 Unique variance, 219–220, 229
Thresholds, 321, 322f, 323, 323f, 326, 399, 409–410 Unit loading identification (ULI) constraint, 103, 233, 235f
Time binning, 385 Unit variance identification (UVI) constraint, 235f
Time structured data, 372–373 Univariate nonnormality, 59
Time-invariant covariates, 373, 382–385, 383f, 384t Univariate outlier, 56–57. See also Outliers
Time-varying covariates, 374 Unknown weights, 299
Timing criterion, 107–109 Unmeasured confounders, 80, 81, 81f
Tolerance, 56 Unnormalized (raw) noncentrality parameter, 166
Total causal effect. See Total effect Unrestricted measurement models, 229
Total effect Unstandardized estimates, 44
back-door criterion and, 93 Unstandardized residual path coefficient, 103
example demonstrating, 96–98, 96f, 97t, 98t Unweighted least squares (ULS) method, 140
graphical identification criteria and, 92–96, 94f, 95f Upsilon, 354–355
overview, 80 User-specified weights, 299
single-door criterion and, 94
Total indirect effects, 143 Valid tracing, 144–145
Total interventional effect (TIE), 367–368 Validity (valid research hypothesis) fallacy, 22
Traditional SEM, 1–2, 11–12, 15–16. See also Covariance Vanishing partial correlations, 91f, 92
structure analysis; Structural equation modeling (SEM) Vanishing tetrad test (VTT), 252
Training in graduate programs, 19–20 Vanishing tetrads, 252
Trait-specific method effects, 253 Variance estimates, 132–133
Transformed sample, 137–138 Variance inflation factor (VIF), 56, 419
Transparency, 39, 408, 420 Variance standardization method, 235, 235f, 400
Triangle inequality, 48 Variance-adjusted chi-square, 161
Trifactor models, 256 Variance-based SEM. See Composite SEM
Trimming, 183–184, 398 Vertices. See Nodes
T-size indexes, 180–181 Vicious cycle, 339
Tucker coefficient of congruence, 38 Virtuous cycle, 339
Tucker–Lewis index (TLI), 169 v-MAR, 52. See also Missing at random (MAR)
2+ emitted paths rule, 293
Two-factor CFA model. See also Confirmatory factor analysis Wald test, 187, 211, 212
(CFA) Weak instrument, 83–84
equivalent CFA models and, 249–251, 250f, 251f Weak invariance, 396, 402–408, 403f, 404f, 406t, 407t
measurement invariance and, 402–408, 403f, 404f, 406t, 407t Weakly informative priors, 317
overview, 239–242, 240t, 243t Weight matrix, 140–141
Two-index strategy, 170 Weighted least squares (WLS), 324, 328, 421
Two-indicator rule, 234 Weights, 298–299, 298t
Two-stage least squares (2SLS) Welch–James t test, 408
common factor model in a small sample and, 314–315, 314t, Whiskers, 58. See also Box plots
315t Within-group completely standardized solution, 204
common factor models and, 309 Within-group standardized solution, 204
composite models and, 285, 288 Within-imputation variance, 65
estimation problems in CFA and, 248–249 Wright’s tracing rules, 144–145
mediation analysis and, 353
overview, 82–83 Z-bias, 88
piecewise SEM and, 122–123 Zero fallacy, 22
Two-step estimation, 310 Zero matrix, 157
Two-step identification rule, 264–265 Zero vector, 157
Two-step modeling, 265–268, 266f, 269–273, 270f, 271t, 272t, Zero-error variance model, 293, 294
274t, 281
Type I errors, 212, 315

About the Author

Rex B. Kline, PhD, is Professor of Psychology at Concordia University in
Montréal, Québec, Canada. Since earning a doctorate in clinical psychology,
he has conducted research on the psychometric evaluation of cognitive
abilities, behavioral and scholastic assessment of children, structural equation
modeling, training of researchers, statistics reform in the behavioral sciences,
and usability engineering in computer science. Dr. Kline has published a
number of chapters, journal articles, and books in these areas.