

Article

Understanding the Model Size Effect on SEM Fit Indices

Educational and Psychological Measurement, 1–25
© The Author(s) 2018
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164418783530
journals.sagepub.com/home/epm

Dexin Shi1, Taehun Lee2 and Alberto Maydeu-Olivares1

Abstract
This study investigated the effect the number of observed variables (p) has on three
structural equation modeling indices: the comparative fit index (CFI), the Tucker–
Lewis index (TLI), and the root mean square error of approximation (RMSEA). The
behaviors of the population fit indices and their sample estimates were compared
under various conditions created by manipulating the number of observed variables,
the types of model misspecification, the sample size, and the magnitude of factor load-
ings. The results showed that the effect of p on the population CFI and TLI depended
on the type of specification error, whereas a higher p was associated with lower val-
ues of the population RMSEA regardless of the type of model misspecification. In
finite samples, all three fit indices tended to yield estimates that suggested a worse fit
than their population counterparts, which was more pronounced with a smaller sam-
ple size, higher p, and lower factor loading.

Keywords
model size effect, structural equation modeling (SEM), fit indices

1 University of South Carolina, Columbia, SC, USA
2 Chung-Ang University, Seoul, South Korea

Corresponding Author:
Taehun Lee, Department of Psychology, Chung-Ang University, 84 Heukseok-Ro, Dongjak-Gu, Seoul
60974, South Korea.
Email: lee.taehun@gmail.com

In applications of structural equation modeling (SEM), a critical step is to evaluate
the goodness of fit of the proposed model with the data. When maximum likelihood
is used to estimate a model, the likelihood ratio (LR) test statistic is regarded as the
most commonly used test for assessing the overall goodness of fit (Jöreskog, 1969;
Maydeu-Olivares, 2017). Assuming that the proposed model is correctly specified,
the test statistic asymptotically follows a central chi-square distribution. Therefore,
the chi-square test allows researchers to evaluate the fitness of a model by using the
null hypothesis significance test approach. For the chi-square test to be valid, one
important assumption is that the sample size (N) should be sufficiently large. It has
been generally believed that fitting a large SEM model (with many observed vari-
ables)1 to moderate or small samples results in an upwardly biased estimate for the
chi-square statistic and, thus, an inflated Type I error rate. This upward bias in the
LR-based chi-square statistic is known as the model size effect (Herzog, Boomsma,
& Reinecke, 2007; Moshagen, 2012; Shi, Lee, & Terry, 2015, 2017; Yuan, Tian, &
Yanagihara, 2015), and it has important ramifications for empirical practices.
On one hand, large models are frequently encountered in psychological research.
For example, in longitudinal studies with latent variables, a boost in the number of
observed variables is expected as the number of measurement occasions increases.
Jackson, Gillaspy, and Purc-Stephenson (2009) reviewed 194 published
studies and found that the median number of observed items included in the models
was 17, and 25% of models included more than 24 items. In addition, models with
more observed variables can be more desirable from many perspectives. For exam-
ple, based on classic psychometric theory, the use of many variables or items (p) is
suggested for achieving higher reliability (Lord & Novick, 1968; McDonald, 1999).
Research has also shown that using more items in the measurement model is associ-
ated with more proper solutions and more accurate parameter estimates (Marsh, Hau,
Balla, & Grayson, 1998). On the other hand, fitting large multiple-item models in
SEM also leads to potential difficulties with using the LR-based chi-square tests
owing to the problem of a small sample. Paradoxically, the model size effect sug-
gests that well-fitting models and many well-established and desirable goals (e.g.,
reliable scores) are actually competing against one another. The model size effect on
LR-based chi-square statistics has been investigated in many studies (Herzog et al.,
2007; Moshagen, 2012; Shi, Lee, et al., 2015, 2017; Yuan et al., 2015).
In practice, the chi-square test is "not always the final word in assessing fit"
(West, Taylor, & Wu, 2012, p. 211). A major concern is that the LR chi-square test
is a test of exact fit, meaning that the null hypothesis under test is that there is no
discrepancy between the hypothesized model and the true data-generating process. In
practice, the model under consideration is almost always incorrect to some degree
(Box, 1979; MacCallum, 2003). As a result, the chi-square test of exact fit often
rejects the null hypothesis, especially in large samples, even when the postulated
model is only trivially false. As such, a host of goodness-of-fit measures have been
developed in an attempt to provide additional information about the usefulness of the
hypothesized model when the solution is quite feasible and explains the observed
data quite well. Many fit indices are developed based on the chi-square test or com-
puted using the LR chi-square in their formulation. In this article, we consider three
commonly used fit indices: the comparative fit index (CFI), the Tucker–Lewis index
(TLI), and the root mean square error of approximation (RMSEA). The formulas for
these fit indices are described below.

The CFI (Bentler, 1990) measures the relative improvement in fit going from the
baseline model to the postulated model. Following Bentler (1990, p. 240), the
population CFI can be expressed as follows:

\mathrm{CFI} = 1 - \frac{F_k}{F_0},

where F_k and F_0 represent the minimum of some discrepancy function for the postu-
lated model and the baseline model, respectively. The sample CFI is estimated as
follows:
\widehat{\mathrm{CFI}} = \frac{\max(\chi^2_0 - df_0,\, 0) - \max(\chi^2_k - df_k,\, 0)}{\max(\chi^2_0 - df_0,\, 0)},

where χ²_0 and df_0 denote the chi-square statistic and degrees of freedom for the base-
line model, and χ²_k and df_k represent the chi-square statistic and degrees of freedom
for the postulated model, respectively. CFI is a normed fit index in the sense that it
ranges between 0 and 1, with higher values indicating a better fit. The most com-
monly used criterion for a good fit is CFI ≥ .95 (Hu & Bentler, 1999; West et al.,
2012).
The TLI (Tucker & Lewis, 1973) measures a relative reduction in misfit per
degree of freedom. This index was originally proposed by Tucker and Lewis (1973)
in the context of exploratory factor analysis and later generalized to the covariance
structure analysis context and labeled as the nonnormed fit index by Bentler and
Bonett (1980). This index is nonnormed in that its value can occasionally be negative
or exceed 1. Following the expression of Bentler (1990, p. 241), the population TLI
can be expressed as follows:
\mathrm{TLI} = 1 - \frac{F_k / df_k}{F_0 / df_0},

where F_0/df_0 and F_k/df_k represent the misfit per degree of freedom for the baseline
model and the postulated model, respectively.
The sample estimator of TLI can be given as follows:
\widehat{\mathrm{TLI}} = \frac{\chi^2_0 / df_0 - \chi^2_k / df_k}{\chi^2_0 / df_0 - 1}.

In general, TLI ≥ .95 is a commonly used cutoff criterion for the goodness of fit
(Hu & Bentler, 1999; West et al., 2012).
The RMSEA (Steiger, 1989, 1990; Steiger & Lind, 1980) measures the discre-
pancy due to the approximation per degree of freedom as follows:
\mathrm{RMSEA} = \sqrt{\frac{F_k}{df_k}},

where F_k denotes the minimum of some discrepancy function between the popula-
tion covariance matrix Σ and the model-implied covariance matrix Σ_0 for the
hypothesized model. The sample estimate of RMSEA is defined as follows (Browne
& Cudeck, 1993):
\widehat{\mathrm{RMSEA}} = \sqrt{\frac{\max(\chi^2 - df,\, 0)}{df\,(N - 1)}}.

The RMSEA is a badness-of-fit measure, yielding lower values for a better fit. An
RMSEA ≤ .06 could be considered acceptable (Hu & Bentler, 1999), whereas a
model with an RMSEA ≥ .10 is unworthy of serious consideration (Browne &
Cudeck, 1993).
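
To make these formulas concrete, the following R sketch computes the three sample indices from the chi-square statistics and degrees of freedom of a postulated model and its baseline model; the numerical inputs are hypothetical and serve only to illustrate the arithmetic.

# Hypothetical chi-square statistics and degrees of freedom
chisq_b <- 5400; df_b <- 45   # baseline (independence) model
chisq_k <- 60;   df_k <- 35   # postulated model
N <- 500                      # sample size

# Sample CFI: relative improvement in fit over the baseline model
cfi <- 1 - max(chisq_k - df_k, 0) / max(chisq_b - df_b, 0)

# Sample TLI: relative reduction in misfit per degree of freedom
tli <- (chisq_b / df_b - chisq_k / df_k) / (chisq_b / df_b - 1)

# Sample RMSEA: discrepancy due to approximation per degree of freedom
rmsea <- sqrt(max(chisq_k - df_k, 0) / (df_k * (N - 1)))

round(c(CFI = cfi, TLI = tli, RMSEA = rmsea), 3)
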
The above three fit indices have been routinely reported by SEM software (e.g.,
Mplus) and have been used as standard tools for evaluating model fit (Hancock &
Mueller, 2010; McDonald & Ho, 2002). As discussed in the model size literature
(e.g., Herzog et al., 2007; Moshagen, 2012), these fit indices are also likely to be
influenced by the model size because they are functions of the LR chi-square statistic
that tends to be upwardly biased in large models. Typically, applied researchers rely
heavily on practical fit indices rather than on the formal chi-square test when evalu-
ating specified SEM models. Therefore, it is essential to understand whether or not
selected fit indices tend to increase or decrease as the model size increases for the
appropriate application of practical fit indices.
Previous studies have shed some light on the effect of model size on practical fit
indices. Under correctly specified models, researchers have focused on the behaviors
of the fit indices in the sample.2 The results showed that increasing the number of
indicators (p) led to a decline in the average sample estimates of CFI and TLI, indi-
cating that the model fit worsens (Anderson & Gerbing, 1984; Ding, Velicer, &
Harlow, 1995; Kenny & McCoach, 2003). However, it was found that the average
RMSEA tended to decrease (i.e., indicating an improved fit) as more indicators were
added to the correctly specified models (Kenny & McCoach, 2003).
Under the conditions of misspecified models, most previous studies examined the
effect of the number of indicators on the population values of the selected practical
fit indices. Kenny and McCoach (2003) investigated the effect of the number of
observed variables (p) on the population values of CFI, TLI, and RMSEA. They
found that as p increased the population RMSEA tended to decrease, indicating that
the model fit improved regardless of the type of model misspecification; for CFI and
TLI, their population values tended to decrease (i.e., indicating a worse fit) when the
model misspecifications were introduced by fitting a single-factor model to two-
factor data or by omitting cross-loadings. However, when the models were misspeci-
fied by ignoring the nonzero residual correlations, it was found that the population
values of CFI and TLI tended to increase (i.e., indications of a better fit) as p
increased. Breivik and Olsson (2001) also found similar patterns of behaviors for the
population values of CFI and RMSEA. More recently, Savalei (2012) also found that
in general, the population RMSEA tended to eventually decrease as the number of
indicators (p) increased regardless of the type of model misspecification.
Although suggestive, the findings of these studies are somewhat limited for the
following reasons. First, the number of observed variables (p) manipulated in the
previous studies was at most 40. For example, the maximum number of
observed variables (p) considered by Ding et al. (1995) was 18. For Kenny and
McCoach (2003), the number of observed variables (p) ranged from 4 to 25 for cor-
rectly specified models and from 4 to 40 for misspecified models. In psychological
studies, many questionnaires can include an extremely high number of items (i.e., p).
For example, the commonly used Revised NEO Personality Inventory (NEO PI-R)
has 240 items (Costa & McCrae, 1992). In education-related studies, many compre-
hensive tests also include more than 100 items (e.g., ETS Major Field Tests; Ling,
2012). The CFA models have also been used in studying gene expression microarray
data, where hundreds of genes are considered as observed indicators (Xie & Bentler,
2003). In the above applications, researchers fitted factor analysis models with
an extremely large number of observed variables, and the practical fit indices (e.g., CFI,
TLI, and RMSEA) were interpreted and used to evaluate the models. Therefore, the
findings based on simulation studies with models of small to moderate size may not
be generalizable to psychological research using models of a much larger size.
Second, under model misspecification conditions, most previous studies focused
on the behavior of the population fit indices. In practice, researchers would only be
able to obtain and interpret the sample estimates of the fit indices, not the population
values. As shown earlier, the estimates of CFI, TLI, and RMSEA are functions of the
chi-square statistic, whose bias is affected by both the sample size and the model size
(Moshagen, 2012; Shi, Lee, et al., 2017). Therefore, it would be unwise to simply
assume that the findings about population values of fit indices are directly applicable
to their sample estimates. Unfortunately, the effect of model size on the behaviors of
the sample fit indices under realistic settings involving model misspecifications has
not yet been systematically examined.
To fill these gaps in the literature, this article aims to understand the effects the
number of observed variables (p) has on SEM fit indices by using a more compre-
hensive simulation study. In our simulation design, we manipulated the value of p to
reach 120 observed variables so that the findings can be generalized to more com-
plex modeling situations. Moreover, in misspecified models, the behaviors of both
the sample fit indices and their population values were studied and the results were
compared. We close our discussion by offering some practical guidance on using the
fit indices in large SEM models.

Monte Carlo Simulation


We performed a simulation study to investigate the effect of model size on CFI, TLI,
and RMSEA in both correctly specified and misspecified models. We examined the
following three types of model (mis)specifications:

1. Correctly specified models: The true data-generating model was a single-
   factor model, and the same model was fitted to the simulated data.
2. Misspecified dimensionality: The data-generating model was a two-factor
CFA model with an interfactor correlation of .90. A single-factor model was
fitted to the simulated data.
3. Omitted residual correlations: The data-generating model was a single-factor
model with three correlated residuals. The true value for the residual correla-
tions was .15. The fitted model was a single-factor model with correlations
among residual terms fixed to zero.
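
As an illustration, the data-generating structures described above could be written in lavaan model syntax roughly as follows; the sketch assumes p = 10 indicators and loadings of .40, and the variable names, fixed values, and builder code are placeholders rather than the authors' actual simsem setup.

library(lavaan)

p     <- 10      # number of observed variables in this illustration
lam   <- 0.4     # standardized factor loading
items <- paste0("y", 1:p)

# (2) Misspecified dimensionality: two factors correlated at .90; residual
#     variances fixed at 1 - lam^2 so that each indicator has unit variance
gen_two_factor <- paste(
  paste("f1 =~", paste0(lam, "*", items[1:(p / 2)], collapse = " + ")),
  paste("f2 =~", paste0(lam, "*", items[(p / 2 + 1):p], collapse = " + ")),
  "f1 ~~ 1*f1", "f2 ~~ 1*f2", "f1 ~~ 0.9*f2",
  paste0(items, " ~~ ", 1 - lam^2, "*", items, collapse = "\n"),
  sep = "\n")

# Condition (1) uses a one-factor generating model built the same way;
# condition (3) adds three residual covariances such as y1 ~~ 0.126*y2,
# which corresponds to a residual correlation of .15 when the residual
# variances equal 1 - 0.4^2 = 0.84.

# Analysis model fitted under every condition: a plain one-factor CFA
# with all residual covariances fixed to zero
fit_one_factor <- paste("f =~", paste(items, collapse = " + "))
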

It should be noted that the degree of model misspecification examined in the cur-
rent study is substantively ignorable (see Shi, Maydeu-Olivares, & DiStefano, 2018),
which is intended to simulate situations where the specified model is trivially false
or the misspecified component would be uninteresting to most researchers. That is,
the two factors with .90 correlation cannot or need not be meaningfully discriminated
in practice and the precise estimation of residual correlations of size equal to .15
could be considered meaningless or uninteresting to most researchers. It would be
fair to say that either correctly specified or only slightly misspecified models should
be retained and interpreted, because inferences drawn from poorly fitting models can
be misleading (Saris, Satorra, & van der Veld, 2009). Therefore, in the present arti-
cle, the effect of model size on the behavior of the selected fit indices was studied
under the conditions involving slight model misspecifications.
For model identification, the variances of the factors were set to 1.0. Other vari-
ables manipulated in the simulation are described below.

•  Model size: Model size was indicated by the total number of observed vari-
   ables (p). The number of observed variables included 10, 30, 60, 90, and 120.
   When the population model had a two-factor structure, the same number of
   observed variables was loaded on each factor.
•  Sample size: Sample sizes included 200, 500, and 1,000.
•  Levels of factor loadings: We included items with low (.40) or high (.80) fac-
   tor loadings (l), which represent either weak or strong factor(s). The variances
   of the error terms were set as 1 − l².

In summary, the number of conditions examined was 90 = 3 (types of model speci-
fication) × 5 (model size levels) × 3 (sample size levels) × 2 (factor loading levels).
For each simulated condition, 1,000 replications were generated with the simsem
package in R (Pornprasertmanit, Miller, & Schoemann, 2012; R Development Core
Team, 2015). The observed data were generated from a multivariate normal
distribution.
For each condition, we first fit the one-factor CFA models to the population covar-
iance matrix and computed the population values for CFI, TLI, and RMSEA. A series
of single-factor models with a varying number of indicators (p) were then fitted to the
simulated data sets, from which the empirical distributions of CFI, TLI, and RMSEA
were obtained across 1,000 replications. All data analyses were conducted with the
maximum likelihood estimation using the lavaan package in R (R Development Core
Team, 2015; Rosseel, 2012). All replications converged for all conditions.
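
The per-replication workflow can be sketched with base R and lavaan alone (the authors used the simsem package for data generation); the condition shown, the seed, and all object names are illustrative placeholders under the design values described above.

library(lavaan)
library(MASS)

# Population covariance matrix for one condition: two factors (r = .90),
# p = 10 indicators, loadings .40, residual variances 1 - .40^2
Lambda <- cbind(c(rep(0.4, 5), rep(0, 5)), c(rep(0, 5), rep(0.4, 5)))
Phi    <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
Sigma  <- Lambda %*% Phi %*% t(Lambda) + diag(1 - 0.4^2, 10)
dimnames(Sigma) <- list(paste0("y", 1:10), paste0("y", 1:10))

one_factor <- paste("f =~", paste(paste0("y", 1:10), collapse = " + "))

# Sample fit indices for a single replication (N = 200, multivariate normal data)
set.seed(123)
dat <- as.data.frame(mvrnorm(200, mu = rep(0, 10), Sigma = Sigma))
colnames(dat) <- rownames(Sigma)
fit <- cfa(one_factor, data = dat, estimator = "ML")
fitMeasures(fit, c("cfi", "tli", "rmsea"))

# Population fit indices: fit the same model to Sigma itself; with a very large
# nominal N the chi-square-based formulas approach the population values
pop <- cfa(one_factor, sample.cov = Sigma, sample.nobs = 1e6,
           sample.cov.rescale = FALSE)
fitMeasures(pop, c("cfi", "tli", "rmsea"))
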
The behaviors of CFI, TLI, and RMSEA across different simulation conditions
are summarized in Tables 1 to 3. Specifically, for each fit index, we reported their
population values and the average sample estimates and computed the relative differ-
ences (biases) between the two quantities. That is, relative bias (RB) was computed
as follows:

\mathrm{RB} = \frac{\theta_{est} - \theta_{pop}}{\theta_{pop}},

where θ_est represents the average of the sample estimates of the fit index across the
1,000 replications and θ_pop indicates the population value of the fit index. Following
recommendations from previous studies, RBs less than 10% (in absolute value) were
considered acceptable (Muthén, Kaplan, & Hollis, 1987; Muthén & Muthén, 2002;
Shi, Song, & Lewis, 2017). It should be noted that for correctly specified models the
population RMSEA is zero, and therefore, RB is undefined. Under such conditions,
the absolute bias (AB) was instead computed as follows:

\mathrm{AB} = \theta_{est} - \theta_{pop}.

We also recognized that the values of RB can be deceptively high when the popula-
tion RMSEA values are near zero (e.g., .004). We therefore reported both RB and AB
for RMSEA under misspecified conditions.
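
In code, the two bias measures reduce to a couple of lines; the values below are hypothetical stand-ins for the population value and replication estimates in one cell of the design.

pop_value <- 0.994                       # population value of the index in one cell
estimates <- c(0.966, 0.963, 0.967)      # in the study, 1,000 estimates per cell

relative_bias <- (mean(estimates) - pop_value) / pop_value   # RB
absolute_bias <-  mean(estimates) - pop_value                 # AB
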
Analyses of variance (ANOVAs) were conducted with the RB as the dependent
variable and the simulation conditions (and their interactions) as independent vari-
ables. An eta-square (η²) value above 5% was used to identify conditions that con-
tributed to sizable amounts of variability in the outcome. For visual presentations of
the patterns (see Figures 1-3), we also plotted the average sample estimates of the fit
indices against model size levels for different levels of sample size (N = 200, 500,
1,000, population), factor loading (l = .40, .80), and misspecification (no misspecifi-
cation, omitted correlated residuals, and misspecified dimensionality). A horizontal
line has been drawn in these figures to mark the cutoff values for CFI (.95), TLI
(.95), and RMSEA (.06) suggested by Hu and Bentler (1999).

Comparative Fit Index


Table 1 and Figure 1 demonstrate the behaviors of population values and sample
estimates of CFI as a function of model size (p), factor loading (l), and sample size
(N) under the three conditions of model specification. As seen in the figure, for the
correctly specified models, the population CFI is a constant of 1.0, independent of p
and l (also see the first column in Table 1). Under the condition of misspecified

Table 1. Effect of p on CFI.

                                Correctly specified        Misspecified dimensionality    Omitted residual correlations
Factor loadings      N      p   POP     EST     RB         POP     EST     RB             POP     EST     RB

.4                 200     10   1.000   0.972   -0.03      0.994   0.965   -0.03          0.901   0.894   -0.01
                           30   1.000   0.956   -0.04      0.989   0.944   -0.05          0.973   0.935   -0.04
                           60   1.000   0.872   -0.13      0.985   0.853   -0.13          0.988   0.863   -0.13
                           90   1.000   0.751   -0.25      0.981   0.728   -0.26          0.993   0.747   -0.25
                          120   1.000   0.611   -0.39      0.978   0.588   -0.40          0.995   0.609   -0.39
                   500     10   1.000   0.990   -0.01      0.994   0.984   -0.01          0.901   0.899    0.00
                           30   1.000   0.989   -0.01      0.989   0.980   -0.01          0.973   0.966   -0.01
                           60   1.000   0.979   -0.02      0.985   0.963   -0.02          0.988   0.968   -0.02
                           90   1.000   0.958   -0.04      0.981   0.938   -0.04          0.993   0.952   -0.04
                          120   1.000   0.928   -0.07      0.978   0.905   -0.07          0.995   0.924   -0.07
                 1,000     10   1.000   0.995   -0.01      0.994   0.990    0.00          0.901   0.900    0.00
                           30   1.000   0.995   -0.01      0.989   0.987    0.00          0.973   0.972    0.00
                           60   1.000   0.994   -0.01      0.985   0.980   -0.01          0.988   0.983   -0.01
                           90   1.000   0.990   -0.01      0.981   0.970   -0.01          0.993   0.983   -0.01
                          120   1.000   0.983   -0.02      0.978   0.960   -0.02          0.995   0.977   -0.02
.8                 200     10   1.000   0.997    0.00      0.968   0.967    0.00          0.938   0.936    0.00
                           30   1.000   0.994   -0.01      0.951   0.945   -0.01          0.980   0.975   -0.01
                           60   1.000   0.980   -0.02      0.941   0.921   -0.02          0.990   0.970   -0.02
                           90   1.000   0.954   -0.05      0.936   0.890   -0.05          0.994   0.948   -0.05
                          120   1.000   0.912   -0.09      0.932   0.849   -0.09          0.995   0.908   -0.09
                   500     10   1.000   0.999    0.00      0.968   0.968    0.00          0.938   0.937    0.00
                           30   1.000   0.999    0.00      0.951   0.949    0.00          0.980   0.979    0.00
                           60   1.000   0.997    0.00      0.941   0.938    0.00          0.990   0.987    0.00
                           90   1.000   0.994   -0.01      0.936   0.929   -0.01          0.994   0.987   -0.01
                          120   1.000   0.988   -0.01      0.932   0.921   -0.01          0.995   0.984   -0.01
                 1,000     10   1.000   0.999    0.00      0.968   0.968    0.00          0.938   0.937    0.00
                           30   1.000   0.999    0.00      0.951   0.950    0.00          0.980   0.980    0.00
                           60   1.000   0.999    0.00      0.941   0.940    0.00          0.990   0.990    0.00
                           90   1.000   0.998    0.00      0.936   0.934    0.00          0.994   0.992    0.00
                          120   1.000   0.997    0.00      0.932   0.930    0.00          0.995   0.993    0.00

Note. CFI = comparative fit index; factor loadings = (standardized) factor loadings; N = sample size; p = number of observed variables; RB = relative bias.
POP indicates the population CFI; EST indicates the average sample estimates of CFI. For POP and EST, values less than 0.95 are underlined. |RB| larger than
0.10 (10%) are in boldface.

Figure 1. Effect of p on comparative fit index (CFI).

dimensionality where a single-factor model was fitted to the data sets generated by
the two-factor models with r = .90, as p increased from 10 to 120, the population
CFI tended to decrease from .994 to .978 when l = .40, and from .968 to .932 when
l = .80 (the second column in Table 1). Under the condition of omitted residual cov-
ariances, the population CFI tended to increase with p from .901 to .995 when l =
.40, and from .938 to .995 when l = .80 (the third column in Table 1). As a remin-
der, it should be noted that we intended to simulate modeling situations where the
degree of specification error is relatively small. This was evidenced by the popula-
tion values of CFI exceeding the Hu–Bentler cutoff (CFI ≥ .95) under most simula-
tion conditions, with the lowest value being .901.
The results from the ANOVA showed that the important sources of variance in
the (relative) biases in estimating the population CFI were, in order of η², the sample
size (N, η² = .22), the interaction between the number of observed variables and sample
size (N × p, η² = .16), and the model size (p, η² = .14), followed by the magnitude of
factor loadings (l) and its interactions with the sample size (l × N, η² = .09) and
model size (l × p, η² = .06).
In general, as shown in Table 1, the population CFI can be accurately approxi-
mated with a small RB (|RB| < 10%) when the sample size is large (i.e., N ≥ 500)
across all conditions. On the other hand, when the sample size was small (e.g., N = 200),
a noticeable RB (i.e., |RB| ≥ 10%; see values in bold in Table 1) can occur when the
model size is large (p ≥ 60) and the quality of measurement is mediocre (l = .40).
In addition, the effect of p became more conspicuous when the magnitude of factor
loadings was small (l = .40). For example, when fitting correctly specified models
with N = 200 and l = .40, the RBs (in absolute value) increased from 3% to 39% as
p increased from 10 to 120. However, if the magnitude of factor loadings was large (l =
.80), the increase in RB (in absolute value) was smaller, ranging from 0% (p =
10) to 9% (p = 120).
Because population CFI values tended to be underestimated across all
conditions, models with no specification error or with minor specification errors can
be rejected if evaluated solely on the basis of their sample CFI. The results presented in
Table 1 suggest that rejecting models whose population CFI values indicate
a well-fitting model is indeed likely to occur when the model size is very large
(e.g., p ≥ 90; see the underlined values in Table 1). For example, under correctly
specified models, when the factor loadings were .40 and N = 200, as p increased
from 10 to 120, the average estimates of CFI changed from .972 (closely fitting
model) to .611 (poorly fitting model). A similar pattern was observed when collap-
sing two factors with r = .90; when p increased from 10 to 120, the average CFI
dropped from .965 to .588, leading to different conclusions in terms of the model fit.
When the misspecification was caused by omitting residual correlations, the average
CFI initially increased as more observed variables were added to the fitted model;
but as p continued increasing, the average CFI would eventually decrease. Taking N
= 200 and factor loadings = .80 as an example, when p increased from 10 to 30, the
mean CFI increased from .936 to .975; the average CFI then dropped to .948 (p =
90) and finally reached its lowest value of .908 (p = 120).

Tucker–Lewis Index
As shown in Table 2 and Figure 2, the behaviors of both population values and their
sample estimates of TLI are virtually indistinguishable from the patterns of CFI in
large models. For correctly specified models, the population TLI is a constant of
1.00, regardless of the model size (the first column in Table 2). Under the condition
of misspecified dimensionality, the population TLI tended to decrease as p increased
(the second column in Table 2), whereas the population TLI tended to increase when
three residual covariances were omitted (the third column in Table 2). Taking condi-
tions with l = .40 as an example, when one-factor models were fitted to two-factor
data with r = .90, as p increased from 10 to 120, the population TLI slightly
decreased from .995 to .978. When omitting residual correlations, a higher p was

Table 2. Effect of p on TLI.

                                Correctly specified        Misspecified dimensionality    Omitted residual correlations
Factor loadings      N      p   POP     EST     RB         POP     EST     RB             POP     EST     RB

.4                 200     10   1.000   0.994   -0.01      0.995   0.985   -0.01          0.923   0.865   -0.06
                           30   1.000   0.958   -0.04      0.990   0.944   -0.05          0.975   0.931   -0.04
                           60   1.000   0.867   -0.13      0.985   0.848   -0.14          0.989   0.858   -0.13
                           90   1.000   0.745   -0.26      0.981   0.722   -0.26          0.993   0.741   -0.25
                          120   1.000   0.605   -0.40      0.978   0.581   -0.40          0.995   0.603   -0.39
                   500     10   1.000   0.999    0.00      0.995   0.990   -0.01          0.923   0.871   -0.05
                           30   1.000   0.992   -0.01      0.990   0.980   -0.01          0.975   0.964   -0.01
                           60   1.000   0.978   -0.02      0.985   0.962   -0.02          0.989   0.967   -0.02
                           90   1.000   0.957   -0.04      0.981   0.936   -0.05          0.993   0.950   -0.04
                          120   1.000   0.927   -0.07      0.978   0.903   -0.08          0.995   0.922   -0.07
                 1,000     10   1.000   0.999    0.00      0.995   0.991    0.00          0.923   0.872   -0.05
                           30   1.000   0.998    0.00      0.990   0.986    0.00          0.975   0.969   -0.01
                           60   1.000   0.995   -0.01      0.985   0.979   -0.01          0.989   0.983   -0.01
                           90   1.000   0.990   -0.01      0.981   0.970   -0.01          0.993   0.982   -0.01
                          120   1.000   0.982   -0.02      0.978   0.959   -0.02          0.995   0.977   -0.02
.8                 200     10   1.000   0.999    0.00      0.975   0.958   -0.02          0.952   0.918   -0.03
                           30   1.000   0.994   -0.01      0.954   0.941   -0.01          0.981   0.973   -0.01
                           60   1.000   0.979   -0.02      0.943   0.918   -0.03          0.991   0.969   -0.02
                           90   1.000   0.952   -0.05      0.937   0.888   -0.05          0.994   0.946   -0.05
                          120   1.000   0.911   -0.09      0.934   0.846   -0.09          0.995   0.907   -0.09
                   500     10   1.000   1.000    0.00      0.975   0.959   -0.02          0.952   0.919   -0.03
                           30   1.000   0.999    0.00      0.954   0.945   -0.01          0.981   0.977    0.00
                           60   1.000   0.997    0.00      0.943   0.935   -0.01          0.991   0.987    0.00
                           90   1.000   0.993   -0.01      0.937   0.927   -0.01          0.994   0.987   -0.01
                          120   1.000   0.988   -0.01      0.934   0.920   -0.01          0.995   0.983   -0.01
                 1,000     10   1.000   1.000    0.00      0.975   0.959   -0.02          0.952   0.920   -0.03
                           30   1.000   1.000    0.00      0.954   0.946   -0.01          0.981   0.978    0.00
                           60   1.000   0.999    0.00      0.943   0.938   -0.01          0.991   0.989    0.00
                           90   1.000   0.998    0.00      0.937   0.932   -0.01          0.994   0.992    0.00
                          120   1.000   0.997    0.00      0.934   0.929   -0.01          0.995   0.992    0.00

Note. TLI = the Tucker–Lewis index; factor loadings = (standardized) factor loadings; N = sample size; p = number of observed variables; RB = relative bias.
POP indicates the population TLI; EST indicates the average sample estimates of TLI. For POP and EST, values less than 0.95 are underlined. |RB| larger than
0.10 (10%) are in boldface.

Figure 2. Effect of p on Tucker–Lewis index (TLI).

associated with a larger population TLI, ranging from .923 (p = 10) to .995 (p =
120).
In finite samples, the population values of TLI tended to be underestimated, which
was more pronounced with a smaller sample size, larger model size, and lower factor
loading. As with CFI, ANOVA results showed that the important sources of the RB
variance in estimating the population TLI included the sample size (N, η² = .20), model
size (p, η² = .12), magnitude of the factor loadings (l, η² = .08), and the two-way
interactions among the above factors (i.e., N × p, η² = .16; N × l, η² = .09; p × l, η² =
.06). It appeared that a sample of size N ≥ 500 may be required to obtain a reason-
able estimate of the population TLI (with |RB| < 10%), regardless of the level of fac-
tor loading and model size considered in the current study (i.e., p ≤ 120). It was also
noted that even when the sample size was relatively large (e.g., N = 500), by applying
the conventional cutoff, very large models (p ≥ 90) with no specification error or
minor specification errors could be rejected based on the sample TLI. For example, a
correctly specified model with p = 120, l = .40, and N = 500 would be rejected if the

Table 3. Effect of p on RMSEA.

                                Correctly specified      Misspecified dimensionality       Omitted residual correlations
Factor loadings      N      p   POP    EST    AB         POP    EST    AB       RB         POP    EST    AB       RB

.4                 200     10   0.000  0.016  0.016      0.010  0.017   0.007    0.70      0.049  0.047  -0.002   -0.04
                           30   0.000  0.017  0.017      0.009  0.019   0.010    1.11      0.015  0.022   0.007    0.47
                           60   0.000  0.026  0.026      0.008  0.027   0.019    2.38      0.007  0.027   0.020    2.86
                           90   0.000  0.033  0.033      0.008  0.034   0.026    3.25      0.005  0.033   0.028    5.60
                          120   0.000  0.040  0.040      0.007  0.041   0.034    4.86      0.004  0.040   0.036    9.00
                   500     10   0.000  0.009  0.009      0.010  0.012   0.002    0.20      0.049  0.048  -0.001   -0.02
                           30   0.000  0.007  0.007      0.009  0.011   0.002    0.22      0.015  0.016   0.001    0.07
                           60   0.000  0.009  0.009      0.008  0.012   0.004    0.50      0.007  0.012   0.005    0.71
                           90   0.000  0.012  0.012      0.008  0.014   0.006    0.75      0.005  0.013   0.008    1.60
                          120   0.000  0.014  0.014      0.007  0.016   0.009    1.29      0.004  0.014   0.010    2.50
                 1,000     10   0.000  0.006  0.006      0.010  0.010   0.000    0.00      0.049  0.048  -0.001   -0.02
                           30   0.000  0.004  0.004      0.009  0.009   0.000    0.00      0.015  0.015   0.000    0.00
                           60   0.000  0.004  0.004      0.008  0.009   0.001    0.13      0.007  0.009   0.002    0.29
                           90   0.000  0.005  0.005      0.008  0.010   0.002    0.25      0.005  0.007   0.002    0.40
                          120   0.000  0.007  0.007      0.007  0.010   0.003    0.43      0.004  0.008   0.004    1.00
.8                 200     10   0.000  0.017  0.017      0.078  0.077  -0.001   -0.01      0.120  0.120   0.000    0.00
                           30   0.000  0.017  0.017      0.056  0.059   0.003    0.05      0.037  0.041   0.004    0.11
                           60   0.000  0.026  0.026      0.044  0.051   0.007    0.16      0.018  0.032   0.014    0.78
                           90   0.000  0.033  0.033      0.037  0.050   0.013    0.35      0.012  0.035   0.023    1.92
                          120   0.000  0.040  0.040      0.033  0.052   0.019    0.58      0.009  0.041   0.032    3.56
                   500     10   0.000  0.010  0.010      0.078  0.078   0.000    0.00      0.120  0.120   0.000    0.00
                           30   0.000  0.007  0.007      0.056  0.056   0.000    0.00      0.037  0.038   0.001    0.03
                           60   0.000  0.009  0.009      0.044  0.045   0.001    0.02      0.018  0.021   0.003    0.17
                           90   0.000  0.012  0.012      0.037  0.039   0.002    0.05      0.012  0.017   0.005    0.42
                          120   0.000  0.014  0.014      0.033  0.036   0.003    0.09      0.009  0.017   0.008    0.89
                 1,000     10   0.000  0.007  0.007      0.078  0.078   0.000    0.00      0.120  0.120   0.000    0.00
                           30   0.000  0.004  0.004      0.056  0.056   0.000    0.00      0.037  0.037   0.000    0.00
                           60   0.000  0.004  0.004      0.044  0.044   0.000    0.00      0.018  0.019   0.001    0.06
                           90   0.000  0.005  0.005      0.037  0.038   0.001    0.03      0.012  0.013   0.001    0.08
                          120   0.000  0.007  0.007      0.033  0.034   0.001    0.03      0.009  0.011   0.002    0.22

Note. RMSEA = root mean square error of approximation; factor loadings = (standardized) factor loadings; N = sample size; p = number of observed variables;
RB = relative bias; AB = absolute bias. POP indicates the population RMSEA; EST indicates the average sample estimates of RMSEA. For POP and EST, values
greater than 0.06 are underlined. |RB| larger than 0.10 (10%) are in boldface.

Figure 3. Effect of p on root mean square error of approximation (RMSEA).

fixed cutoff score were applied in the strictest sense, yielding an average sample
TLI of .927. When fitting a one-factor model to two-factor data with r = .90, for p =
90, l = .40 and N = 500, the average sample TLI was .936, suggesting that the model
with the population TLI of .981 may be rejected in the sample according to the .95
cutoff.
When the sample size was small (e.g., N = 200), the population TLI tended to be
substantially underestimated (i.e., RBs were negative), especially if the model size was
large (p ≥ 90) and the magnitude of factor loadings was small (l = .40). For example,
when N = 200 and the misspecification was caused by collapsing two highly correlated
factors (r = .90) into one, for l = .40, the absolute values of RBs increased from 1% (p
= 10) to 40% (p = 120). Under the same conditions except for when l = .80, as p
increased from 10 to 120, the RB (in absolute values) increased from 2% to 9%.

Root Mean Square Error of Approximation


Table 3 and Figure 3 show the effect of p on RMSEA. For correctly specified mod-
els, the population RMSEA is a constant of .00 independent of p and l (the first
column in Table 3). Under both types of specification errors (the second and third
columns), the population RMSEA decreased as p increased. This effect of p was
more pronounced at a higher l. For example, under the condition of omitted residual
correlations, the population RMSEA decreased from .049 to .004 when p increased
from 10 to 120 (l = .40). With a higher factor loading (l = .80), the population
RMSEA decreased from .120 (p = 10) to .009 (p = 120). It was noted that the popu-
lation values for RMSEA were below the conventional cutoff value (i.e., RMSEA
≤ .06) across all conditions except for the two cases where the misspecified models
had a low number of high-quality indicators (underlined values in Table 3). That is,
when p = 10 and l = .80, the population value of RMSEA was .078 under misspeci-
fied dimensionality, and the population RMSEA was .120 under the condition of
omitted residual correlations.
ANOVA results showed that the important sources of the RB variance included
sample size (N, η² = .18), the model size (p, η² = .16), the magnitude of factor load-
ings (l, η² = .08), the interaction between sample size and the model size (N × p, η²
= .13), and the interaction between sample size and magnitude of factor loadings
(N × l, η² = .06).
Figure 3 shows that the sample estimates of RMSEA tended to be upwardly biased
across all conditions. Figure 3 also makes it clear that as p increases, the difference
between the population RMSEA and the sample average values becomes larger. For
example, under correctly specified models (N = 200, l = .40), the difference between
the sample average RMSEA and the population RMSEA increased from .016 (p =
10) to .040 (p = 120). We also observed that the sample RMSEA could be noticeably
different from their population value (with |RB| ≥ 10%, values in bold in Table 3)
when p was high (e.g., p ≥ 60), even when the sample size was reasonably large
(e.g., N = 1,000). For example, when the model was misspecified by omitting the resi-
dual covariances (with p = 120, N = 1,000, and l = .40), the average sample RMSEA
= .008, almost twice the value of the corresponding population value (i.e., .004).
However, as discussed earlier, when the population RMSEA values were near zero,
the values of RB and the |RB| ≥ 10% criterion may not be an appropriate measure of
"acceptability" of the average sample estimates. Therefore, in Table 3, we reported
the AB for estimating the population RMSEA. The largest AB observed across all
simulated conditions was .040 (i.e., correctly specified model, N = 200 and p = 120).

Discussion and Conclusion


This study explored the model size effect on three practical model fit indices. We
found that the model size (p) had an important impact on the population values of
CFI, TLI, and RMSEA. Specifically, under misspecified models, as p increased, the
population RMSEA decreased regardless of the type of model misspecification. The
findings regarding the effects of p on population RMSEA are consistent with the con-
clusions drawn by Kenny and McCoach (2003) and Savalei (2012). According to its
definition, the RMSEA penalizes model complexity by incorporating the degrees of
freedom in its formulation, and it measures the discrepancy due to approximation
per degree of freedom. Therefore, for models with a close fit, the population RMSEA
can decrease as p increases, because a higher p is typically associated with larger
degrees of freedom.
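
As a rough illustration, with the factor variance fixed to 1 a one-factor model estimates 2p parameters, so p = 10 indicators give df = 10(11)/2 − 20 = 35, whereas p = 120 give df = 120(121)/2 − 240 = 7,020; if the discrepancy F_k stayed at roughly the same value, RMSEA = sqrt(F_k/df_k) would shrink by a factor of about sqrt(7,020/35) ≈ 14.
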
For both CFI and TLI, it was interesting to note that as p increased, whether the
population CFI or TLI tended to slightly increase or decrease depended on the type of
specification error. Specifically, as p increased, the population values of the CFI and
TLI decreased in the presence of misspecified dimensionality. However, when the
models were misspecified by omitting the correlated residuals, both population CFI
and TLI increased in larger models. We believe that essentially the same explanation
provided in Kenny and McCoach (2003, p. 347) can apply to our observations: The
specified single-factor model predicts that all covariances take on the same value.
Thus, the degree of misfit can depend on the degree to which the covariances in the
true population covariance matrix differ from the mean covariances in the model-
implied covariance matrix. As p increased, the variability of the covariances in the
true population covariance matrix increased under the condition of misspecified
dimensionality. However, under the condition of omitted residual covariances, the
variability declined with p because the number of omitted residual covariances was
fixed to three regardless of the model size. As such, the effect of p on the population
CFI and TLI depended on the type of specification error.
On the other hand, for RMSEA, its population value tends to decrease as p
increases, notwithstanding the types of model misspecifications. Considering the fact
that our study worked with models of a relatively small degree of specification error,
the population RMSEA appeared to have a desirable property by producing lower
values when p ≥ 30. However, it seems that researchers may need to be cautious
when interpreting a large RMSEA while working with small models including high-
quality indicators (i.e., p = 10 and l = .80), because even the population RMSEA
was effectively above the conventional cutoff value under this situation. For exam-
ple, when three minor residual correlations were omitted and the factor loadings
were .80, the population RMSEA could range from .009 (p = 120) to .120 (p = 10).
By applying the commonly used cutoff scores, the models can achieve either an
excellent fit (i.e., RMSEA ≤ .01) or a poor fit (i.e., RMSEA ≥ .10). In fact,
researchers have recognized the sensitivity of the population RMSEA to degrees of freedom
and argued that RMSEA should not even be computed for models with low degrees
of freedom (Kenny, Kaniskan, & McCoach, 2015).
Our study also provides a better understanding of the behaviors of CFI, TLI, and
RMSEA in samples. That is, in small samples, compared with their population val-
ues, the sample RMSEA tended to be upwardly biased, and the sample CFI and TLI
were downwardly biased. Therefore, when N was low, on average, the sample esti-
mates for all three fit indices tended to suggest a worse fit (than their population val-
ues), notwithstanding the types of model (mis)specification. For all three fit indices,
the differences between the population values and the average sample estimates
increased as p increased; the differences also became more pronounced when the
standardized factor loadings were low. It was also noted that the pattern of sample
RMSEA we observed was partially different from the pattern in Kenny and McCoach
(2003), where the authors found that in correctly specified models, the sample
RMSEA could improve as the number of variables increased. We argue that one
likely reason for the reversed effect of p in Kenny and McCoach’s (2003) study is that
the range of the p manipulation in their simulation was relatively narrow (from 4 to 25).
Moreover, when fitting large SEM models (e.g., p ≥ 30) with small samples (e.g.,
N ≤ 200), disagreement between the sample CFI/TLI and RMSEA would likely be
observed. Specifically, the sample CFI and TLI could be largely downwardly biased,
even when the models were correctly specified. This is especially true when the qual-
ity of measurement was poor (i.e., the standardized factor loadings were low). For
example, depending on the number of observed variables, correctly specified models
(l = .40, N = 200) could produce an average CFI ranging from .611 to .972. It appears
that a sample of size N ≥ 500 may be required to gain relatively accurate estimates
(with |RB| < 10%) for both CFI and TLI in large models. It was also noted that when
fitting very large models (p ≥ 90) of good quality of measurement (l = .80) to a sam-
ple of small to medium size (i.e., less than 1,000), the use of sample CFI/TLI may
reject the model that is known to have a close fit in the population if the fixed cutoff
scores were applied in the strictest sense. Under such conditions (e.g., p ≥ 90 and l =
.40), it appears that a sample size of N ≥ 1,000 may be required to safely interpret
CFI and TLI.
In small samples, the average sample RMSEA tends to be upwardly biased, and
the bias increases as p increases (indicating a larger difference between the popula-
tion RMSEA and the average sample estimates). Additionally, when the number of
observed variables is high (p > 30), the sample RMSEA could be noticeably over-
estimated (with an RB ≥ 10%), even when the sample size is 1,000. As with the
sample CFI and TLI, the sample RMSEA was sensitive to model size. Nevertheless,
the average sample estimates for RMSEA were below the conventional cutoff value
(i.e., RMSEA ≤ .06) under nearly all conditions examined in our study except for
the three conditions where p = 10 and l = .80 (see the underlined values in Table 3).
As noted earlier, the |RB| ≥ 10% criterion may not be informative from a practical
viewpoint when the population parameter is zero or near zero.
Methodologists have shown that for a given level of model misspecification, poor
measurement quality is associated with better model fit (i.e., the reliability paradox;
see Hancock & Mueller, 2011). This phenomenon has been derived mathematically
or revealed at the population level (Hancock & Mueller, 2011; Heene, Hilbert,
Draxler, Ziegler, & Bühner, 2011) and also has been demonstrated based on sample
estimates using a simulation study (McNeish, An, & Hancock, 2018). The findings in
our study also showed that the reliability paradox may have operated for both popu-
lation RMSEA values and their sample estimates across all conditions of sample size
(N) and model size (p). For CFI and TLI, however, our findings showed that the
effect of measurement quality on model fit evaluations can depend on factors such as
sample size (N) and model size (p), resulting in the disappearance of the reliability
paradox under certain conditions. Specifically, sample estimates of CFI and TLI, on
average, tended to indicate worse fit under the condition of poorer measurement
quality (i.e., l = .40) when a model of large size was fit to a sample of small to
medium size (N = 200, 500).
The findings in the current study are based on the assumption that the observed
data are multivariate normally distributed. In many applications, the assumption of nor-
mal data is likely to be violated (e.g., if ordered categorical data are analyzed), where
the chi-square test statistics with robust corrections are commonly used. Previous
studies have shown that the robust chi-square test statistics could also be influenced
by the number of observed variables in the fitted model (Shi, DiStefano, McDaniel,
& Jiang, 2018; Yuan, Yang, & Jiang, 2017). For future studies, it would be interest-
ing to explore the model size effect on practical model fit indices in the presence of
nonnormal data (DiStefano, Liu, Jiang, & Shi, 2018; DiStefano, McDaniel, Zhang,
Shi, & Jiang, 2018; Maydeu-Olivares, Shi, & Rosseel, 2018). In addition, we only
included two specific types and minor levels of model misspecification. Additional
types of misspecification (e.g., omitted cross-loading values) and levels of misspeci-
fication (e.g., severely misspecified models) should be investigated in future studies,
as little is known about the effect of model size on fit indices under such situations.
In summary, our findings support the idea that the fit indices rely not only on
the model fit or misfit but also on the context of the model, such as the number of
observed variables (p). On one hand, given the same level of model misspecifica-
tion (e.g., fitting a one-factor model to the two-factor data), the population values
of the fit indices can be heavily affected by the model size. On the other hand, in
small samples (N < 500), as p increases, the estimates of the sample fit indices,
mainly CFI and TLI, are likely to be biased and yield a far worse fit than their
population values. In this sense, there are no "golden rules." In empirical studies,
researchers should consider the number of observed variables when using the
practical fit indices to assess model fit. That said, we can offer a few cautionary
remarks to the researchers evaluating models with no specification error or with
minor specification errors.

1. Regardless of the sample size, researchers should be cautious in interpreting
   RMSEA for small models (p ≤ 10), especially when the factor loadings are
   large (e.g., l = 0.80). Closer attention should also be paid when interpreting
   CFI/TLI for either small models (p ≤ 10) or very large models (p ≥ 90) with
   good quality of measurement (e.g., l = 0.80).
2. A sample of N = 200 observations only provides a reasonable estimate for
   CFI and TLI when p ≤ 30. A sample of size N ≥ 500 is generally required
   to safely use sample CFI and TLI in large models (p ≥ 60).

We hope that the results from the current study are informative to applied
researchers who work with imperfect models of various sizes.

Authors’ Note
Alberto Maydeu-Olivares is also affiliated with the University of Barcelona, Barcelona, Catalonia,
Spain.

Declaration of Conflicting Interests


The author(s) declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship,
and/or publication of this article: This work was supported by the National Research
Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No.
2017R1C1B2012424). This research was also supported by the National Science Foundation
under Grant No. SES-1659936.

Notes
1. The size of an SEM model has been indicated by different indices, including the number
of observed variables (p), the number of parameters to be estimated (q), the degrees of
freedom (df = p(p + 1)/2 − q), and the ratio of the observed variables to latent factors
(p/f). Recent studies have suggested that the number of observed variables (p) is the most
important determinant of model size effects (Moshagen, 2012; Shi, Lee, et al., 2015,
2017). Therefore, in the current study, we define large models as SEM models with many
observed indicators.
2. When the models are correctly specified, the population CFI, TLI, and RMSEA are con-
stant because F_k is zero.

ORCID iD
Taehun Lee https://orcid.org/0000-0001-8261-701X

References
Anderson, J. C., & Gerbing, D. W. (1984). The effects of sampling error on convergence,
improper solutions, and goodness-of-fit indices for maximum likelihood confirmatory
factor analysis. Psychometrika, 49, 155-173.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin,
107, 238-246.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606.
Box, G. E. P. (1979). Some problems of statistics and everyday life. Journal of the American
Statistical Association, 74(365), 1-4.
Breivik, E., & Olsson, U. H. (2001). Adding variables to improve fit: The effect of model size
on fit assessment in LISREL. In R. Cudeck, S. Du Toit & D. Sorbom (Eds.), Structural
equation modeling: Present and future (pp. 169-194). Lincolnwood, IL: Scientific Software
International.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen
& J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA:
Sage.
Costa, P., & McCrae, R. (1992). Normal personality assessment in clinical practice: The NEO
Personality Inventory. Psychological Assessment, 4(1), 5-13.
Ding, L., Velicer, W. F., & Harlow, L. L. (1995). The effects of estimation methods, number
of indicators per factor and improper solutions on structural equation modeling fit indices.
Structural Equation Modeling: A Multidisciplinary Journal, 2, 119-144.
DiStefano, C., Liu, J., Jiang, N., & Shi, D. (2018). Examination of the weighted root mean
square residual: Evidence for trustworthiness. Structural Equation Modeling: A
Multidisciplinary Journal, 25, 453-466.
DiStefano, C., McDaniel, H., Zhang, Y., Shi, D., & Jiang, Z. (2017). Fitting large factor
analysis models with ordinal data. Manuscript submitted for publication.
Hancock, G. R., & Mueller, R. O. (2010). The reviewer’s guide to quantitative methods in the
social sciences. New York, NY: Routledge.
Hancock, G. R., & Mueller, R. O. (2011). The reliability paradox in assessing structural
relations within covariance structure models. Educational and Psychological Measurement,
71, 306-324.
Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner, M. (2011). Masking misfit in
confirmatory factor analysis by increasing unique variances: A cautionary note on the
usefulness of cutoff values of fit indices. Psychological Methods, 16, 319-336.
Herzog, W., Boomsma, A., & Reinecke, S. (2007). The model-size effect on traditional and
modified tests of covariance structures. Structural Equation Modeling: A Multidisciplinary
Journal, 14, 361-390.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling: A
Multidisciplinary Journal, 6, 1-55.
Jackson, D. L., Gillaspy, J. A., & Purc-Stephenson, R. (2009). Reporting practices in
confirmatory factor analysis: An overview and some recommendations. Psychological
Methods, 14, 6-23.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor
analysis. Psychometrika, 34, 183-202.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2015). The performance of RMSEA in
models with small degrees of freedom. Sociological Methods & Research, 44, 486-507.
Kenny, D. A., & McCoach, D. B. (2003). Effect of the number of variables on measures of fit
in structural equation modeling. Structural Equation Modeling: A Multidisciplinary
Journal, 10, 333-351.
Ling, G. (2012). Why the major field test in business does not report subscores: Reliability
and construct validity evidence. ETS Research Report Series. Retrieved from https://
www.ets.org/Media/Research/pdf/RR-12-11.pdf
Lord, F., & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
MacCallum, R. C. (2003). 2001 Presidential address: Working with imperfect models.
Multivariate Behavioral Research, 38, 113-139.

Marsh, H. W., Hau, K. T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? The
number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral
Research, 33, 181-220.
Maydeu-Olivares, A. (2017). Maximum likelihood estimation of structural equation models
for continuous data: Standard errors and goodness of fit. Structural Equation Modeling: A
Multidisciplinary Journal, 24, 383-394.
Maydeu-Olivares, A., Shi, D., & Rosseel, Y. (2018). Assessing fit in structural equation
models: A Monte-Carlo evaluation of RMSEA versus SRMR confidence intervals and tests
of close fit. Structural Equation Modeling: A Multidisciplinary Journal, 25, 389-402.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that
are not missing completely at random. Psychometrika, 52, 431-462.
Muthén, L., & Muthén, B. (2002). How to use a Monte Carlo study to decide on sample size
and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9,
599-620.
McDonald, R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Lawrence Erlbaum.
McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural
equation analyses. Psychological Methods, 7, 64-82.
McNeish, D., An, J., & Hancock, G. R. (2018). The thorny relation between measurement
quality and fit index cutoffs in latent variable models. Journal of Personality Assessment,
100, 43-52.
Moshagen, M. (2012). The model size effect in SEM: Inflated goodness-of-fit statistics are due
to the size of the covariance matrix. Structural Equation Modeling: A Multidisciplinary
Journal, 19, 86-98.
Pornprasertmanit, S., Miller, P., & Schoemann, A. M. (2012). R package simsem: SIMulated
structural equation modeling. Retrieved from http://cran.r-project.org
R Development Core Team. (2015). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of
Statistical Software, 48(2), 1-36.
Saris, W. E., Satorra, A., & van der Veld, W. M. (2009). Testing structural equation models or
detection of misspecifications? Structural Equation Modeling: A Multidisciplinary Journal,
16, 561-582.
Savalei, V. (2012). The relationship between root mean square error of approximation and
model misspecification in confirmatory factor analysis models. Educational and
Psychological Measurement, 72, 910-932.
Shi, D., DiStefano, C., McDaniel, H. L., & Jiang, Z. (2018). Examining chi-square test
statistics under conditions of large model size and ordinal data. Structural Equation
Modeling: A Multidisciplinary Journal. Advance online publication.
Shi, D., Lee, T., & Terry, R. A. (2015). Abstract: Revisiting the model size effect in structural
equation modeling (SEM). Multivariate Behavioral Research, 50, 142-142.
Shi, D., Lee, T., & Terry, R. A. (2018). Revisiting the model size effect in structural equation
modeling. Structural Equation Modeling: A Multidisciplinary Journal, 25, 21-40.
Shi, D., Maydeu-Olivares, A., & DiStefano, C. (2018). The relationship between the
standardized root mean square residual and model misspecification in factor analysis
models. Multivariate Behavioral Research. Advance online publication.
Shi, D., Song, H., & Lewis, M. D. (2017). The impact of partial factorial invariance on cross-
group comparisons. Assessment. Advance online publication.
Steiger, J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH.
Evanston, IL: Systat.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation
approach. Multivariate Behavioral Research, 25, 173-180.
Steiger, J., & Lind, J. C. (1980, May). Statistically based tests for the number of common
factors. Paper presented at the Annual Spring Meeting of the Psychometric Society,
Iowa City, IA.
Tucker, L. R., & Lewis, C. (1973). The reliability coefficient for maximum likelihood factor
analysis. Psychometrika, 38, 1-10.
West, S. G., Taylor, A. B., & Wu, W. (2012). Model fit and model selection in structural
equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp.
209-231). New York, NY: Guilford Press.
Xie, J., & Bentler, P. M. (2003). Covariance structure models for gene expression microarray
data. Structural Equation Modeling: A Multidisciplinary Journal, 10, 566-582.
Yuan, K. H., Tian, Y., & Yanagihara, H. (2015). Empirical correction to the likelihood ratio
statistic for structural equation modeling with many variables. Psychometrika, 80, 379-405.
Yuan, K. H., Yang, M., & Jiang, G. (2017). Empirically corrected rescaled statistics for SEM
with small N and large p. Multivariate Behavioral Research, 52, 673-698.
