Abstract
This study investigated the effect the number of observed variables (p) has on three
structural equation modeling indices: the comparative fit index (CFI), the Tucker–
Lewis index (TLI), and the root mean square error of approximation (RMSEA). The
behaviors of the population fit indices and their sample estimates were compared
under various conditions created by manipulating the number of observed variables,
the types of model misspecification, the sample size, and the magnitude of factor load-
ings. The results showed that the effect of p on the population CFI and TLI depended
on the type of specification error, whereas a higher p was associated with lower val-
ues of the population RMSEA regardless of the type of model misspecification. In
finite samples, all three fit indices tended to yield estimates that suggested a worse fit
than their population counterparts, which was more pronounced with a smaller sam-
ple size, higher p, and lower factor loading.
Keywords
model size effect, structural equation modeling (SEM), fit indices
1 University of South Carolina, Columbia, SC, USA
2 Chung-Ang University, Seoul, South Korea

Corresponding Author:
Taehun Lee, Department of Psychology, Chung-Ang University, 84 Heukseok-Ro, Dongjak-Gu, Seoul 60974, South Korea.
Email: lee.taehun@gmail.com
2 Educational and Psychological Measurement 00(0)
The CFI (Bentler, 1990) measures the relative improvement in fit in moving from the baseline model to the postulated model. Following Bentler (1990, p. 240), the population CFI can be expressed as follows:

  CFI = 1 - F_k / F_0,

where F_k and F_0 represent the minimum of some discrepancy function for the postulated model and the baseline model, respectively. The sample CFI is estimated as follows:

  CFI = [max(χ²_0 - df_0, 0) - max(χ²_k - df_k, 0)] / max(χ²_0 - df_0, 0),

where χ²_0 and df_0 denote the chi-square statistic and degrees of freedom for the baseline model, and χ²_k and df_k represent the chi-square statistic and degrees of freedom for the postulated model, respectively. CFI is a normed fit index in the sense that it ranges between 0 and 1, with higher values indicating a better fit. The most commonly used criterion for a good fit is CFI ≥ .95 (Hu & Bentler, 1999; West et al., 2012).
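To make the computation concrete, the sample CFI can be sketched as a short function. The illustration below is not from the article (whose analyses were run in R with lavaan); it uses Python and hypothetical chi-square values chosen only for demonstration.

```python
def sample_cfi(chisq0, df0, chisqk, dfk):
    """Sample CFI from the baseline (0) and postulated (k) model chi-squares.

    Both terms are truncated at zero, which is why the index is normed to
    the [0, 1] range, with higher values indicating a better fit.
    """
    d0 = max(chisq0 - df0, 0.0)  # baseline-model misfit
    dk = max(chisqk - dfk, 0.0)  # postulated-model misfit
    return (d0 - dk) / d0 if d0 > 0 else 1.0

# Hypothetical values for illustration only:
print(round(sample_cfi(chisq0=1000.0, df0=45, chisqk=60.0, dfk=35), 3))  # 0.974
```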
The TLI (Tucker & Lewis, 1973) measures the relative reduction in misfit per degree of freedom. This index was originally proposed by Tucker and Lewis (1973) in the context of exploratory factor analysis and was later generalized to the covariance structure analysis context and labeled the nonnormed fit index by Bentler and Bonett (1980). This index is nonnormed in that its value can occasionally be negative or exceed 1. Following the expression of Bentler (1990, p. 241), the population TLI can be expressed as follows:

  TLI = 1 - (F_k / df_k) / (F_0 / df_0),

where F_0/df_0 and F_k/df_k represent the misfit per degree of freedom for the baseline model and the postulated model, respectively.

The sample estimator of TLI can be given as follows:

  TLI = (χ²_0 / df_0 - χ²_k / df_k) / (χ²_0 / df_0 - 1).

In general, TLI ≥ .95 is a commonly used cutoff criterion for goodness of fit (Hu & Bentler, 1999; West et al., 2012).
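The sample TLI can likewise be sketched as a function of the same two chi-square statistics; again, this is an illustrative Python fragment with made-up values, not code or results from the study.

```python
def sample_tli(chisq0, df0, chisqk, dfk):
    """Sample TLI (nonnormed fit index).

    Unlike the CFI there is no truncation at zero, so in finite samples
    the value can fall below 0 or exceed 1.
    """
    r0 = chisq0 / df0  # baseline-model misfit per degree of freedom
    rk = chisqk / dfk  # postulated-model misfit per degree of freedom
    return (r0 - rk) / (r0 - 1.0)

# Same hypothetical values as the CFI example:
print(round(sample_tli(chisq0=1000.0, df0=45, chisqk=60.0, dfk=35), 3))  # 0.966
```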
The RMSEA (Steiger, 1989, 1990; Steiger & Lind, 1980) measures the discrepancy due to approximation per degree of freedom as follows:

  RMSEA = √(F_k / df_k),

where F_k denotes the minimum of some discrepancy function between the population covariance matrix Σ and the model-implied covariance matrix Σ₀ for the hypothesized model. The sample estimate of RMSEA is defined as follows (Browne & Cudeck, 1993):

  RMSEA = √[ max(χ² - df, 0) / (df (N - 1)) ].

The RMSEA is a badness-of-fit measure, yielding lower values for a better fit. An RMSEA ≤ .06 could be considered acceptable (Hu & Bentler, 1999), whereas a model with an RMSEA > .10 is unworthy of serious consideration (Browne & Cudeck, 1993).
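The Browne–Cudeck sample estimate above can be sketched in the same illustrative style; the chi-square value below is hypothetical and chosen only to show the truncation at zero.

```python
import math

def sample_rmsea(chisq, df, n):
    """Sample RMSEA per Browne & Cudeck (1993); lower values mean better fit.

    The numerator is truncated at zero, so a chi-square below its degrees
    of freedom yields an RMSEA of exactly 0.
    """
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

# Hypothetical values for illustration only:
print(round(sample_rmsea(chisq=60.0, df=35, n=200), 3))  # 0.06
```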
The above three fit indices are routinely reported by SEM software (e.g., Mplus) and are used as standard tools for evaluating model fit (Hancock & Mueller, 2010; McDonald & Ho, 2002). As discussed in the model size literature (e.g., Herzog et al., 2007; Moshagen, 2012), these fit indices are also likely to be influenced by model size because they are functions of the LR chi-square statistic, which tends to be upwardly biased in large models. Typically, applied researchers rely heavily on practical fit indices rather than on the formal chi-square test when evaluating specified SEM models. For the appropriate application of practical fit indices, it is therefore essential to understand whether selected fit indices tend to increase or decrease as the model size increases.
Previous studies have shed some light on the effect of model size on practical fit
indices. Under correctly specified models, researchers have focused on the behaviors
of the fit indices in the sample.2 The results showed that increasing the number of
indicators (p) led to a decline in the average sample estimates of CFI and TLI, indi-
cating that the model fit worsens (Anderson & Gerbing, 1984; Ding, Velicer, &
Harlow, 1995; Kenny & McCoach, 2003). However, it was found that the average
RMSEA tended to decrease (i.e., indicating an improved fit) as more indicators were
added to the correctly specified models (Kenny & McCoach, 2003).
Under the conditions of misspecified models, most previous studies examined the
effect of the number of indicators on the population values of the selected practical
fit indices. Kenny and McCoach (2003) investigated the effect of the number of observed variables (p) on the population values of CFI, TLI, and RMSEA. They found that as p increased, the population RMSEA tended to decrease, indicating that the model fit improved regardless of the type of model misspecification; for CFI and TLI, the population values tended to decrease (i.e., indicating a worse fit) when the model misspecifications were introduced by fitting a single-factor model to two-factor data or by omitting cross-loadings. However, when the models were misspecified by ignoring nonzero residual correlations, the population values of CFI and TLI tended to increase (i.e., indicating a better fit) as p increased. Breivik and Olsson (2001) also found similar patterns of behavior for the population values of CFI and RMSEA. More recently, Savalei (2012) also found that
Shi et al. 5
It should be noted that the degree of model misspecification examined in the current study is substantively ignorable (see Shi, Maydeu-Olivares, & DiStefano, 2018); it is intended to simulate situations where the specified model is trivially false or the misspecified component would be uninteresting to most researchers. That is, two factors correlated at .90 cannot, or need not, be meaningfully discriminated in practice, and the precise estimation of residual correlations of size .15 could be considered meaningless or uninteresting to most researchers. It would be fair to say that either correctly specified or only slightly misspecified models should be retained and interpreted, because inferences drawn from poorly fitting models can be misleading (Saris, Satorra, & van der Veld, 2009). Therefore, in the present article, the effect of model size on the behavior of the selected fit indices was studied under conditions involving slight model misspecifications.
For model identification, the variances of the factors were set to 1.0. Other variables manipulated in the simulation are described below.

Model size: Model size was indexed by the total number of observed variables (p), which took values of 10, 30, 60, 90, and 120. When the population model had a two-factor structure, the same number of observed variables loaded on each factor.

Sample size: Sample sizes included 200, 500, and 1,000.

Levels of factor loadings: We included items with low (.40) or high (.80) factor loadings (λ), representing either weak or strong factor(s). The variances of the error terms were set to 1 - λ².
For each condition, we generated simulated data sets from which the empirical distributions of CFI, TLI, and RMSEA were obtained across 1,000 replications. All data analyses were conducted with maximum likelihood estimation using the lavaan package in R (R Development Core Team, 2015; Rosseel, 2012). All replications converged for all conditions.
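The data-generating step described above can be sketched minimally as follows. This is an illustration by analogy only (the study itself used R and lavaan, and the function below is not the authors' code): it covers the one-factor case with standardized factors and loadings, so error variances are 1 - λ² and every indicator has unit variance.

```python
import numpy as np

def generate_one_factor_data(n, p, loading, seed=0):
    """Simulate n observations on p indicators of one standardized factor.

    Each indicator is x_j = loading * eta + e_j, with Var(eta) = 1 and
    Var(e_j) = 1 - loading**2, so all indicators have unit variance and
    any pair of indicators correlates at loading**2 in the population.
    """
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal((n, 1))                        # factor scores
    errors = rng.standard_normal((n, p)) * np.sqrt(1 - loading**2)
    return loading * eta + errors

x = generate_one_factor_data(n=1000, p=10, loading=0.8)
print(x.shape)  # (1000, 10)
```

With λ = .80 the implied inter-indicator correlation is .64, which the sample correlations should approximate at this sample size.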
The behaviors of CFI, TLI, and RMSEA across the different simulation conditions are summarized in Tables 1 to 3. Specifically, for each fit index, we reported the population value and the average sample estimate and computed the relative difference (bias) between the two quantities. That is, relative bias (RB) was computed as follows:

  RB = (θ_est - θ_pop) / θ_pop,

where θ_est represents the average of the sample estimates of the fit index across the 1,000 replications and θ_pop indicates the population value of the fit index. Following recommendations from previous studies, RBs less than 10% in absolute value were considered acceptable (Muthén, Kaplan, & Hollis, 1987; Muthén & Muthén, 2002; Shi, Song, & Lewis, 2017). It should be noted that for correctly specified models the population RMSEA is zero, and RB is therefore undefined. Under such conditions, the absolute bias (AB) was instead computed as follows:

  AB = θ_est - θ_pop.

We also recognized that the values of RB can be deceptively high when the population RMSEA values are near zero (e.g., .004). We therefore reported both RB and AB for RMSEA under misspecified conditions.
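The two bias measures are simple enough to state in code. The sketch below uses one pair of values taken from Table 1 (population CFI .978, average sample CFI .588) to show the scale of RB; the functions themselves are illustrative, not from the article.

```python
def relative_bias(est_mean, pop):
    """Relative bias of the average sample estimate; undefined when pop == 0."""
    if pop == 0:
        raise ValueError("RB is undefined when the population value is zero")
    return (est_mean - pop) / pop

def absolute_bias(est_mean, pop):
    """Absolute (raw) bias; used when the population RMSEA is zero or near zero."""
    return est_mean - pop

# From Table 1 (misspecified dimensionality, lambda = .40, N = 200, p = 120):
print(f"{relative_bias(0.588, 0.978):.2f}")  # -0.40, i.e., |RB| = 40%
```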
Analyses of variance (ANOVAs) were conducted with the RB as the dependent variable and the simulation conditions (and their interactions) as independent variables. An eta-squared (η²) value above 5% was used to identify conditions that contributed sizable amounts of variability to the outcome. For visual presentation of the patterns (see Figures 1-3), we also plotted the average sample estimates of the fit indices against the model size levels for different levels of sample size (N = 200, 500, 1,000, population), factor loading (λ = .40, .80), and misspecification (no misspecification, omitted correlated residuals, and misspecified dimensionality). A horizontal line has been drawn in these figures to mark the cutoff values for CFI (.95), TLI (.95), and RMSEA (.06) suggested by Hu and Bentler (1999).
Table 1. Effect of p on CFI.

                      No misspecification    Misspecified dimensionality    Omitted residual correlations
Loadings  N      p    POP    EST    RB       POP    EST    RB               POP    EST    RB
.4        200    10   1.000  0.972  -0.03    0.994  0.965  -0.03            0.901  0.894  -0.01
                 30   1.000  0.956  -0.04    0.989  0.944  -0.05            0.973  0.935  -0.04
                 60   1.000  0.872  -0.13    0.985  0.853  -0.13            0.988  0.863  -0.13
                 90   1.000  0.751  -0.25    0.981  0.728  -0.26            0.993  0.747  -0.25
                 120  1.000  0.611  -0.39    0.978  0.588  -0.40            0.995  0.609  -0.39
          500    10   1.000  0.990  -0.01    0.994  0.984  -0.01            0.901  0.899   0.00
                 30   1.000  0.989  -0.01    0.989  0.980  -0.01            0.973  0.966  -0.01
                 60   1.000  0.979  -0.02    0.985  0.963  -0.02            0.988  0.968  -0.02
                 90   1.000  0.958  -0.04    0.981  0.938  -0.04            0.993  0.952  -0.04
                 120  1.000  0.928  -0.07    0.978  0.905  -0.07            0.995  0.924  -0.07
          1,000  10   1.000  0.995  -0.01    0.994  0.990   0.00            0.901  0.900   0.00
                 30   1.000  0.995  -0.01    0.989  0.987   0.00            0.973  0.972   0.00
                 60   1.000  0.994  -0.01    0.985  0.980  -0.01            0.988  0.983  -0.01
                 90   1.000  0.990  -0.01    0.981  0.970  -0.01            0.993  0.983  -0.01
                 120  1.000  0.983  -0.02    0.978  0.960  -0.02            0.995  0.977  -0.02
.8        200    10   1.000  0.997   0.00    0.968  0.967   0.00            0.938  0.936   0.00
                 30   1.000  0.994  -0.01    0.951  0.945  -0.01            0.980  0.975  -0.01
                 60   1.000  0.980  -0.02    0.941  0.921  -0.02            0.990  0.970  -0.02
                 90   1.000  0.954  -0.05    0.936  0.890  -0.05            0.994  0.948  -0.05
                 120  1.000  0.912  -0.09    0.932  0.849  -0.09            0.995  0.908  -0.09
          500    10   1.000  0.999   0.00    0.968  0.968   0.00            0.938  0.937   0.00
                 30   1.000  0.999   0.00    0.951  0.949   0.00            0.980  0.979   0.00
                 60   1.000  0.997   0.00    0.941  0.938   0.00            0.990  0.987   0.00
                 90   1.000  0.994  -0.01    0.936  0.929  -0.01            0.994  0.987  -0.01
                 120  1.000  0.988  -0.01    0.932  0.921  -0.01            0.995  0.984  -0.01
          1,000  10   1.000  0.999   0.00    0.968  0.968   0.00            0.938  0.937   0.00
                 30   1.000  0.999   0.00    0.951  0.950   0.00            0.980  0.980   0.00
                 60   1.000  0.999   0.00    0.941  0.940   0.00            0.990  0.990   0.00
                 90   1.000  0.998   0.00    0.936  0.934   0.00            0.994  0.992   0.00
                 120  1.000  0.997   0.00    0.932  0.930   0.00            0.995  0.993   0.00

Note. CFI = comparative fit index; Loadings = (standardized) factor loadings; N = sample size; p = number of observed variables; RB = relative bias. POP indicates the population CFI; EST indicates the average sample estimate of CFI. For POP and EST, values less than 0.95 are underlined. Values of |RB| larger than 0.10 (10%) are in boldface.
Under the condition of misspecified dimensionality, where a single-factor model was fitted to data sets generated by two-factor models with ρ = .90, as p increased from 10 to 120, the population CFI tended to decrease from .994 to .978 when λ = .40 and from .968 to .932 when λ = .80 (the second column in Table 1). Under the condition of omitted residual covariances, the population CFI tended to increase with p, from .901 to .995 when λ = .40 and from .938 to .995 when λ = .80 (the third column in Table 1). As a reminder, we intended to simulate modeling situations where the degree of specification error is relatively small. This was evidenced by the population values of CFI exceeding the Hu–Bentler cutoff (CFI ≥ .95) under most simulation conditions, with the lowest value being .901.

The results from the ANOVA showed that the important sources of variance in the (relative) biases in estimating the population CFI were, in order of η²: the sample size (N, η² = .22), the interaction between the number of observed variables and the sample size (N × p, η² = .16), and the model size (p, η² = .14), followed by the magnitude of factor loadings (λ) and its interactions with the sample size (λ × N, η² = .09) and the model size (λ × p, η² = .06).
In general, as shown in Table 1, the population CFI can be accurately approximated with a small RB (|RB| < 10%) when the sample size is large (i.e., N ≥ 500) across all conditions. On the other hand, when the sample size was small (e.g., N = 200), a noticeable RB (i.e., |RB| ≥ 10%; see values in bold in Table 1) can occur when the model size is large (p ≥ 60) and the quality of measurement is mediocre (λ = .40). In addition, the effect of p became more conspicuous when the magnitude of factor loadings was small (λ = .40). For example, when fitting correctly specified models with N = 200 and λ = .40, the RBs (in absolute value) increased from 3% to 39% as p increased from 10 to 120. However, if the magnitude of factor loadings was large (λ = .80), the increase in RB (in absolute value) was smaller, ranging from 0% (p = 10) to 9% (p = 120).
Because the population CFI tended to be underestimated across all conditions, models with no specification error or with minor specification errors can be rejected if evaluated solely on the basis of their sample CFI. The results presented in Table 1 suggest that rejecting models whose population CFI values indicate a well-fitting model is indeed likely to occur when the model size is very large (e.g., p ≥ 90; see the underlined values in Table 1). For example, under correctly specified models, when the factor loadings were .40 and N = 200, as p increased from 10 to 120, the average estimate of CFI changed from .972 (a closely fitting model) to .611 (a poorly fitting model). A similar pattern was observed when collapsing two factors with ρ = .90: when p increased from 10 to 120, the average CFI dropped from .965 to .588, leading to different conclusions in terms of model fit. When the misspecification was caused by omitting residual correlations, the average CFI initially increased as more observed variables were added to the fitted model; but as p continued to increase, the average CFI eventually decreased. Taking N = 200 and factor loadings of .80 as an example, when p increased from 10 to 30, the mean CFI increased from .936 to .975; the average CFI then dropped to .948 (p = 90) and finally reached its lowest value of .908 (p = 120).
Tucker–Lewis Index
As shown in Table 2 and Figure 2, the behaviors of both the population values of TLI and their sample estimates are virtually indistinguishable from the patterns of CFI in large models. For correctly specified models, the population TLI is a constant 1.00, regardless of the model size (the first column in Table 2). Under the condition of misspecified dimensionality, the population TLI tended to decrease as p increased (the second column in Table 2), whereas the population TLI tended to increase when three residual covariances were omitted (the third column in Table 2). Taking the conditions with λ = .40 as an example, when one-factor models were fitted to two-factor data with ρ = .90, as p increased from 10 to 120, the population TLI slightly decreased from .995 to .978. When omitting residual correlations, a higher p was
Table 2. Effect of p on TLI.

                      No misspecification    Misspecified dimensionality    Omitted residual correlations
Loadings  N      p    POP    EST    RB       POP    EST    RB               POP    EST    RB
.4        200    10   1.000  0.994  -0.01    0.995  0.985  -0.01            0.923  0.865  -0.06
                 30   1.000  0.958  -0.04    0.990  0.944  -0.05            0.975  0.931  -0.04
                 60   1.000  0.867  -0.13    0.985  0.848  -0.14            0.989  0.858  -0.13
                 90   1.000  0.745  -0.26    0.981  0.722  -0.26            0.993  0.741  -0.25
                 120  1.000  0.605  -0.40    0.978  0.581  -0.40            0.995  0.603  -0.39
          500    10   1.000  0.999   0.00    0.995  0.990  -0.01            0.923  0.871  -0.05
                 30   1.000  0.992  -0.01    0.990  0.980  -0.01            0.975  0.964  -0.01
                 60   1.000  0.978  -0.02    0.985  0.962  -0.02            0.989  0.967  -0.02
                 90   1.000  0.957  -0.04    0.981  0.936  -0.05            0.993  0.950  -0.04
                 120  1.000  0.927  -0.07    0.978  0.903  -0.08            0.995  0.922  -0.07
          1,000  10   1.000  0.999   0.00    0.995  0.991   0.00            0.923  0.872  -0.05
                 30   1.000  0.998   0.00    0.990  0.986   0.00            0.975  0.969  -0.01
                 60   1.000  0.995  -0.01    0.985  0.979  -0.01            0.989  0.983  -0.01
                 90   1.000  0.990  -0.01    0.981  0.970  -0.01            0.993  0.982  -0.01
                 120  1.000  0.982  -0.02    0.978  0.959  -0.02            0.995  0.977  -0.02
.8        200    10   1.000  0.999   0.00    0.975  0.958  -0.02            0.952  0.918  -0.03
                 30   1.000  0.994  -0.01    0.954  0.941  -0.01            0.981  0.973  -0.01
                 60   1.000  0.979  -0.02    0.943  0.918  -0.03            0.991  0.969  -0.02
                 90   1.000  0.952  -0.05    0.937  0.888  -0.05            0.994  0.946  -0.05
                 120  1.000  0.911  -0.09    0.934  0.846  -0.09            0.995  0.907  -0.09
          500    10   1.000  1.000   0.00    0.975  0.959  -0.02            0.952  0.919  -0.03
                 30   1.000  0.999   0.00    0.954  0.945  -0.01            0.981  0.977   0.00
                 60   1.000  0.997   0.00    0.943  0.935  -0.01            0.991  0.987   0.00
                 90   1.000  0.993  -0.01    0.937  0.927  -0.01            0.994  0.987  -0.01
                 120  1.000  0.988  -0.01    0.934  0.920  -0.01            0.995  0.983  -0.01
          1,000  10   1.000  1.000   0.00    0.975  0.959  -0.02            0.952  0.920  -0.03
                 30   1.000  1.000   0.00    0.954  0.946  -0.01            0.981  0.978   0.00
                 60   1.000  0.999   0.00    0.943  0.938  -0.01            0.991  0.989   0.00
                 90   1.000  0.998   0.00    0.937  0.932  -0.01            0.994  0.992   0.00
                 120  1.000  0.997   0.00    0.934  0.929  -0.01            0.995  0.992   0.00

Note. TLI = the Tucker–Lewis index; Loadings = (standardized) factor loadings; N = sample size; p = number of observed variables; RB = relative bias. POP indicates the population TLI; EST indicates the average sample estimate of TLI. For POP and EST, values less than 0.95 are underlined. Values of |RB| larger than 0.10 (10%) are in boldface.
associated with a larger population TLI, ranging from .923 (p = 10) to .995 (p = 120).

In finite samples, the population values of TLI tended to be underestimated, and this tendency was more pronounced with a smaller sample size, a larger model size, and lower factor loadings. As with CFI, ANOVA results showed that the important sources of the RB variance in estimating the population TLI included the sample size (N, η² = .20), the model size (p, η² = .12), the magnitude of the factor loadings (λ, η² = .08), and the two-way interactions among these factors (i.e., N × p, η² = .16; N × λ, η² = .09; p × λ, η² = .06). It appeared that a sample size of N ≥ 500 may be required to obtain a reasonable estimate of the population TLI (with |RB| < 10%), regardless of the level of factor loading and model size considered in the current study (i.e., p ≤ 120). It was also noted that even when the sample size was relatively large (e.g., N = 500), by applying the conventional cutoff, very large models (p ≥ 90) with no specification error or minor specification errors could be rejected based on the sample TLI. For example, a correctly specified model with p = 120, λ = .40, and N = 500 would be rejected if the
Table 3. Effect of p on RMSEA.

                      No misspecification    Misspecified dimensionality      Omitted residual correlations
Loadings  N      p    POP    EST    AB       POP    EST    AB      RB         POP    EST    AB      RB
.4        200    10   0.000  0.016  0.016    0.010  0.017  0.007   0.70       0.049  0.047  -0.002  -0.04
                 30   0.000  0.017  0.017    0.009  0.019  0.010   1.11       0.015  0.022   0.007   0.47
                 60   0.000  0.026  0.026    0.008  0.027  0.019   2.38       0.007  0.027   0.020   2.86
                 90   0.000  0.033  0.033    0.008  0.034  0.026   3.25       0.005  0.033   0.028   5.60
                 120  0.000  0.040  0.040    0.007  0.041  0.034   4.86       0.004  0.040   0.036   9.00
          500    10   0.000  0.009  0.009    0.010  0.012  0.002   0.20       0.049  0.048  -0.001  -0.02
                 30   0.000  0.007  0.007    0.009  0.011  0.002   0.22       0.015  0.016   0.001   0.07
                 60   0.000  0.009  0.009    0.008  0.012  0.004   0.50       0.007  0.012   0.005   0.71
                 90   0.000  0.012  0.012    0.008  0.014  0.006   0.75       0.005  0.013   0.008   1.60
                 120  0.000  0.014  0.014    0.007  0.016  0.009   1.29       0.004  0.014   0.010   2.50
          1,000  10   0.000  0.006  0.006    0.010  0.010  0.000   0.00       0.049  0.048  -0.001  -0.02
                 30   0.000  0.004  0.004    0.009  0.009  0.000   0.00       0.015  0.015   0.000   0.00
                 60   0.000  0.004  0.004    0.008  0.009  0.001   0.13       0.007  0.009   0.002   0.29
                 90   0.000  0.005  0.005    0.008  0.010  0.002   0.25       0.005  0.007   0.002   0.40
                 120  0.000  0.007  0.007    0.007  0.010  0.003   0.43       0.004  0.008   0.004   1.00
.8        200    10   0.000  0.017  0.017    0.078  0.077  -0.001  -0.01      0.120  0.120   0.000   0.00
                 30   0.000  0.017  0.017    0.056  0.059  0.003   0.05       0.037  0.041   0.004   0.11
                 60   0.000  0.026  0.026    0.044  0.051  0.007   0.16       0.018  0.032   0.014   0.78
                 90   0.000  0.033  0.033    0.037  0.050  0.013   0.35       0.012  0.035   0.023   1.92
                 120  0.000  0.040  0.040    0.033  0.052  0.019   0.58       0.009  0.041   0.032   3.56
          500    10   0.000  0.010  0.010    0.078  0.078  0.000   0.00       0.120  0.120   0.000   0.00
                 30   0.000  0.007  0.007    0.056  0.056  0.000   0.00       0.037  0.038   0.001   0.03
                 60   0.000  0.009  0.009    0.044  0.045  0.001   0.02       0.018  0.021   0.003   0.17
                 90   0.000  0.012  0.012    0.037  0.039  0.002   0.05       0.012  0.017   0.005   0.42
                 120  0.000  0.014  0.014    0.033  0.036  0.003   0.09       0.009  0.017   0.008   0.89
          1,000  10   0.000  0.007  0.007    0.078  0.078  0.000   0.00       0.120  0.120   0.000   0.00
                 30   0.000  0.004  0.004    0.056  0.056  0.000   0.00       0.037  0.037   0.000   0.00
                 60   0.000  0.004  0.004    0.044  0.044  0.000   0.00       0.018  0.019   0.001   0.06
                 90   0.000  0.005  0.005    0.037  0.038  0.001   0.03       0.012  0.013   0.001   0.08
                 120  0.000  0.007  0.007    0.033  0.034  0.001   0.03       0.009  0.011   0.002   0.22

Note. RMSEA = root mean square error of approximation; Loadings = (standardized) factor loadings; N = sample size; p = number of observed variables; RB = relative bias; AB = absolute bias. POP indicates the population RMSEA; EST indicates the average sample estimate of RMSEA. For POP and EST, values greater than 0.06 are underlined. Values of |RB| larger than 0.10 (10%) are in boldface.
fixed cutoff score were applied in the strictest sense, as it yielded an average sample TLI of .927. When fitting a one-factor model to two-factor data with ρ = .90, for p = 90, λ = .40, and N = 500, the average sample TLI was .936, suggesting that a model with a population TLI of .981 may be rejected in the sample according to the .95 cutoff.
When the sample size was small (e.g., N = 200), the population TLI tended to be substantially underestimated (i.e., the RBs were negative), especially if the model size was large (p ≥ 90) and the magnitude of factor loadings was small (λ = .40). For example, when N = 200 and the misspecification was caused by collapsing two highly correlated factors (ρ = .90) into one, for λ = .40, the absolute values of the RBs increased from 1% (p = 10) to 40% (p = 120). Under the same conditions except that λ = .80, as p increased from 10 to 120, the RB (in absolute value) increased from 2% to 9%.
Root Mean Square Error of Approximation

For correctly specified models, the population RMSEA was zero regardless of the model size (the first column in Table 3). Under both types of specification errors (the second and third columns), the population RMSEA decreased as p increased. This effect of p was more pronounced at a higher λ. For example, under the condition of omitted residual correlations, the population RMSEA decreased from .049 to .004 when p increased from 10 to 120 (λ = .40). With a higher factor loading (λ = .80), the population RMSEA decreased from .120 (p = 10) to .009 (p = 120). It was noted that the population values of RMSEA were below the conventional cutoff value (i.e., RMSEA ≤ .06) across all conditions except for the two cases where the misspecified models had a low number of high-quality indicators (underlined values in Table 3). That is, when p = 10 and λ = .80, the population value of RMSEA was .078 under misspecified dimensionality and .120 under the condition of omitted residual correlations.
ANOVA results showed that the important sources of the RB variance included the sample size (N, η² = .18), the model size (p, η² = .16), the magnitude of factor loadings (λ, η² = .08), the interaction between sample size and model size (N × p, η² = .13), and the interaction between sample size and the magnitude of factor loadings (N × λ, η² = .06).
Figure 3 shows that the sample estimates of RMSEA tended to be upwardly biased across all conditions. Figure 3 also makes it clear that as p increases, the difference between the population RMSEA and the average sample value becomes larger. For example, under correctly specified models (N = 200, λ = .40), the difference between the average sample RMSEA and the population RMSEA increased from .016 (p = 10) to .040 (p = 120). We also observed that the sample RMSEA could be noticeably different from its population value (with |RB| ≥ 10%; values in bold in Table 3) when p was high (e.g., p ≥ 60), even when the sample size was reasonably large (e.g., N = 1,000). For example, when the model was misspecified by omitting the residual covariances (with p = 120, N = 1,000, and λ = .40), the average sample RMSEA was .008, almost twice the corresponding population value (i.e., .004). However, as discussed earlier, when the population RMSEA values were near zero, the RB values and the |RB| ≥ 10% criterion may not be an appropriate measure of the "acceptability" of the average sample estimates. Therefore, in Table 3, we also reported the AB for estimating the population RMSEA. The largest AB observed across all simulated conditions was .040 (i.e., a correctly specified model with N = 200 and p = 120).
standardized factor loadings were low. It was also noted that the pattern of sample RMSEA we observed was partially different from the pattern in Kenny and McCoach (2003), where the authors found that in correctly specified models the sample RMSEA could improve as the number of variables increased. We argue that one likely reason for the reversed effect of p in Kenny and McCoach's (2003) study is that the range of the p manipulation in their simulation was relatively narrow (from 4 to 25).
Moreover, when fitting large SEM models (e.g., p ≥ 30) with small samples (e.g., N = 200), disagreement between the sample CFI/TLI and RMSEA would likely be observed. Specifically, the sample CFI and TLI could be substantially downwardly biased even when the models were correctly specified. This is especially true when the quality of measurement was poor (i.e., the standardized factor loadings were low). For example, depending on the number of observed variables, correctly specified models (λ = .40, N = 200) could produce an average CFI ranging from .611 to .972. It appears that a sample size of N ≥ 500 may be required to obtain relatively accurate estimates (with |RB| < 10%) for both CFI and TLI in large models. It was also noted that when fitting very large models (p ≥ 90) with good measurement quality (λ = .80) to a sample of small to medium size (i.e., less than 1,000), the use of the sample CFI/TLI may reject a model that is known to have a close fit in the population if the fixed cutoff scores were applied in the strictest sense. Under such conditions (e.g., p ≥ 90 and λ = .40), it appears that a sample size of N ≥ 1,000 may be required to safely interpret CFI and TLI.
In small samples, the average sample RMSEA tends to be upwardly biased, and the bias increases as p increases (indicating a larger difference between the population RMSEA and the average sample estimate). Additionally, when the number of observed variables is high (p > 30), the sample RMSEA could be noticeably overestimated (with an RB ≥ 10%), even when the sample size is 1,000. As with the sample CFI and TLI, the sample RMSEA was sensitive to model size. Nevertheless, the average sample estimates of RMSEA were below the conventional cutoff value (i.e., RMSEA ≤ .06) under nearly all conditions examined in our study, except for the three conditions where p = 10 and λ = .80 (see the underlined values in Table 3). As noted earlier, the |RB| ≥ 10% criterion may not be informative from a practical viewpoint when the population parameter is zero or near zero.
Methodologists have shown that for a given level of model misspecification, poorer measurement quality is associated with better model fit (i.e., the reliability paradox; see Hancock & Mueller, 2011). This phenomenon has been derived mathematically or revealed at the population level (Hancock & Mueller, 2011; Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011) and has also been demonstrated with sample estimates in simulation studies (McNeish, An, & Hancock, 2018). The findings of our study likewise showed that the reliability paradox may have operated for both the population RMSEA values and their sample estimates across all conditions of sample size (N) and model size (p). For CFI and TLI, however, our findings showed that the effect of measurement quality on model fit evaluation can depend on factors such as sample size (N) and model size (p), resulting in the disappearance of the reliability paradox under certain conditions. Specifically, sample estimates of CFI and TLI, on average, tended to indicate a worse fit under the condition of poorer measurement quality (i.e., λ = .40) when a large model was fit to a sample of small to medium size (N = 200, 500).
The findings of the current study are based on the assumption that the observed data are multivariate normally distributed. In many applications, the assumption of normal data is likely to be violated (e.g., if ordered categorical data are analyzed), in which case chi-square test statistics with robust corrections are commonly used. Previous studies have shown that robust chi-square test statistics can also be influenced by the number of observed variables in the fitted model (Shi, DiStefano, McDaniel, & Jiang, 2018; Yuan, Yang, & Jiang, 2017). For future studies, it would be interesting to explore the model size effect on practical model fit indices in the presence of nonnormal data (DiStefano, Liu, Jiang, & Shi, 2018; DiStefano, McDaniel, Zhang, Shi, & Jiang, 2018; Maydeu-Olivares, Shi, & Rosseel, 2018). In addition, we included only two specific types and minor levels of model misspecification. Additional types of misspecification (e.g., omitted cross-loadings) and levels of misspecification (e.g., severely misspecified models) should be investigated in future studies, as little is known about the effect of model size on fit indices under such situations.
In summary, our findings support the idea that the fit indices rely not only on
the model fit or misfit but also on the context of the model, such as the number of
observed variables (p). On one hand, given the same level of model misspecifica-
tion (e.g., fitting a one-factor model to the two-factor data), the population values
of the fit indices can be heavily affected by the model size. On the other hand, in
small samples (N < 500), as p increases, the estimates of the sample fit indices,
mainly CFI and TLI, are likely to be biased and yield a far worse fit than their
population values. In this sense, there are no "golden rules." In empirical studies,
researchers should consider the number of observed variables when using the
practical fit indices to assess model fit. That said, we can offer a few cautionary
remarks to researchers evaluating models with no specification error or with
minor specification errors.
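The sensitivity of the sample estimates to N and p can be traced to how the indices are computed from the chi-square statistics of the target and baseline models. The sketch below is our own illustration of the standard formulas, not code from the study; function names and the numeric inputs are hypothetical, and the N − 1 scaling of RMSEA follows Browne and Cudeck (1993), whereas some software uses N instead:

```python
import math

def rmsea(chisq, df, n):
    """Root mean square error of approximation from a model chi-square."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def cfi(chisq_m, df_m, chisq_b, df_b):
    """Comparative fit index: improvement over the baseline (independence) model."""
    d_m = max(chisq_m - df_m, 0.0)              # noncentrality of target model
    d_b = max(chisq_b - df_b, d_m)              # noncentrality of baseline model
    return 1.0 - d_m / d_b

def tli(chisq_m, df_m, chisq_b, df_b):
    """Tucker-Lewis index, based on chi-square/df ratios of the two models."""
    return ((chisq_b / df_b) - (chisq_m / df_m)) / ((chisq_b / df_b) - 1.0)

# Purely illustrative chi-square values for a target and a baseline model:
print(round(cfi(150, 100, 2000, 120), 3))    # → 0.973
print(round(tli(150, 100, 2000, 120), 3))    # → 0.968
print(round(rmsea(150, 100, 500), 3))        # → 0.032
```

Because the model chi-square itself drifts upward in small samples with large p, any index built from it inherits that bias, which is one way to read the simulation results above.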
We hope that the results from the current study are informative to applied
researchers who work with imperfect models of various sizes.
Authors’ Note
Alberto Maydeu-Olivares is also affiliated to University of Barcelona, Barcelona, Catalonia,
Spain.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship,
and/or publication of this article: This work was supported by the National Research
Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No.
2017R1C1B2012424). This research was also supported by the National Science Foundation
under Grant No. SES-1659936.
Notes
1. The size of an SEM model has been indicated by different indices, including the number
of observed variables (p), the number of parameters to be estimated (q), the degrees of
freedom (df = p(p + 1)/2 − q), and the ratio of the observed variables to latent factors
(p/f). Recent studies have suggested that the number of observed variables (p) is the most
important determinant of model size effects (Moshagen, 2012; Shi, Lee, et al., 2015,
2017). Therefore, in the current study, we define large models as SEM models with many
observed indicators.
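To see why p dominates the other indices of model size, note that the degrees of freedom grow quadratically in p but only linearly in q. The sketch below is our own illustration (not code from the study), using a one-factor model with the factor variance fixed at 1, so each added indicator contributes one loading and one unique variance (q = 2p):

```python
def cfa_df(p, q):
    """df = unique elements of the p x p covariance matrix minus free parameters."""
    return p * (p + 1) // 2 - q

# One-factor model with factor variance fixed to 1: q = 2p free parameters.
for p in (10, 20, 30):
    print(p, cfa_df(p, q=2 * p))   # df: 35, 170, 405 -- quadratic growth in p
```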
2. When the models are correctly specified, the population CFI, TLI, and RMSEA are con-
stant because FK is zero.
ORCID iD
Taehun Lee https://orcid.org/0000-0001-8261-701X
References
Anderson, J. C., & Gerbing, D. W. (1984). The effects of sampling error on convergence,
improper solutions, and goodness of fit indices for maximum likelihood confirmatory
factor analysis. Psychometrika, 49, 155-173.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin,
107, 238-246.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606.
Box, G. E. P. (1979). Some problems of statistics and everyday life. Journal of the American
Statistical Association, 74(365), 1-4.
Breivik, E., & Olsson, U. H. (2001). Adding variables to improve fit: The effect of model size
on fit assessment in LISREL. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural
equation modeling: Present and future (pp. 169-194). Lincolnwood, IL: Scientific Software
International.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen
& J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA:
Sage.
Costa, P., & McCrae, R. (1992). Normal personality assessment in clinical practice: The NEO
Personality Inventory. Psychological Assessment, 4(1), 5-13.
Ding, L., Velicer, W. F., & Harlow, L. L. (1995). The effects of estimation methods, number
of indicators per factor and improper solutions on structural equation modeling fit indices.
Structural Equation Modeling: A Multidisciplinary Journal, 2, 119-144.
DiStefano, C., Liu, J., Jiang, N., & Shi, D. (2018). Examination of the weighted root mean
square residual: Evidence for trustworthiness. Structural Equation Modeling: A
Multidisciplinary Journal, 25, 453-466.
DiStefano, C., McDaniel, H., Zhang, Y., Shi, D., & Jiang, Z. (2017). Fitting large factor
analysis models with ordinal data. Manuscript submitted for publication.
Hancock, G. R., & Mueller, R. O. (2010). The reviewer’s guide to quantitative methods in the
social sciences. New York, NY: Routledge.
Hancock, G. R., & Mueller, R. O. (2011). The reliability paradox in assessing structural
relations within covariance structure models. Educational and Psychological Measurement,
71, 306-324.
Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner, M. (2011). Masking misfit in
confirmatory factor analysis by increasing unique variances: A cautionary note on the
usefulness of cutoff values of fit indices. Psychological Methods, 16, 319-336.
Herzog, W., Boomsma, A., & Reinecke, S. (2007). The model-size effect on traditional and
modified tests of covariance structures. Structural Equation Modeling: A Multidisciplinary
Journal, 14, 361-390.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling: A
Multidisciplinary Journal, 6, 1-55.
Jackson, D. L., Gillaspy, J. A., & Purc-Stephenson, R. (2009). Reporting practices in
confirmatory factor analysis: An overview and some recommendations. Psychological
Methods, 14, 6-23.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor
analysis. Psychometrika, 34, 183-202.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2015). The performance of RMSEA in
models with small degrees of freedom. Sociological Methods & Research, 44, 486-507.
Kenny, D. A., & McCoach, D. B. (2003). Effect of the number of variables on measures of fit
in structural equation modeling. Structural Equation Modeling: A Multidisciplinary
Journal, 10, 333-351.
Ling, G. (2012). Why the major field test in business does not report subscores: Reliability
and construct validity evidence. ETS Research Report Series. Retrieved from https://
www.ets.org/Media/Research/pdf/RR-12-11.pdf
Lord, F., & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
MacCallum, R. C. (2003). 2001 Presidential address: Working with imperfect models.
Multivariate Behavioral Research, 38, 113-139.
Marsh, H. W., Hau, K. T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? The
number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral
Research, 33, 181-220.
Maydeu-Olivares, A. (2017). Maximum likelihood estimation of structural equation models
for continuous data: Standard errors and goodness of fit. Structural Equation Modeling: A
Multidisciplinary Journal, 24, 383-394.
Maydeu-Olivares, A., Shi, D., & Rosseel, Y. (2018). Assessing fit in structural equation
models: A Monte-Carlo evaluation of RMSEA versus SRMR confidence intervals and tests
of close fit. Structural Equation Modeling: A Multidisciplinary Journal, 25, 389-402.
McDonald, R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Lawrence Erlbaum.
McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural
equation analyses. Psychological Methods, 7, 64-82.
McNeish, D., An, J., & Hancock, G. R. (2018). The thorny relation between measurement
quality and fit index cutoffs in latent variable models. Journal of Personality Assessment,
100, 43-52.
Moshagen, M. (2012). The model size effect in SEM: Inflated goodness-of-fit statistics are due
to the size of the covariance matrix. Structural Equation Modeling: A Multidisciplinary
Journal, 19, 86-98.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that
are not missing completely at random. Psychometrika, 52, 431-462.
Muthén, L., & Muthén, B. (2002). How to use a Monte Carlo study to decide on sample size
and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9,
599-620.
Pornprasertmanit, S., Miller, P., & Schoemann, A. M. (2012). simsem: SIMulated
structural equation modeling. Retrieved from http://cran.r-project.org
R Development Core Team. (2015). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of
Statistical Software, 48(2), 1-36.
Saris, W. E., Satorra, A., & van der Veld, W. M. (2009). Testing structural equation models or
detection of misspecifications? Structural Equation Modeling: A Multidisciplinary Journal,
16, 561-582.
Savalei, V. (2012). The relationship between root mean square error of approximation and
model misspecification in confirmatory factor analysis models. Educational and
Psychological Measurement, 72, 910-932.
Shi, D., DiStefano, C., McDaniel, H. L., & Jiang, Z. (2018). Examining chi-square test
statistics under conditions of large model size and ordinal data. Structural Equation
Modeling: A Multidisciplinary Journal. Advance online publication.
Shi, D., Lee, T., & Terry, R. A. (2015). Abstract: Revisiting the model size effect in structural
equation modeling (SEM). Multivariate Behavioral Research, 50, 142-142.
Shi, D., Lee, T., & Terry, R. A. (2018). Revisiting the model size effect in structural equation
modeling. Structural Equation Modeling: A Multidisciplinary Journal, 25, 21-40.
Shi, D., Maydeu-Olivares, A., & DiStefano, C. (2018). The relationship between the
standardized root mean square residual and model misspecification in factor analysis
models. Multivariate Behavioral Research. Advance online publication.
Shi, D., Song, H., & Lewis, M. D. (2017). The impact of partial factorial invariance on cross-
group comparisons. Assessment. Advance online publication.