Abstract
In the field of international educational surveys, equivalence of achievement scale
scores across countries has received substantial attention in the academic literature;
however, an emphasis on scale score equivalence in nonachievement education surveys has emerged only relatively recently. Reflecting the current state of research on multiple-group models, these recent measurement invariance investigations were supported by research that was limited in scope to few groups and relatively small sample sizes. To that end, this study uses data from one large-scale survey as a
basis for examining the extent to which typical fit measures used in multiple-group
confirmatory factor analysis are suitable for detecting measurement invariance in a
large-scale survey context. Using measures validated in a smaller scale context and an
empirically grounded simulation study, our findings indicate that many typical mea-
sures and associated criteria are either unsuitable in a large group and varied sample-
size context or should be adjusted, particularly when the number of groups is large.
We provide specific recommendations and discuss further areas for research.
Keywords
measurement equivalence/invariance, international studies, model fit, multiple groups
1 Indiana University, Bloomington, IN, USA
Corresponding Author:
Leslie Rutkowski, Department of Counseling and Educational Psychology, Indiana University, 201 N. Rose
Avenue, Bloomington, IN 47405, USA.
Email: lrutkows@indiana.edu
Downloaded from epm.sagepub.com at Univ of Connecticut / Health Center / Library on May 22, 2015
32 Educational and Psychological Measurement 74(1)
Rutkowski and Svetina 33
Background
As international educational surveys have grown in number and in numbers of parti-
cipants, the psychometric and cross-cultural scholarly communities have noted
numerous methodological challenges and opportunities that arise in this particular
context. To point to just a few, issues of instrument adaptation (Hambleton, 2002),
scale score comparability (Oliveri & von Davier, 2011), and even defining ‘‘culture’’
(LeTendre, 2002) are manifold when dozens of countries—with different languages
and cultures—participate. From a psychometric perspective, even on a short scale
the number of possible pairwise, cross-country differences on any parameter can be
large. For example, for a four-item scale in which only cross-country differences in the factor loadings are under consideration, the number of possible pairwise differences is 3 × n(n − 1)/2, where n is the number of countries.1 For 10 countries, this is 3 × 45 = 135; for 20 countries, it is 3 × 190 = 570.
Additionally, sample sizes in international educational surveys are typically in the
thousands. Taken together, these characteristics render the probability of the chi-
square goodness-of-fit test detecting misfit somewhere in the model quite high
(Bentler & Bonett, 1980).
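The combinatorics above can be sketched in a few lines; the function name and the default of three free loadings (a four-item scale with one loading fixed for identification) are our own illustrative choices:

```python
from math import comb

def n_pairwise_differences(n_countries: int, n_free_loadings: int = 3) -> int:
    """Count the possible pairwise cross-country differences on the free
    factor loadings: one comparison per country pair, per free loading."""
    return comb(n_countries, 2) * n_free_loadings

print(n_pairwise_differences(10))  # 45 pairs x 3 loadings = 135
print(n_pairwise_differences(20))  # 190 pairs x 3 loadings = 570
```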
Given the wealth of studies that examine the performance of multiple-group confirmatory factor analysis (MG-CFA) in the two-group case (e.g., Chen, 2007; Cheung & Rensvold, 2002; French & Finch, 2006;
Meade et al., 2008; Meade & Lautenschlager, 2004), we do not delve deeply into the
details surrounding MG-CFA. Instead, we briefly discuss the method in general,
summarize the findings from current research in the field, and describe the approach
used in TALIS, which we consider in the current article. Limiting our empirical
focus provides us with a platform from which to more broadly consider the method
of investigating cultural equivalence in a large-group, varied sample size context.
Furthermore, the scope and scale of TALIS is general enough for the findings to
appeal to a wide audience. Finally, the method adopted by TALIS researchers is one
generally recommended in the literature and used more broadly by researchers who
are interested in providing some evidence of the scale score validity of a measure in
multiple populations.
Cultural/Measurement Equivalence/Invariance
With the growth of cross-cultural studies such as TIMSS and TALIS, an interest in
analyzing these types of data has also burgeoned in recent years. This assertion is
supported by a June 2013 search for scholarly (peer reviewed) journal articles on the
Academic Search Premier database using the Boolean search string ‘‘(cultural or
measurement) and (invariance or equivalence).’’ This search resulted in 40 articles
published from 1980 to 1989, 210 articles published from 1990 to 1999, and an
incredible 2,545 articles published from 2000 to 2009. This search includes articles
undertaking within-country comparisons (e.g., evaluating measurement equivalence
by gender or Mexican compared with European Americans). On the other hand, this
search might well omit inquiries into broader model invariance, including general
latent variable model invariance. The example nevertheless underscores the scale
and growth of research concerned with this method.
Generally, investigations of measurement invariance focus on the degree to which
comparisons on the latent variable of interest (e.g., teacher beliefs) can be validly
compared across heterogeneous populations. Although approaches vary (Schmitt &
Kuljanin, 2008; Vandenberg & Lance, 2000) and can depend on the research ques-
tion (Bollen, 1989), the influential work by Horn, McArdle, and colleagues (Horn &
McArdle, 1992; Horn, McArdle, & Mason, 1983) and Meredith (1993) guides much
of the current applied measurement invariance work. We begin with the general factor model given by Σ = Λ_x Φ Λ_x′ + Θ_δ, where Σ represents the covariance matrix of the observed variables and Λ_x represents the matrix of factor loadings that relate the vector of latent variables, ξ, with associated covariance matrix Φ, to the vector of observed variables, X. Finally, Θ_δ represents the covariance matrix of the measurement errors for X. Typically, although not necessarily, Θ_δ is assumed to be diagonal, implying no correlated measurement errors. Not evident in the covariance representation of this model is the mean structure, which also figures importantly into investigations of measurement invariance. In the common factor model, we can represent the mean structure by including a vector of intercepts in the system of equations, denoted by τ_x. Then, the mean of the observed variables can be represented by E(X) = E(τ_x + Λ_x ξ + δ). Under typical assumptions (i.e., E(δ) = 0; E(ξ) = κ = 0), E(X) = τ_x. We can generalize this model to the multiple-population context by allowing for a separate covariance matrix of observed variables for each population, that is, Σ^(g), and a separate mean structure, τ_x^(g), g = 1, . . ., G.
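As a concrete sketch of these definitions, the model-implied moments for one group can be computed directly; all parameter values below are hypothetical, chosen only to illustrate the structure:

```python
import numpy as np

# Hypothetical one-factor, four-indicator parameters for a single group g
Lambda_x = np.array([[1.0], [0.8], [1.1], [0.9]])  # factor loadings, Lambda_x
Phi = np.array([[0.6]])                            # latent covariance matrix, Phi
Theta_delta = np.diag([0.40, 0.50, 0.30, 0.45])    # diagonal error covariances, Theta_delta
tau_x = np.array([3.1, 2.9, 3.4, 3.0])             # intercepts, tau_x
kappa = np.array([0.0])                            # latent mean, kappa

# Model-implied covariance matrix: Sigma = Lambda Phi Lambda' + Theta_delta
Sigma = Lambda_x @ Phi @ Lambda_x.T + Theta_delta

# Model-implied means: E(X) = tau_x + Lambda kappa, which reduces to tau_x
# under the typical assumption kappa = 0
mu = tau_x + (Lambda_x @ kappa)

print(np.round(Sigma, 3))
print(mu)
```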
Albeit with some variations,2 the general approach followed in practice is that if the null hypothesis, H0 : Σ^(1) = Σ^(2) = ··· = Σ^(G) for G groups, is rejected, a set of nested tests that proceed from least to most restrictive is conducted. Typically, an investigator begins by testing a scale across populations for configural invariance (Horn et al., 1983; Horn & McArdle, 1992), also referred to as a test of ‘‘same form’’ (Bollen, 1989) or the ‘‘practical scientist’s’’ invariance (Horn & McArdle, 1992). Accepting the hypothesis of configural invariance (via a chi-square test of model fit with appropriate degrees of freedom) provides evidence that the same number of latent variables (ξ) with the same pattern of factor loadings (Λ_x), intercepts (τ_x), and measurement errors (δ) underlies a set of indicators. Typically, the second test in the hierarchy is one of metric invariance or, equivalently, weak factorial invariance (Meredith, 1993). The null hypothesis for examining metric invariance is H0 : Λ_x^(1) = Λ_x^(2) = ··· = Λ_x^(G). In other words, the pattern and value of the salient factor loadings (Horn et al., 1983) should be statistically equal across populations. The traditional test is a chi-square difference test with degrees of freedom equal to the number of imposed parameter constraints. Retaining the hypothesis of metric invariance suggests that the strength of the relationship between the latent variable(s) and the observed variables is the same across populations. In practice, the third test in the hierarchy is often one of scalar invariance or strong factorial invariance (Meredith, 1993).
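The chi-square difference test used between adjacent models in this hierarchy can be sketched as follows; the function name and the illustrative fit statistics are ours, and only the scipy call is standard:

```python
from scipy.stats import chi2

def chisq_difference_test(chisq_restricted, df_restricted, chisq_free, df_free):
    """Chi-square difference (likelihood-ratio) test for nested MG-CFA
    models; the df equal the number of newly imposed parameter constraints."""
    d_chisq = chisq_restricted - chisq_free
    d_df = df_restricted - df_free
    p_value = chi2.sf(d_chisq, d_df)  # upper-tail probability
    return d_chisq, d_df, p_value

# Hypothetical fit statistics: configural (free) vs. metric (restricted) model
d_chisq, d_df, p = chisq_difference_test(310.2, 120, 250.7, 96)
```

A small p-value here would lead an investigator to reject the more restrictive level of invariance.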
Table 1. Number of Participating Groups in Recent International Education Surveys.

Study       Groups  Measurement invariance investigated?
TIMSS 2011  63      No
PIRLS 2011  49      No
PISA 2012   64      Yes, but no reported results at the time of writing.
SITES 2006  20      No
ICCS 2009   38      No
TALIS 2008  24      Yes

Note. TIMSS = Trends in International Mathematics and Science Study; PIRLS = Progress in International Reading Literacy Study; PISA = Programme for International Student Assessment; SITES = Second Information Technology in Education Study; ICCS = International Civic and Citizenship Education Study; TALIS = Teaching and Learning International Survey.
To provide further empirical context for the current article, the number of groups
participating in several of the most current international education surveys is included
in Table 1. We also note that TALIS 2008 is currently the only study with reported
empirical results of measurement invariance analyses. As such, we use these results
as the foundation for our investigation. To that end, TALIS analysts evaluated several
scales for measurement invariance; although the hypotheses of configural and metric invariance were supported (OECD, 2010a), scalar invariance was generally not supported. To get a sense of the degree of difference in TALIS parameter estimates across countries, we fit configural invariance models to two arbitrarily chosen teacher scales (Classroom Disciplinary Climate, with four items, and Structuring Teacher Practices, with five items). In both cases, we adhered to current operational
procedures and assumed that the observed variables were normally distributed.
Descriptive statistics for the intercept and slope estimates on both scales across the
23 countries can be found in Table 2, where we can see large differences in both
slopes and intercepts. Differences are particularly notable on the Structuring Teacher
Practices scale, where the range of the slopes across countries and items is about 2
and the intercept differences range from 3.30 to 3.94, depending on the item. Given
that these items have five response options (from 1 to 5), cross-country differences of
this magnitude in intercepts and slopes are quite substantial.
Method
Study Design
We used a simulation study in order to address our research question as it related to
the performance of MG-CFA and associated fit criteria when the number of groups
was relatively large and sample sizes within each group were varied. To more closely
mimic actual data and operational procedures, data were simulated as ordered catego-
rical but analyzed with a normal model. The process used to select population item-
and person-parameter values is discussed subsequently. To obtain reasonable values
for our population parameters, we selected the same two scales described above
Table 2. Descriptive Statistics From Normal MG-CFA Models Fit to Two Selected TALIS
Scales Across 23 Countries.
Note. MG-CFA = Multiple-group confirmatory factor analysis; TALIS = Teaching and Learning
International Survey.
above the estimated parameter mean. For example, noninvariant slopes were selected from N(1.2556 + √0.506, 0.506). The number, nature, and assignment of noninvariant parameters are discussed subsequently.
Latent variable means were sampled once for each group from N(0.223, 0.436),
which helped to ensure meaningful between-group differences. We sampled latent
variable variances for each group from a uniform distribution (U[0.5, 2.5]) because our originally chosen distribution, χ²(1), resulted in latent variable variances that were overly small, which caused a number of numerical difficulties during simulation.
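Under the generating scheme just described, the per-group latent parameters could be drawn as in the following sketch; the seed is arbitrary, and we read N(0.223, 0.436) as mean and variance, consistent with the notation used for the slope distributions:

```python
import numpy as np

rng = np.random.default_rng(20)  # arbitrary seed for reproducibility
n_groups = 20

# One latent mean per group from N(0.223, 0.436), treating 0.436 as a variance
latent_means = rng.normal(loc=0.223, scale=np.sqrt(0.436), size=n_groups)

# One latent variance per group from U[0.5, 2.5], avoiding the overly small
# variances produced by the originally chosen chi-square(1) distribution
latent_variances = rng.uniform(0.5, 2.5, size=n_groups)
```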
We used these generating distributions to create a number of conditions,
where we manipulated several factors, including the length of the scale, the number
of noninvariant items, the nature of noninvariance, and the number of groups. For
each simulated condition, the population model was assumed to be unidimensional,
which is consistent with the approach used by the OECD for developing noncognitive
scales (OECD, 2010a). Furthermore, to create reasonable conditions for examining
the performance of typical multiple-group models and associated evaluation criteria,
the models were considered to be correctly specified for all groups under consider-
ation. In other words, the hypothesis of configural invariance should hold across all
groups. A description and rationale for each of these study factors is provided below.
Number of groups. The number of groups examined in the study was set at 10 or 20.
Recall that one of the purposes of the study was to examine measurement invariance
criteria in a relatively large group setting. Based on this motivation, these group sizes
more closely approximate the operational contexts of large-scale surveys, where
group sizes typically range from around a dozen, particularly in the case of field
trials, to 60 or more countries for the main survey in larger studies. For our particular
analysis, group sizes of 10 and 20 are similar to the TALIS context, which is the only
study that currently has published results associated with establishing measurement
invariance for noncognitive scales (OECD, 2010a).
Scale length. The number of items per scale considered in the study was set at five or
six. This scale length is typical of scales in studies such as TALIS, where most
noncognitive scales (e.g., teacher beliefs about teaching profession) are composed of
just a few items (OECD, 2010a).
Noninvariant items. The number of items affected by noninvariance was zero, two, or
three, regardless of the length of the scale. In five-item scales with noninvariant items
this implied that either 40% or 60% of the items were affected by cross-group para-
meter noninvariance. In the six-item scales, this meant that 33% or 50% of the items
were noninvariant.
Analysis
Data generation and analysis. Data were simulated in Mplus 6.12 (Muthén & Muthén,
1998-2010) according to an ordered categorical model with parameters as previously
described (see Table 3 for distribution means and variances for population para-
meters). To set the scale and origin of the latent variable for multiple-groups data
generation we used the methods described in Millsap and Yun-Tein (2004). In line
with operational procedures in TALIS, the data were subsequently analyzed under an
assumption of normality in the observed variables. The origin and scale of the latent variable were set by fixing the first group's latent variable mean to 0 and its variance to 1, while the first factor loading in each group was fixed at 1.
Although this is a meaningful misspecification and can result in interpretation prob-
lems (Lubke & Muthén, 2004), it is the current method of choice, if investigations of
measurement invariance are done at all in international educational assessments. Our
study design yielded a total of 28 conditions (2 group sizes × 2 scale lengths × 2 noninvariant item sets × 3 sources of noninvariance + 4 fully invariant conditions), and each condition was replicated 500 times.
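The condition count can be verified by enumerating the crossed factors; the factor labels below are ours:

```python
from itertools import product

group_counts = [10, 20]
scale_lengths = [5, 6]
noninvariant_item_counts = [2, 3]
noninvariance_sources = ["slopes", "thresholds", "slopes_and_thresholds"]

# Crossed noninvariance conditions: 2 x 2 x 2 x 3 = 24
noninvariant_conditions = list(product(
    group_counts, scale_lengths, noninvariant_item_counts, noninvariance_sources))

# One fully invariant condition per group-count x scale-length cell: 4
fully_invariant_conditions = list(product(group_counts, scale_lengths))

print(len(noninvariant_conditions) + len(fully_invariant_conditions))  # 28
```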
Results
In the following section, we present the results for the five-item scale (overall fit at
each assumed level of invariance followed by relative fit for metric and scalar
invariance). We then present the results in a similar fashion for the six-item scale.
Results for the fit measures are reported as averages across the 500 replications
within each respective condition along with standard deviations to indicate the
degree to which results varied across replications.
Five-Item Conditions
Overall fit. As shown in Table 4, Panel A, the average chi-square values under config-
ural invariance were very large relative to the respective degrees of freedom across
each condition, suggesting that the chi-square test detected misfit at configural invar-
iance in all conditions. Average chi-square values for overall fit under an assumption
of metric invariance were predictable in that the results supported further deteriora-
tion in model fit for all conditions, including when data were generated as fully invar-
iant. We also observed that the chi-square test statistics were consistently higher in
the 20-group conditions than in equivalent 10-group conditions. As discussed in the
‘‘Method’’ section, the data were simulated as ordered categorical but the models
assumed normality of observed variables, which suggested here that the chi-square
had very high power to detect this type of model misspecification.
Although the chi-square test results point to misspecification in all conditions that
assumed configural invariance, the fit indices, namely the root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker–Lewis index (TLI), and standardized root mean square residual (SRMR), yielded better results in all invariance conditions and group sizes. In particu-
lar, the CFI and TLI were all larger than .980 and the SRMR never exceeded .020 in
all conditions, suggesting acceptable model fit. The RMSEA, on the other hand,
yielded values in the range of .059 to .073, which is slightly higher than the typical
.050 cutoff.
As the level of assumed invariance was increased from configural to metric and
from metric to scalar invariance, as expected, the fit indices yielded poorer model fit
(see Table 5). For example, when the slopes were assumed invariant, fit indices
yielded lower CFI and TLI values (ranging between .840 and .900, in Conditions 3
through 6 and 11 through 14) and higher SRMR values and RMSEA values greater
than .20 when metric invariance was examined, whereas in fully invariant conditions and conditions with threshold noninvariance (Conditions 1, 2, and 7 through 10), fit
indices remained at acceptable levels, although the RMSEA values suggested only
marginal fit. Similarly, expected patterns were found when scalar-invariant models
were fit to the data in all conditions. In particular, indices suggested acceptable fit
only in fully invariant conditions, particularly when the number of groups was
10, with marginal RMSEA values. In addition, the number of groups had differential
impacts across the fit indices, although in general, 10-group conditions had more
favorable outcomes than their 20-group counterparts. For example, when data were
generated with two noninvariant slopes and thresholds, all fit indices suggested better
model fit in the 10-group condition compared with the 20-group condition. In con-
trast, when two thresholds were modeled as noninvariant, all fit indices except for
the SRMR slightly favored the 20-group over the 10-group condition in terms of bet-
ter fit.
Table 4. Average Fit Values for Configural Invariance (Overall Fit) in Five-Item (Panel A) and Six-Item (Panel B) Conditions for 10 and 20
Groups.
Table 4. (continued)

Condition (n noninvariant)     χ² (10 / 20 groups)                    df (10 / 20)  RMSEA (10 / 20)            CFI (10 / 20)               TLI (10 / 20)              SRMR (10 / 20)
Slopes (2)                     1,019.45 (69.79) / 2,789.49 (132.26)   91 / 181      .059 (.002) / .066 (.002)  .994 (.001) / .992 (<.001)  .990 (.001) / .987 (.001)  .020 (.002) / .017 (.001)
Slopes (3)                     1,013.81 (69.45) / 2,606.78 (131.14)   91 / 181      .058 (.002) / .064 (.002)  .994 (.001) / .993 (.001)   .990 (.001) / .988 (.001)  .019 (.002) / .017 (.001)
Intercepts (2)                 1,111.42 (72.82) / 3,038.99 (141.35)   91 / 181      .061 (.002) / .069 (.002)  .994 (.001) / .992 (<.001)  .989 (.001) / .986 (.001)  .018 (.002) / .018 (.001)
Intercepts (3)                 1,186.49 (76.03) / 2,983.74 (134.95)   91 / 181      .064 (.002) / .068 (.002)  .993 (.001) / .992 (<.001)  .989 (.001) / .986 (.001)  .019 (.002) / .017 (.001)
Intercepts and thresholds (2)  1,352.44 (83.27) / 3,964.11 (165.43)   91 / 181      .068 (.002) / .079 (.002)  .992 (.001) / .988 (.001)   .986 (.001) / .981 (.001)  .019 (.002) / .019 (.001)
Intercepts and thresholds (3)  1,185.17 (75.28) / 3,649.52 (154.49)   91 / 181      .064 (.002) / .076 (.002)  .992 (.001) / .989 (.001)   .988 (.001) / .982 (.001)  .019 (.002) / .018 (.001)

Note. Standard deviations across replications in parentheses. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index.
Table 5. Overall Average Fit Values for Metric (Panel A) and Scalar (Panel B) Invariance in Five-Item Conditions.
Condition  Level  Nj  χ² (SD)  df  RMSEA (SD)  CFI (SD)  TLI (SD)  SRMR (SD)
11 2_S&T 10 25,356.78 (306.56) 87 .263 (.002) .784 (.003) .825 (.002) .590 (.014)
12 2_S&T 20 64,718.93 (444.94) 177 .278 (.001) .748 (.002) .801 (.002) .860 (.012)
13 3_S&T 10 22,452.27 (282.97) 87 .248 (.002) .802 (.003) .839 (.002) .213 (.002)
14 3_S&T 20 57,649.78 (454.00) 177 .262 (.001) .769 (.002) .818 (.002) .213 (.002)
Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index. Under the Level heading, 0, 2, and 3 correspond to the number of noninvariant items; N = none, T = thresholds, S = slopes, S&T = slopes and thresholds; Nj = number of groups.
Relative fit. As shown in Table 6, the chi-square difference test was statistically sig-
nificant, suggesting rejection of both metric and scalar invariance, for all conditions,
including both fully-invariant conditions. In addition, and as expected, an increase in
the number of groups (from 10 to 20) resulted in consistently larger chi-square differ-
ence test statistics for both metric and scalar invariance models across all studied
conditions.
With respect to the relative fit indices, several interesting findings emerged.
Overall, fit indices performed well at examining hypotheses of metric invariance.
For example, ΔCFI supported metric-invariant hypotheses for 10-group data that were compatible with the models (Conditions 1, 7, and 9), whereas all conditions with noninvariant slopes yielded values much greater in magnitude than the −.010 cutoff. In the equivalent 20-group conditions, ΔCFI values were close to or somewhat greater than the −.010 cutoff value (−.010, −.014, and −.014 for Conditions 2, 8, and 10, respectively). This finding provides some evidence that for larger numbers of groups, a more liberal criterion for establishing measurement invariance might be warranted. Findings associated with the ΔRMSEA indicated good performance for the 10-group fully invariant condition (Condition 1) and the two-item scalar noninvariant condition (Condition 7). In both situations, the ΔRMSEA values were below the generally accepted .010 criterion. In the 20-group conditions, where data were either fully invariant (Condition 2) or scalar noninvariant (Conditions 8 and 10), the ΔRMSEA indices ranged from .022 to .026, also suggesting that a more liberal criterion is warranted in situations with a larger number of groups. Furthermore, in the 10-group condition where three items were scalar noninvariant, the average ΔRMSEA value was .022.
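A decision rule based on these relative fit indices might be sketched as follows; the function and the adjustable cutoffs are ours, with defaults set at the conventional criteria of a CFI drop no greater than .010 and an RMSEA increase no greater than .010:

```python
def supports_invariance(cfi_restricted, cfi_free, rmsea_restricted, rmsea_free,
                        dcfi_cutoff=-0.010, drmsea_cutoff=0.010):
    """Return True if the change in CFI and RMSEA from the less to the more
    restrictive model stays within the supplied cutoffs."""
    dcfi = cfi_restricted - cfi_free        # negative when fit worsens
    drmsea = rmsea_restricted - rmsea_free  # positive when fit worsens
    return dcfi >= dcfi_cutoff and drmsea <= drmsea_cutoff

# Hypothetical values: a small decrement passes, a large one fails
print(supports_invariance(0.992, 0.994, 0.066, 0.059))  # True
print(supports_invariance(0.850, 0.990, 0.250, 0.065))  # False
```

Our results suggest that the default cutoffs would need to be relaxed when the number of groups is large.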
In examining relative model fit for scalar invariance (i.e., further imposing equal-
ity constraints on the model parameters), fit indices were largely effective at cor-
rectly identifying fully invariant data and also at discriminating between metric- and
scalar-invariant data. Specifically, the average DRMSEA and DCFI values were
mostly within typically accepted criteria for both 10- and 20-group, fully invariant
data (Conditions 1 and 2), suggesting that the hypothesis of scalar invariance would
be supported for both these conditions, assuming that a metric invariance hypothesis
was supported in a previous step. Additionally, data generated to have only noninvariant thresholds had associated relative fit indices that were outside generally acceptable cutoff values for the 10- and 20-group cases (Conditions 7, 8, 9, and 10). Both of these findings provide some evidence that typically accepted cutoffs for relative fit indices were reasonably well suited for retaining a hypothesis of fully invariant data and for determining whether metric-invariant data were additionally scalar invariant.
As Table 6 shows, in several instances, additional constraints yielded average rela-
tive fit index values that would support a scalar invariance hypothesis for data that
were not commensurate with this assumption. In general, however, this finding is not
consequential as long as evidence against metric invariance was observed in a previ-
ous step. For example, in Condition 3, where two slopes were noninvariant, average
ΔCFI reported was −.006. Similarly, in Condition 6, where three slopes were
Table 6. Relative Fit Tests Results for Five-Item Conditions.
                                                              Metric invariance                                Scalar invariance
Condition  Invariance location    No. noninvariant  Groups    χ² difference (SD)  df  ΔRMSEA (SD)  ΔCFI (SD)   χ² difference (SD)  df  ΔRMSEA (SD)  ΔCFI (SD)
12         Slopes and thresholds  2                 20        29,932.41 (338.82)  76  .155 (.002)  −.117 (.001)  32,626.84 (383.26)  76  .044 (.001)  −.127 (.002)
13         Slopes and thresholds  3                 10        13,481.24 (221.79)  36  .174 (.003)  −.119 (.002)  8,396.81 (169.70)   36  .015 (.001)  −.074 (.002)
14         Slopes and thresholds  3                 20        25,119.49 (329.46)  76  .141 (.003)  −.101 (.001)  30,665.01 (341.58)  76  .048 (.001)  −.123 (.002)
Note. Standard deviations across replications in parentheses. CFI = comparative fit index; RMSEA = root mean square error of approximation.
noninvariant, the average ΔRMSEA reported was .001. But in both conditions, the fit
indices associated with a test for metric invariance were well outside acceptable cri-
teria, rendering a further test of scalar invariance unnecessary.
Six-Item Conditions
Overall fit. Under the assumption of configural invariance (Table 4, Panel B), average chi-square values were very large relative to the degrees of freedom and highly statistically significant for each condition. This suggests that, according to the chi-square test of model fit, none of these data meet minimum criteria for configural invariance. According to Table 7, Panel A, chi-square values for overall fit
under an assumption of metric invariance are predictable, in that imposing restrictions
on the data resulted in a decrement in fit under all conditions, even for data that are
fully invariant. Similarly, when a model that assumed scalar invariance was fit to data
for all conditions (Table 7, Panel B), model fit declined, according to the chi-square
test. Furthermore, the chi-square test exhibited typically expected performance in that for data generated under equivalent conditions (e.g., full invariance), the test statistics were consistently higher when the number of groups was greater (i.e., 20 compared with 10). Similar to the five-item conditions, these findings suggest that the chi-
square has very high power to detect this sort of model misspecification. And, given
that in each condition the simulated data are commensurate with an assumption of
configural invariance,3 the chi-square test will generally not be well suited for evalu-
ating the hypothesis of configural invariance in this sort of situation.
Despite the finding that the chi-square test of model fit rejects all models under
the assumption of configural invariance, the fit indices, including the RMSEA, CFI,
TLI, and SRMR performed much better in all conditions and group sizes. In particu-
lar, the CFI and TLI were all larger than .98 and the SRMR was not larger than .02,
suggesting that these indices support the hypothesis of configural invariance. One exception to this trend was the RMSEA, which never averaged below the typically accepted cutoff of .05. Instead, under all conditions, this index varied between .058 and .079, indicating less-than-optimal model fit. And, as expected, further restrictions
on the models resulted in deteriorations in these indices. Furthermore, violations of
hypothesized levels of invariance were easily detected by all fit indices. As an exam-
ple, data that were simulated such that three slopes and three sets of thresholds varied
across the groups had associated overall fit indices that fell well outside any conven-
tionally accepted criteria in both the 10- and 20-group conditions when a model that
assumed metric invariance was fit to the data. In contrast, data that were commensu-
rate with a more stringent level of invariance than the level under consideration only
experienced slight deteriorations in model fit (e.g., a metric-invariant model fit to
fully invariant data). One notable finding here is that when only the number of groups
differed but all other parameters were equal across a pair of conditions, the SRMR
tended to be larger for the 20-group condition and this was particularly pronounced
for models that assumed scalar invariance. This finding is somewhat surprising given
that this index is not explicitly a function of sample size (Bentler, 1995).
Table 7. Overall Average Fit Values for Metric (Panel A) and Scalar (Panel B) Invariance in Six-Item Conditions.
Condition  Level  Nj  χ² (SD)  df  RMSEA (SD)  CFI (SD)  TLI (SD)  SRMR (SD)
11 2_S&T 10 26,491.31 (304.51) 181 .222 (.001) .824 (.002) .854 (.002) .541 (.013)
12 2_S&T 20 6,970.93 (478.74) 371 .238 (.001) .787 (.002) .828 (.001) .789 (.010)
13 3_S&T 10 32,195.82 (402.82) 181 .244 (.002) .778 (.003) .816 (.002) .277 (.004)
14 3_S&T 20 80,978.64 (563.38) 371 .256 (.001) .747 (.002) .795 (.002) .288 (.005)
Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index. Under the Level heading, 0, 2, and 3 correspond to the number of noninvariant items; N = none, T = thresholds, S = slopes, S&T = slopes and thresholds; Nj = number of groups.
Relative fit. Based on the results presented in Table 8, the chi-square difference test is
strongly statistically significant, suggesting rejection of both metric and scalar invar-
iance hypotheses, for all conditions, including for fully invariant data. And, as
expected, the test statistic values were larger when only group sizes differed across
two comparable conditions (e.g., when two slopes are metric noninvariant). Similar
to the findings of overall fit, the chi-square difference test is overly powerful under
the conditions examined and is likely not a useful measure for examining measure-
ment invariance under these conditions.
A number of interesting findings emerged with respect to the relative fit indices, which we discuss here. Based on the six-item conditions examined, the
DRMSEA generally performed very well at examining hypotheses of metric invar-
iance in the 10-group condition. In particular, fully invariant (Condition 1) and scalar
noninvariant data (Conditions 7 and 9) had associated DRMSEA well below the .010
criteria. And all conditions with noninvariant slopes had fit indices larger than .10
(more than 10 times typical cutoff values). In contrast, the DRMSEA values for 20-
group conditions where the data were commensurate with a hypothesis of equal
slopes (Conditions 2, 8, and 10) were all larger than .025, suggesting that a more lib-
eral criterion might be warranted with larger numbers of groups. Similarly, the DCFI
supported metric-invariant hypotheses of 10-group data that were commensurate
with the models (Conditions 1, 7, and 9), whereas all conditions with noninvariant
slopes had relative fit index values much greater in magnitude than the 2.010 cutoff.
And similar to the DRMSEA measure, the DCFI values for commensurate 20-group
data (Conditions 2, 8, and 10) were somewhat over the 2.010 cutoff. By condition,
average values were 2.013, 2.016, and 2.017, respectively. This again indicates
that a more liberal criterion is warranted for larger numbers of groups.
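The decision logic just described can be sketched as a simple screening rule. This is a hypothetical helper rather than software used in the study; the liberalized 20-group ΔRMSEA bound of .030 is an assumed illustrative value motivated by the .025+ averages we observed, not a validated cutoff.

```python
def metric_invariance_screen(cfi_metric, cfi_config,
                             rmsea_metric, rmsea_config, n_groups):
    """Screen a metric-invariance hypothesis with change-in-fit indices.

    Conventional cutoffs: reject if dCFI < -.010 or dRMSEA > .010.
    With many groups a more liberal dRMSEA bound is applied (the .030
    value is an assumption for illustration only).
    """
    delta_cfi = cfi_metric - cfi_config        # negative when fit worsens
    delta_rmsea = rmsea_metric - rmsea_config  # positive when fit worsens
    rmsea_bound = 0.030 if n_groups >= 20 else 0.010
    supported = (delta_cfi >= -0.010) and (delta_rmsea <= rmsea_bound)
    return delta_cfi, delta_rmsea, supported
```

With the (made-up) values `metric_invariance_screen(0.950, 0.958, 0.062, 0.057, 10)`, ΔCFI = −.008 and ΔRMSEA = .005, so the metric hypothesis is retained; grossly noninvariant slopes would drive both deltas past their bounds.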
Adding equality constraints on the model parameters to examine the hypothesis of scalar invariance resulted in fit indices that were largely effective at identifying fully invariant data and at discriminating between metric- and scalar-invariant data. Specifically, the average ΔRMSEA and ΔCFI values were well within typically accepted criteria for both 10- and 20-group fully invariant data, suggesting that the hypothesis of scalar invariance would be supported for both of these conditions, assuming that a metric invariance hypothesis was supported in a previous step. Furthermore, data generated to have only noninvariant thresholds had associated relative fit indices well outside generally accepted cutoff values for both group sizes (Conditions 7, 8, 9, and 10). As with the five-item conditions, these results offer some evidence that generally accepted cutoffs for both relative fit indices are reasonably well suited for retaining fully invariant data and for determining whether metric-invariant data are additionally scalar invariant. And although in some conditions additional constraints resulted in average relative fit index values that would generally provide evidence in favor of the scalar invariance hypothesis, this is of little consequence as long as evidence against metric invariance was heeded in a previous analytic step.
Examples of this include the average ΔCFI value for Condition 3, where two slopes are noninvariant (−.005), and the average ΔRMSEA value where three slopes are noninvariant (.006).

Table 8. Relative Fit Test Results for Six-Item Conditions.

| Condition | Invariance location | No. of noninvariant items | Groups | Metric χ² diff. (SD) | df | ΔRMSEA (SD) | ΔCFI (SD) | Scalar χ² diff. (SD) | df | ΔRMSEA (SD) | ΔCFI (SD) |
| 12 | Slopes and thresholds | 2 | 20 | 31,959.44 (39.57) | 95 | .118 (.002) | −.098 (.001) | 33,777.39 (407.23) | 95 | .040 (.001) | −.103 (.001) |
| 13 | Slopes and thresholds | 3 | 10 | 18,891.13 (279.75) | 45 | .159 (.003) | −.131 (.002) | 12,119.52 (239.89) | 45 | .022 (.001) | −.084 (.002) |
| 14 | Slopes and thresholds | 3 | 20 | 34,691.26 (421.90) | 95 | .128 (.002) | −.109 (.001) | 42,637.86 (435.46) | 95 | .052 (.001) | −.134 (.002) |

Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual.

We note one seemingly odd finding: the ΔRMSEA is negative (albeit small) for some conditions. Given that the RMSEA is a function of the
ratio of the chi-square statistic to the degrees of freedom for the model under consideration (Bentler, 1995; Hu & Bentler, 1999), this finding can be explained by the fact that this ratio was, on average, smaller for the scalar-invariant models due to a relatively large increase in the degrees of freedom obtained by imposing equality constraints on the intercepts (1975.61/181 = 5.38 compared with 1583.61/136 = 12.12, respectively). And given that the chi-square difference test and ΔCFI are in the expected directions, this finding should be interpreted as within the acceptable cutoff; it simply points to conditions under which this measure can be slightly negative.
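To see how the ΔRMSEA can dip below zero, recall the RMSEA point estimate as a function of the chi-square/df ratio. The sketch below uses the common single-sample formula with an assumed total sample size of 2,000 (multiple-group software may rescale by the number of groups) and chi-square/df pairs patterned on those cited above; it is illustrative, not a recomputation of the study's results.

```python
import math

def rmsea_estimate(chi2_stat, df, n):
    """RMSEA point estimate: sqrt(max(0, (chi2/df - 1) / (n - 1))).
    Common single-sample formula; multiple-group programs may
    additionally rescale by the number of groups."""
    return math.sqrt(max(0.0, (chi2_stat / df - 1.0) / (n - 1.0)))

# Adding intercept constraints raises df faster than the chi-square here,
# so the chi-square/df ratio falls and the RMSEA shrinks slightly.
# n = 2000 is an assumed value for illustration.
metric_rmsea = rmsea_estimate(1583.61, 136, 2000)
scalar_rmsea = rmsea_estimate(1975.61, 181, 2000)
delta_rmsea = scalar_rmsea - metric_rmsea   # slightly negative
```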
Discussion
In making cross-national comparisons, there is ambiguity regarding whether differences in scale score means can be attributed to authentic differences between countries or to cross-country measurement differences arising from cultural response biases, translation errors, or cultural differences in understanding the underlying construct. Thus, without evidence to support measurement equivalence, any claims or
conclusions regarding comparative differences are necessarily weak (Horn, 1991;
Vandenberg & Lance, 2000). One means for investigating the comparability of scale
scores is through investigation of measurement invariance; however, little is known
about the performance of this method and typically used fit measures when the num-
ber of groups is large and the sample sizes are large and varied. As such, the current
article provides some evidence regarding the performance of MG-CFA in a large-
scale assessment context.
To provide some insights into the degree to which typically accepted MG-CFA
model-to-data fit measures are applicable in contexts where the number of groups is
large and sample sizes are varied, we conducted a simulation study that mimics
operational procedures currently used by the OECD, an international organization
involved in educational surveys and evaluation. To that end, we generated multiple-
group measurement models that drew on an empirical analysis of two TALIS scales
to arrive at plausible item- and person-parameters. We examined several factors in
this study including two different scale lengths (five and six items), different sources
of noninvariance (in slopes, thresholds, and both slopes and thresholds), and two dif-
ferent numbers of groups (10 and 20). To further align our study with authentic con-
ditions, we simulated unidimensional ordered categorical data; however, in line with
operational OECD procedures, we applied normal models to these data. We then
evaluated several widely accepted measures of model fit in the multiple-group con-
text. We subsequently summarize several interesting findings along with implica-
tions for applied researchers and future areas of research.
Before proceeding to a more detailed discussion, we draw attention to one important point: we recognize that the generating models (ordered categorical) and the analytic models (normal) are inconsistent and that this choice is theoretically not best practice. However, given the current state of practice, this is a purposeful choice that reflects methods currently used by agencies that conduct investigations of measurement invariance, when they are conducted at all.
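The mismatch described above can be made concrete: the generating model produces ordered categories by cutting a normal latent response at thresholds, while the analytic model treats the resulting category codes as normal outcomes. Below is a minimal sketch of the generating side only; the loading and threshold values are placeholders, whereas in the study they came from the empirical TALIS calibration.

```python
import random

def ordinal_response(loading, thresholds, rng):
    """Generate one ordered-categorical item response from a
    unidimensional normal factor model: a continuous latent response
    y* = loading * eta + e is cut at ascending thresholds, yielding a
    category code in 0..len(thresholds)."""
    eta = rng.gauss(0.0, 1.0)                    # person's factor score
    y_star = loading * eta + rng.gauss(0.0, 1.0) # latent response + error
    return sum(y_star > t for t in thresholds)   # count thresholds exceeded

rng = random.Random(2015)
# Placeholder parameters, not the TALIS-derived values used in the study.
responses = [ordinal_response(0.8, [-1.0, 0.0, 1.0], rng) for _ in range(500)]
```

Fitting a normal-theory CFA to `responses` as if they were continuous reproduces the purposeful model-data disjuncture discussed above.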
Under the studied conditions, our results provide some evidence that the chi-square test is likely not useful as a test of overall model fit, as this test strongly suggested model misfit across all studied conditions. For the same reasons, the chi-square difference test was also not a suitable measure in this context. These findings are not surprising, given that the studied conditions included large numbers of groups and large sample sizes, and that we began with a theoretically misspecified model (one that assumes normality when the data are ordered categorical). Despite this result, we note that the chi-square test statistics (overall and relative) behaved as expected across all conditions, in that these values were generally larger for larger numbers of groups and as more constraints were placed on the models. Our results also support earlier findings (Cheung & Rensvold, 2002; Meade et al., 2008) that alternative fit indices are often preferable to chi-square tests of model fit in the multiple-group context.
Given the results of the chi-square test statistics, the following discussion focuses on the utility of the overall and relative fit indices in the studied conditions. Of the fit indices considered, our findings suggest that currently accepted cutoffs for the CFI, TLI, and SRMR are generally suitable. In particular, for overall fit evaluations of configural, metric, and scalar invariance, commensurate models fit to data were overwhelmingly supported by CFI greater than .95 and TLI greater than .95. In fact, under every condition studied, neither of these statistics was less than .97 when the model was consistent with the data, providing evidence that a slightly more stringent cutoff would be reasonable in this context. In most cases, an SRMR of less than .08 was also suitable for determining overall fit at different levels of invariance; however, a few exceptions existed. For example, in the 20-group, six-item, fully invariant data condition, a metric-invariant model resulted in an SRMR of .104, and a few other conditions produced similar results. On first inspection, this might suggest that a more liberal SRMR criterion should be considered (e.g., .10 or .11); however, in several instances, hypothesized models that did not correspond to the data also resulted in SRMR values of around .10. As such, we recommend that the SRMR not be used in isolation, if it is used at all. Rather, it should be used in conjunction with the CFI and TLI, and where inconsistencies among these measures arise, the analyst is better served by relying on the CFI and TLI. In contrast, our findings lead us to recommend a slightly more liberal criterion for the RMSEA, particularly when the number of groups is relatively large. In particular, we found several instances where a model-to-data match did not meet the minimum RMSEA criterion, a finding that was more pronounced in the 20-group conditions; the RMSEA was nonetheless highly sensitive to model-to-data mismatches. This leads us to suggest an RMSEA cutoff of around .10 when there are at least 10 groups.
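Collecting these recommendations, one could sketch an overall-fit screen as below. The function name and exact bounds are our illustrative choices (the .06 RMSEA bound for small numbers of groups is a conventional assumed value, not from this study), and the SRMR is deliberately treated as advisory only.

```python
def overall_fit_screen(cfi, tli, rmsea, srmr, n_groups):
    """Screen overall MG-CFA fit per the recommendations above:
    rely on CFI/TLI (> .95), allow RMSEA up to about .10 when there
    are at least 10 groups, and treat SRMR as advisory only."""
    rmsea_bound = 0.10 if n_groups >= 10 else 0.06  # .06: conventional value
    fit_ok = cfi > 0.95 and tli > 0.95 and rmsea < rmsea_bound
    srmr_warning = srmr > 0.08   # do not reject on the SRMR alone
    return fit_ok, srmr_warning
```

For instance, a 20-group model with CFI = .97, TLI = .97, RMSEA = .09, and SRMR = .104 would pass the core screen while flagging the SRMR, mirroring the fully invariant six-item condition discussed above.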
In terms of the relative fit indices (ΔCFI and ΔRMSEA), we note several recommendations based on our findings. Regardless of scale length, we observed a
traditional cutoffs are suitable for determining scalar invariance. We also recommend
that the overall SRMR be used in conjunction with other measures or not at all for
determining the overall plausibility of configural, metric, and scalar invariance.
Finally, we note that our article was concerned only with the detection, not the cause, of noninvariance. As such, we advocate for an approach whereby, once measurement noninvariance is detected at some level of the hierarchy, follow-up analyses are conducted to locate the source of noninvariance in the scale (e.g., in consultation with cultural studies experts and/or linguists to examine potential sources of variability).
Funding
The author(s) declared receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially funded by a contract from the Organisation for Economic Co-operation and Development.
Notes
1. This assumes the latent variable is scaled by fixing the first factor loading to 1, then
three loadings are free to differ across groups.
2. In the tradition of Jöreskog (1971), researchers might begin their measurement invariance investigation for G groups by first testing the most restrictive null hypothesis H0: Σ(1) = Σ(2) = ⋯ = Σ(G). Assuming this hypothesis is rejected, the researcher either proceeds by releasing constraints on this most restrictive model or by adding constraints to the least restrictive (configural invariance) model. In some situations (OECD, 2010a), the Jöreskog null hypothesis is not tested and the first test of invariance is instead configural.
3. Given that the data are generated as ordered categorical, it is more correct to say that
these data meet the criteria to serve as a baseline model (Millsap & Yun-Tein, 2004); how-
ever, to avoid confusion, we use the term configural invariance.
References
Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in
translated verbal items. Journal of Educational Measurement, 36, 185-198. doi:
10.1111/j.1745-3984.1999.tb00553.x
Bagozzi, R. P. (1977). Structural equation models in experimental research. Journal of
Marketing Research, 14, 209-226. doi:10.2307/3150471
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin,
107, 238-246. doi:10.1037/0033-2909.107.2.238
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate
Software.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606. doi:10.1037/0033-2909.88
.3.588
Bollen, K. (1989). Structural equations with latent variables. New York, NY: Wiley.
Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance.
Structural Equation Modeling, 14, 464-504. doi:10.1080/10705510701301834
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling, 9, 233-255. doi:
10.1207/S15328007SEM0902_5
De Beuckelaer, A., Lievens, F., & Swinnen, G. (2007). Measurement equivalence in the
conduct of a global organizational survey across countries in six cultural regions. Journal
of Occupational and Organizational Psychology, 80, 575-600.
Deshon, R. P. (2004). Measures are not invariant across groups without error variance
homogeneity. Psychology Science, 46, 137-149.
Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage
assessments. International Journal of Testing, 2, 199-215.
Ercikan, K. (2003). Are the English and French versions of the Third International
Mathematics and Science Study administered in Canada comparable? Effects of
adaptations. International Journal of Educational Policy, Research and Practice, 4, 55-76.
French, B. F., & Finch, W. H. (2006). Confirmatory factor analytic procedures for the
determination of measurement invariance. Structural Equation Modeling, 13, 378-402. doi:
10.1207/s15328007sem1303_3
Grisay, A., & Monseur, C. (2007). Measuring the equivalence of item difficulty in the various
versions of an international test. Studies in Educational Evaluation, 33, 69-86. doi:
10.1016/j.stueduc.2007.01.006
Hambleton, R. K. (2002). Adapting achievement tests into multiple languages for international
assessments. In A. C. Porter & A. Gamoran (Eds.), Methodological advances in cross-
national surveys of educational achievement (pp. 58-79). Washington, DC: National
Academies Press.
Hambleton, R. K., Merenda, P. F., & Spielberger, C. D. (Eds.). (2005). Adapting educational
and psychological tests for cross-cultural assessment. New York, NY: Psychology Press.
Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially biased test items: Comparison
of IRT area and Mantel-Haenszel methods. Applied Measurement in Education, 2, 313-334.
doi:10.1207/s15324818ame0204_4
Hancock, G. R. (1997). Structural equation modeling methods of hypothesis testing of latent
variable means. Measurement and Evaluation in Counseling and Development, 30, 91-105.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-
Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale,
NJ: Lawrence Erlbaum.
Horn, J. L. (1991). Comments on ‘‘Issues in factorial invariance.’’ In L. M. Collins & J. L.
Horn (Eds.), Best methods for the analysis of change (pp. 114-125). Washington, DC:
American Psychological Association.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement
invariance in aging research. Experimental Aging Research, 18, 117-144. doi:
10.1080/03610739208253916
Horn, J. L., McArdle, J., & Mason, R. (1983). When is invariance not invariant: A practical
scientist’s look at the ethereal concept of factor invariance. Southern Psychologist, 1,
179-188.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55. doi:
10.1080/10705519909540118
Jöreskog, K. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409-426. doi:10.1007/BF02291366
LeTendre, G. K. (2002). Advancements in conceptualizing and analyzing cultural effects in
cross-national studies of educational achievement. In A. C. Porter & A. Gamoran (Eds.),
Methodological advances in cross-national surveys of educational achievement (pp. 198-
228). Washington, DC: National Academies Press.
Little, T. D. (1997). Mean and Covariance Structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53-76. doi:10.1207/s15327906mbr3201_3
Lord, F. M. (1980). Applications of item response theory to practical testing problems.
Mahwah, NJ: Lawrence Erlbaum.
Lubke, G. H., & Dolan, C. V. (2003). Can unequal residual variances across groups mask differences in residual means in the common factor model? Structural Equation Modeling, 10, 175-192. doi:10.1207/S15328007SEM1002_1
Lubke, G. H., & Muthén, B. O. (2004). Applying multigroup confirmatory factor models for
continuous outcomes to Likert scale data complicates meaningful group comparisons.
Structural Equation Modeling, 11, 514-534. doi:10.1207/s15328007sem1104_2
Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit
indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568-592.
doi:10.1037/0021-9010.93.3.568
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and
confirmatory factor analytic methodologies for establishing measurement equivalence/
invariance. Organizational Research Methods, 7, 361-388. doi:10.1177/1094428104268027
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of
Educational Statistics, 7, 105-118.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance.
Psychometrika, 58, 525-543. doi:10.1007/BF02294825
Millsap, R. E., & Yun-Tein, J. (2004). Assessing factorial invariance in ordered-categorical
measures. Multivariate Behavioral Research, 39, 479-515. doi:10.1207/S15327906MBR3903_4
Muthén, L. K., & Muthén, B. O. (1998-2010). Mplus user’s guide (6th ed.). Los Angeles, CA:
Author. Retrieved from http://www.statmodel.com/download/usersguide/Mplus%20Users%
20Guide%20v6.pdf
Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale
comparability in international assessments. Psychological Test and Assessment Modeling,
53, 315-333.
Olson, J., Martin, M., & Mullis, I. V. S. (2008). TIMSS 2007 technical report. Chestnut Hill,
MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston
College.
Organisation for Economic Co-operation and Development. (2010a). TALIS technical report.
Paris, France: Author.
Organisation for Economic Co-operation and Development. (2010b). PISA 2009 technical
report. Paris, France: Author. Retrieved from http://www.oecd.org/edu/preschoolandschool/
programmeforinternationalstudentassessmentpisa/pisa2009technicalreport.htm
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and
implications. Human Resource Management Review, 18, 210-222. doi:
10.1016/j.hrmr.2008.03.003
Steiger, J. H., & Lind, J. C. (1980). Statistically based tests for the number of common factors.
Presented at the Meeting of the Psychometric Society, Iowa City, IA.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using
logistic regression procedures. Journal of Educational Measurement, 27, 361-370. doi:
10.1111/j.1745-3984.1990.tb00754.x
Thompson, M. S., & Green, S. B. (2006). Evaluating between-group differences in latent
variable means. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A
second course (1st ed., pp. 119-169). Charlotte, NC: Information Age.
Torsheim, T., Samdal, O., Rasmussen, M., Freeman, J., Griebler, R., & Dür, W. (2012). Cross-
national measurement invariance of the teacher and classmate support scale. Social
Indicators Research, 105, 145-160. doi:10.1007/s11205-010-9770-9
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement
invariance literature: Suggestions, practices, and recommendations for organizational
research. Organizational Research Methods, 3, 4-70. doi:10.1177/109442810031002