Abstract
In the field of international educational surveys, equivalence of achievement scale
scores across countries has received substantial attention in the academic literature;
however, an emphasis on scale score equivalence in nonachievement education surveys has emerged only relatively recently. Reflecting the current state of research on multiple-group models, these recent measurement invariance investigations were supported by research that was limited in scope to few groups and relatively small sample sizes. To that end, this study uses data from one large-scale survey as a
basis for examining the extent to which typical fit measures used in multiple-group
confirmatory factor analysis are suitable for detecting measurement invariance in a
large-scale survey context. Using measures validated in a smaller scale context and an
empirically grounded simulation study, our findings indicate that many typical mea-
sures and associated criteria are either unsuitable in a large group and varied sample-
size context or should be adjusted, particularly when the number of groups is large.
We provide specific recommendations and discuss further areas for research.
Keywords
measurement equivalence/invariance, international studies, model fit, multiple groups
1 Indiana University, Bloomington, IN, USA
Corresponding Author:
Leslie Rutkowski, Department of Counseling and Educational Psychology, Indiana University, 201 N. Rose
Avenue, Bloomington, IN 47405, USA.
Email: lrutkows@indiana.edu
Downloaded from epm.sagepub.com at Univ of Connecticut / Health Center / Library on May 22, 2015
32 Educational and Psychological Measurement 74(1)
Rutkowski and Svetina 33
Background
As international educational surveys have grown in number and in numbers of parti-
cipants, the psychometric and cross-cultural scholarly communities have noted
numerous methodological challenges and opportunities that arise in this particular
context. To point to just a few, issues of instrument adaptation (Hambleton, 2002),
scale score comparability (Oliveri & von Davier, 2011), and even defining ‘‘culture’’
(LeTendre, 2002) are manifold when dozens of countries—with different languages
and cultures—participate. From a psychometric perspective, even on a short scale
the number of possible pairwise, cross-country differences on any parameter can be
large. For example, for a four-item scale in which only cross-country differences in the factor loadings are under consideration, the number of possible pairwise differences is 3 × n(n − 1)/2, where n is the number of countries.1 For 10 countries, this is 3 × 45 = 135; for 20 countries, it is 3 × 190 = 570.
Additionally, sample sizes in international educational surveys are typically in the
thousands. Taken together, these characteristics render the probability of the chi-
square goodness-of-fit test detecting misfit somewhere in the model quite high
(Bentler & Bonett, 1980).
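The combinatorics above can be sketched in a few lines; the function name and the default of three free loadings (a four-item scale with one loading fixed for identification) are our own illustrative choices:

```python
from math import comb

def n_pairwise_differences(n_countries: int, n_free_loadings: int = 3) -> int:
    """Count the possible pairwise cross-country differences on the free
    factor loadings: one comparison per country pair, per free loading."""
    return comb(n_countries, 2) * n_free_loadings

print(n_pairwise_differences(10))  # 45 pairs x 3 loadings = 135
print(n_pairwise_differences(20))  # 190 pairs x 3 loadings = 570
```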
Given the wealth of studies that examine the performance of multiple-group confirmatory factor analysis (MG-CFA) in the two-group case (e.g., Chen, 2007; Cheung & Rensvold, 2002; French & Finch, 2006;
Meade et al., 2008; Meade & Lautenschlager, 2004), we do not delve deeply into the
details surrounding MG-CFA. Instead, we briefly discuss the method in general,
summarize the findings from current research in the field, and describe the approach
used in TALIS, which we consider in the current article. Limiting our empirical
focus provides us with a platform from which to more broadly consider the method
of investigating cultural equivalence in a large-group, varied sample size context.
Furthermore, the scope and scale of TALIS is general enough for the findings to
appeal to a wide audience. Finally, the method adopted by TALIS researchers is one
generally recommended in the literature and used more broadly by researchers who
are interested in providing some evidence of the scale score validity of a measure in
multiple populations.
Cultural/Measurement Equivalence/Invariance
With the growth of cross-cultural studies such as TIMSS and TALIS, an interest in
analyzing these types of data has also burgeoned in recent years. This assertion is
supported by a June 2013 search for scholarly (peer reviewed) journal articles on the
Academic Search Premier database using the Boolean search string ‘‘(cultural or
measurement) and (invariance or equivalence).’’ This search resulted in 40 articles
published from 1980 to 1989, 210 articles published from 1990 to 1999, and an
incredible 2,545 articles published from 2000 to 2009. This search includes articles
undertaking within-country comparisons (e.g., evaluating measurement equivalence
by gender or Mexican compared with European Americans). On the other hand, this
search might well omit inquiries into broader model invariance, including general
latent variable model invariance. The example nevertheless underscores the scale
and growth of research concerned with this method.
Generally, investigations of measurement invariance focus on the degree to which
comparisons on the latent variable of interest (e.g., teacher beliefs) can be validly
compared across heterogeneous populations. Although approaches vary (Schmitt &
Kuljanin, 2008; Vandenberg & Lance, 2000) and can depend on the research ques-
tion (Bollen, 1989), the influential work by Horn, McArdle, and colleagues (Horn &
McArdle, 1992; Horn, McArdle, & Mason, 1983) and Meredith (1993) guides much
of the current applied measurement invariance work. We begin with the general factor model given by Σ = Λ_x Φ Λ_x′ + Θ_δ, where Σ represents the covariance matrix of the observed variables and Λ_x represents the matrix of factor loadings that relate the vector of latent variables, ξ, with associated covariance matrix Φ, to the vector of observed variables, X. Finally, Θ_δ represents the covariance matrix of the measurement errors for X. Typically, although not necessarily, Θ_δ is assumed to be diagonal, implying no correlated measurement errors. Not evident in the covariance representation of this model is the mean structure, which also figures importantly into investigations of measurement invariance. In the common factor model, we can represent the mean structure by including a vector of intercepts in the system of equations, denoted by τ_x. Then, the mean of the observed variables can be represented by E(X) = E(τ_x + Λ_x ξ + δ). Under typical assumptions (i.e., E(δ) = 0; E(ξ) = κ = 0), E(X) = τ_x. We can generalize this model to the multiple-population context by allowing for a separate covariance matrix of observed variables for each population, that is, Σ^(g), and a separate mean structure, τ_x^(g), g = 1, . . ., G.
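As a concrete sketch of these definitions, the model-implied moments for one group can be computed directly; all parameter values below are hypothetical, chosen only to illustrate the structure:

```python
import numpy as np

# Hypothetical one-factor, four-indicator parameters for a single group g
Lambda_x = np.array([[1.0], [0.8], [1.1], [0.9]])  # factor loadings, Lambda_x
Phi = np.array([[0.6]])                            # latent covariance matrix, Phi
Theta_delta = np.diag([0.40, 0.50, 0.30, 0.45])    # diagonal error covariances, Theta_delta
tau_x = np.array([3.1, 2.9, 3.4, 3.0])             # intercepts, tau_x
kappa = np.array([0.0])                            # latent mean, kappa

# Model-implied covariance matrix: Sigma = Lambda Phi Lambda' + Theta_delta
Sigma = Lambda_x @ Phi @ Lambda_x.T + Theta_delta

# Model-implied means: E(X) = tau_x + Lambda kappa, which reduces to tau_x
# under the typical assumption kappa = 0
mu = tau_x + (Lambda_x @ kappa)

print(np.round(Sigma, 3))
print(mu)
```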
Albeit with some variations,2 the general approach followed in practice is that if the null hypothesis, H0 : Σ^(1) = Σ^(2) = ··· = Σ^(G) for G groups, is rejected, a set of nested tests that proceed from least to most restrictive is conducted. Typically, an investigator begins by testing a scale across populations for configural invariance (Horn et al., 1983; Horn & McArdle, 1992), also referred to as a test of ‘‘same form’’ (Bollen, 1989) or the ‘‘practical scientist’s’’ invariance (Horn & McArdle, 1992). Accepting the hypothesis of configural invariance (via a chi-square test of model fit with appropriate degrees of freedom) provides evidence that the same number of latent variables (ξ) with the same pattern of factor loadings (Λ_x), intercepts (τ_x), and measurement errors (δ) underlies a set of indicators. Typically, the second test in the hierarchy is one of metric invariance or, equivalently, weak factorial invariance (Meredith, 1993). The null hypothesis for examining metric invariance is H0 : Λ_x^(1) = Λ_x^(2) = ··· = Λ_x^(G). In other words, the pattern and value of the salient factor loadings (Horn et al., 1983) should be statistically equal across populations. The traditional test is a chi-square difference test with degrees of freedom equal to the number of imposed parameter constraints. Retaining the hypothesis of metric invariance suggests that the strength of the relationship between the latent variable(s) and the observed variables is the same across populations. In practice, the third test in the hierarchy is often one of scalar invariance or strong factorial invariance (Meredith, 1993).
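The chi-square difference test used between adjacent models in this hierarchy can be sketched as follows; the function name and the illustrative fit statistics are ours, and only the scipy call is standard:

```python
from scipy.stats import chi2

def chisq_difference_test(chisq_restricted, df_restricted, chisq_free, df_free):
    """Chi-square difference (likelihood-ratio) test for nested MG-CFA
    models; the df equal the number of newly imposed parameter constraints."""
    d_chisq = chisq_restricted - chisq_free
    d_df = df_restricted - df_free
    p_value = chi2.sf(d_chisq, d_df)  # upper-tail probability
    return d_chisq, d_df, p_value

# Hypothetical fit statistics: configural (free) vs. metric (restricted) model
d_chisq, d_df, p = chisq_difference_test(310.2, 120, 250.7, 96)
```

A small p-value here would lead an investigator to reject the more restrictive level of invariance.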
Table 1. Number of Participating Groups in Recent International Education Surveys.

Study       Groups  Measurement invariance investigated?
TIMSS 2011  63      No
PIRLS 2011  49      No
PISA 2012   64      Yes, but no reported results at the time of writing.
SITES 2006  20      No
ICCS 2009   38      No
TALIS 2008  24      Yes

Note. TIMSS = Trends in International Mathematics and Science Study; PIRLS = Progress in International Reading Literacy Study; PISA = Programme for International Student Assessment; SITES = Second Information Technology in Education Study; ICCS = International Civic and Citizenship Education Study; TALIS = Teaching and Learning International Survey.
To provide further empirical context for the current article, the number of groups
participating in several of the most current international education surveys is included
in Table 1. We also note that TALIS 2008 is currently the only study with reported
empirical results of measurement invariance analyses. As such, we use these results
as the foundation for our investigation. To that end, TALIS analysts evaluated several
scales for measurement invariance; although the hypotheses of configural and metric invariance were supported (OECD, 2010a), scalar invariance was generally not supported. To get a sense of the degree of difference in TALIS parameter estimates across countries, we fit configural invariance models to two arbitrarily chosen teacher scales (Classroom Disciplinary Climate, with four items, and Structuring Teacher Practices, with five items). In both cases, we adhered to current operational
procedures and assumed that the observed variables were normally distributed.
Descriptive statistics for the intercept and slope estimates on both scales across the
23 countries can be found in Table 2, where we can see large differences in both
slopes and intercepts. Differences are particularly notable on the Structuring Teacher
Practices scale, where the range of the slopes across countries and items is about 2
and the intercept differences range from 3.30 to 3.94, depending on the item. Given
that these items have five response options (from 1 to 5), cross-country differences of
this magnitude in intercepts and slopes are quite substantial.
Method
Study Design
We used a simulation study in order to address our research question as it related to
the performance of MG-CFA and associated fit criteria when the number of groups
was relatively large and sample sizes within each group were varied. To more closely
mimic actual data and operational procedures, data were simulated as ordered catego-
rical but analyzed with a normal model. The process used to select population item-
and person-parameter values is discussed subsequently. To obtain reasonable values
for our population parameters, we selected the same two scales described above
Table 2. Descriptive Statistics From Normal MG-CFA Models Fit to Two Selected TALIS
Scales Across 23 Countries.
Note. MG-CFA = Multiple-group confirmatory factor analysis; TALIS = Teaching and Learning
International Survey.
above the estimated parameter mean. For example, noninvariant slopes were selected from N(1.2556 + √0.506, 0.506). The number, nature, and assignment of noninvariant parameters are discussed subsequently.
Latent variable means were sampled once for each group from N(0.223, 0.436),
which helped to ensure meaningful between-group differences. We sampled latent
variable variances for each group from a uniform distribution (U[0.5, 2.5]) because our originally chosen distribution, χ²(1), resulted in latent variable variances that were overly small, which caused a number of numerical difficulties during simulation.
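Under the generating scheme just described, the per-group latent parameters could be drawn as in the following sketch; the seed is arbitrary, and we read N(0.223, 0.436) as mean and variance, consistent with the notation used for the slope distributions:

```python
import numpy as np

rng = np.random.default_rng(20)  # arbitrary seed for reproducibility
n_groups = 20

# One latent mean per group from N(0.223, 0.436), treating 0.436 as a variance
latent_means = rng.normal(loc=0.223, scale=np.sqrt(0.436), size=n_groups)

# One latent variance per group from U[0.5, 2.5], avoiding the overly small
# variances produced by the originally chosen chi-square(1) distribution
latent_variances = rng.uniform(0.5, 2.5, size=n_groups)
```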
We used these generating distributions to create a number of conditions,
where we manipulated several factors, including the length of the scale, the number
of noninvariant items, the nature of noninvariance, and the number of groups. For
each simulated condition, the population model was assumed to be unidimensional,
which is consistent with the approach used by the OECD for developing noncognitive
scales (OECD, 2010a). Furthermore, to create reasonable conditions for examining
the performance of typical multiple-group models and associated evaluation criteria,
the models were considered to be correctly specified for all groups under consider-
ation. In other words, the hypothesis of configural invariance should hold across all
groups. A description and rationale for each of these study factors is provided below.
Number of groups. The number of groups examined in the study was set at 10 or 20.
Recall that one of the purposes of the study was to examine measurement invariance
criteria in a relatively large group setting. Based on this motivation, these group sizes
more closely approximate the operational contexts of large-scale surveys, where
group sizes typically range from around a dozen, particularly in the case of field
trials, to 60 or more countries for the main survey in larger studies. For our particular
analysis, group sizes of 10 and 20 are similar to the TALIS context, which is the only
study that currently has published results associated with establishing measurement
invariance for noncognitive scales (OECD, 2010a).
Scale length. The number of items per scale considered in the study was set at five or
six. This scale length is typical of scales in studies such as TALIS, where most
noncognitive scales (e.g., teacher beliefs about teaching profession) are composed of
just a few items (OECD, 2010a).
Noninvariant items. The number of items affected by noninvariance was zero, two, or
three, regardless of the length of the scale. In five-item scales with noninvariant items
this implied that either 40% or 60% of the items were affected by cross-group para-
meter noninvariance. In the six-item scales, this meant that 33% or 50% of the items
were noninvariant.
Analysis
Data generation and analysis. Data were simulated in Mplus 6.12 (Muthén & Muthén,
1998-2010) according to an ordered categorical model with parameters as previously
described (see Table 3 for distribution means and variances for population para-
meters). To set the scale and origin of the latent variable for multiple-groups data
generation we used the methods described in Millsap and Yun-Tein (2004). In line
with operational procedures in TALIS, the data were subsequently analyzed under an
assumption of normality in the observed variables. The origin and scale of the latent variable were set by fixing the first group's latent variable mean to 0 and its variance to 1, while the first factor loading in each group was fixed at 1.
Although this is a meaningful misspecification and can result in interpretation prob-
lems (Lubke & Muthén, 2004), it is the current method of choice, if investigations of
measurement invariance are done at all in international educational assessments. Our
study design yielded a total of 28 conditions (2 group sizes × 2 scale lengths × 2 noninvariant item sets × 3 sources of noninvariance + 4 fully invariant conditions), and each condition was replicated 500 times.
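The condition count can be verified by enumerating the crossed factors; the factor labels below are ours:

```python
from itertools import product

group_counts = [10, 20]
scale_lengths = [5, 6]
noninvariant_item_counts = [2, 3]
noninvariance_sources = ["slopes", "thresholds", "slopes_and_thresholds"]

# Crossed noninvariance conditions: 2 x 2 x 2 x 3 = 24
noninvariant_conditions = list(product(
    group_counts, scale_lengths, noninvariant_item_counts, noninvariance_sources))

# One fully invariant condition per group-count x scale-length cell: 4
fully_invariant_conditions = list(product(group_counts, scale_lengths))

print(len(noninvariant_conditions) + len(fully_invariant_conditions))  # 28
```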
Results
In the following section, we present the results for the five-item scale (overall fit at
each assumed level of invariance followed by relative fit for metric and scalar
invariance). We then present the results in a similar fashion for the six-item scale.
Results for the fit measures are reported as averages across the 500 replications
within each respective condition along with standard deviations to indicate the
degree to which results varied across replications.
Five-Item Conditions
Overall fit. As shown in Table 4, Panel A, the average chi-square values under config-
ural invariance were very large relative to the respective degrees of freedom across
each condition, suggesting that the chi-square test detected misfit at configural invar-
iance in all conditions. Average chi-square values for overall fit under an assumption
of metric invariance were predictable in that the results supported further deteriora-
tion in model fit for all conditions, including when data were generated as fully invar-
iant. We also observed that the chi-square test statistics were consistently higher in
the 20-group conditions than in equivalent 10-group conditions. As discussed in the
‘‘Method’’ section, the data were simulated as ordered categorical but the models
assumed normality of observed variables, which suggested here that the chi-square
had very high power to detect this type of model misspecification.
Although the chi-square test results point to misspecification in all conditions that
assumed configural invariance, the fit indices, namely the root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker–Lewis index (TLI), and standardized root mean square residual (SRMR), yielded better results in all invariance conditions and group sizes. In particu-
lar, the CFI and TLI were all larger than .980 and the SRMR never exceeded .020 in
all conditions, suggesting acceptable model fit. The RMSEA, on the other hand,
yielded values in the range of .059 to .073, which is slightly higher than the typical
.050 cutoff.
As the level of assumed invariance was increased from configural to metric and
from metric to scalar invariance, as expected, the fit indices yielded poorer model fit
(see Table 5). For example, when the slopes were assumed invariant, fit indices
yielded lower CFI and TLI values (ranging between .840 and .900, in Conditions 3
through 6 and 11 through 14) and higher SRMR values and RMSEA values greater
than .20 when metric invariance was examined, whereas in fully invariant conditions and conditions with threshold noninvariance (Conditions 1, 2, and 7 through 10), fit
indices remained at acceptable levels, although the RMSEA values suggested only
marginal fit. Similarly, expected patterns were found when scalar-invariant models
were fit to the data in all conditions. In particular, indices suggested acceptable fit
only in fully invariant conditions, particularly when the number of groups was
10, with marginal RMSEA values. In addition, the number of groups had differential
impacts across the fit indices, although in general, 10-group conditions had more
favorable outcomes than their 20-group counterparts. For example, when data were
generated with two noninvariant slopes and thresholds, all fit indices suggested better
model fit in the 10-group condition compared with the 20-group condition. In con-
trast, when two thresholds were modeled as noninvariant, all fit indices except for
the SRMR slightly favored the 20-group over the 10-group condition in terms of bet-
ter fit.
Table 4. Average Fit Values for Configural Invariance (Overall Fit) in Five-Item (Panel A) and Six-Item (Panel B) Conditions for 10 and 20
Groups.
Table 4. (continued)

Condition (n noninvariant)     χ² (10 / 20 groups)                    df (10 / 20)  RMSEA (10 / 20)            CFI (10 / 20)               TLI (10 / 20)              SRMR (10 / 20)
Slopes (2)                     1,019.45 (69.79) / 2,789.49 (132.26)   91 / 181      .059 (.002) / .066 (.002)  .994 (.001) / .992 (<.001)  .990 (.001) / .987 (.001)  .020 (.002) / .017 (.001)
Slopes (3)                     1,013.81 (69.45) / 2,606.78 (131.14)   91 / 181      .058 (.002) / .064 (.002)  .994 (.001) / .993 (.001)   .990 (.001) / .988 (.001)  .019 (.002) / .017 (.001)
Intercepts (2)                 1,111.42 (72.82) / 3,038.99 (141.35)   91 / 181      .061 (.002) / .069 (.002)  .994 (.001) / .992 (<.001)  .989 (.001) / .986 (.001)  .018 (.002) / .018 (.001)
Intercepts (3)                 1,186.49 (76.03) / 2,983.74 (134.95)   91 / 181      .064 (.002) / .068 (.002)  .993 (.001) / .992 (<.001)  .989 (.001) / .986 (.001)  .019 (.002) / .017 (.001)
Intercepts and thresholds (2)  1,352.44 (83.27) / 3,964.11 (165.43)   91 / 181      .068 (.002) / .079 (.002)  .992 (.001) / .988 (.001)   .986 (.001) / .981 (.001)  .019 (.002) / .019 (.001)
Intercepts and thresholds (3)  1,185.17 (75.28) / 3,649.52 (154.49)   91 / 181      .064 (.002) / .076 (.002)  .992 (.001) / .989 (.001)   .988 (.001) / .982 (.001)  .019 (.002) / .018 (.001)

Note. Standard deviations across replications in parentheses. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index.
Table 5. Overall Average Fit Values for Metric (Panel A) and Scalar (Panel B) Invariance in Five-Item Conditions.
Condition  Level  Nj  χ² (SD)  df  RMSEA (SD)  CFI (SD)  TLI (SD)  SRMR (SD)
11 2_S&T 10 25,356.78 (306.56) 87 .263 (.002) .784 (.003) .825 (.002) .590 (.014)
12 2_S&T 20 64,718.93 (444.94) 177 .278 (.001) .748 (.002) .801 (.002) .860 (.012)
13 3_S&T 10 22,452.27 (282.97) 87 .248 (.002) .802 (.003) .839 (.002) .213 (.002)
14 3_S&T 20 57,649.78 (454.00) 177 .262 (.001) .769 (.002) .818 (.002) .213 (.002)
Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index. Under the Level heading, 0, 2, and 3 correspond to the number of noninvariant items; N = none, T = thresholds, S = slopes, S&T = slopes and thresholds; Nj = number of groups.
Relative fit. As shown in Table 6, the chi-square difference test was statistically sig-
nificant, suggesting rejection of both metric and scalar invariance, for all conditions,
including both fully-invariant conditions. In addition, and as expected, an increase in
the number of groups (from 10 to 20) resulted in consistently larger chi-square differ-
ence test statistics for both metric and scalar invariance models across all studied
conditions.
With respect to the relative fit indices, several interesting findings emerged.
Overall, fit indices performed well at examining hypotheses of metric invariance.
For example, ΔCFI supported metric-invariant hypotheses for 10-group data that were compatible with the models (Conditions 1, 7, and 9), whereas all conditions with noninvariant slopes yielded values much greater in magnitude than the −.010 cutoff. In the equivalent 20-group conditions, ΔCFI values were close to or somewhat greater than the −.010 cutoff value (−.010, −.014, and −.014 for Conditions 2, 8, and 10, respectively). This finding provides some evidence that for larger numbers of groups, a more liberal criterion for establishing measurement invariance might be warranted. Findings associated with the ΔRMSEA indicated good performance for the 10-group fully invariant condition (Condition 1) and the two-item scalar noninvariant condition (Condition 7). In both situations, the ΔRMSEA values were below the generally accepted .010 criterion. In the 20-group conditions, where data were either fully invariant (Condition 2) or scalar noninvariant (Conditions 8 and 10), the ΔRMSEA indices ranged from .022 to .026, also suggesting that a more liberal criterion is warranted in situations with a larger number of groups. Furthermore, in the 10-group condition where three items were scalar noninvariant, the average ΔRMSEA value was .022.
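A decision rule based on these relative fit indices might be sketched as follows; the function and the adjustable cutoffs are ours, with defaults set at the conventional criteria of a CFI drop no greater than .010 and an RMSEA increase no greater than .010:

```python
def supports_invariance(cfi_restricted, cfi_free, rmsea_restricted, rmsea_free,
                        dcfi_cutoff=-0.010, drmsea_cutoff=0.010):
    """Return True if the change in CFI and RMSEA from the less to the more
    restrictive model stays within the supplied cutoffs."""
    dcfi = cfi_restricted - cfi_free        # negative when fit worsens
    drmsea = rmsea_restricted - rmsea_free  # positive when fit worsens
    return dcfi >= dcfi_cutoff and drmsea <= drmsea_cutoff

# Hypothetical values: a small decrement passes, a large one fails
print(supports_invariance(0.992, 0.994, 0.066, 0.059))  # True
print(supports_invariance(0.850, 0.990, 0.250, 0.065))  # False
```

Our results suggest that the default cutoffs would need to be relaxed when the number of groups is large.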
In examining relative model fit for scalar invariance (i.e., further imposing equal-
ity constraints on the model parameters), fit indices were largely effective at cor-
rectly identifying fully invariant data and also at discriminating between metric- and
scalar-invariant data. Specifically, the average DRMSEA and DCFI values were
mostly within typically accepted criteria for both 10- and 20-group, fully invariant
data (Conditions 1 and 2), suggesting that the hypothesis of scalar invariance would
be supported for both these conditions, assuming that a metric invariance hypothesis
was supported in a previous step. Additionally, data generated to have only noninvariant thresholds had associated relative fit indices that were outside generally acceptable cutoff values for the 10- and 20-group cases (Conditions 7, 8, 9, and 10). Both of these findings provide some evidence that typically accepted cutoffs for relative fit indices were reasonably well suited for retaining a hypothesis of fully invariant data and for determining whether metric-invariant data were additionally scalar invariant.
As Table 6 shows, in several instances, additional constraints yielded average rela-
tive fit index values that would support a scalar invariance hypothesis for data that
were not commensurate with this assumption. In general, however, this finding is not
consequential as long as evidence against metric invariance was observed in a previ-
ous step. For example, in Condition 3, where two slopes were noninvariant, average
ΔCFI reported was −.006. Similarly, in Condition 6, where three slopes were
Table 6. Relative Fit Tests Results for Five-Item Conditions.
                                                              Metric invariance                                Scalar invariance
Condition  Invariance location    No. noninvariant  Groups    χ² difference (SD)  df  ΔRMSEA (SD)  ΔCFI (SD)   χ² difference (SD)  df  ΔRMSEA (SD)  ΔCFI (SD)
12         Slopes and thresholds  2                 20        29,932.41 (338.82)  76  .155 (.002)  −.117 (.001)  32,626.84 (383.26)  76  .044 (.001)  −.127 (.002)
13         Slopes and thresholds  3                 10        13,481.24 (221.79)  36  .174 (.003)  −.119 (.002)  8,396.81 (169.70)   36  .015 (.001)  −.074 (.002)
14         Slopes and thresholds  3                 20        25,119.49 (329.46)  76  .141 (.003)  −.101 (.001)  30,665.01 (341.58)  76  .048 (.001)  −.123 (.002)
Note. Standard deviations across replications in parentheses. CFI = comparative fit index; RMSEA = root mean square error of approximation.
noninvariant, the average ΔRMSEA reported was .001. But in both conditions, the fit
indices associated with a test for metric invariance were well outside acceptable cri-
teria, rendering a further test of scalar invariance unnecessary.
Six-Item Conditions
Overall fit. Under the assumption of configural invariance (Table 4, Panel B), average chi-square values were very large relative to the degrees of freedom and highly statistically significant for each condition. This suggests that, according to the chi-square test of model fit, none of these data meet minimum criteria for configural invariance. According to Table 7, Panel A, chi-square values for overall fit
under an assumption of metric invariance are predictable, in that imposing restrictions
on the data resulted in a decrement in fit under all conditions, even for data that are
fully invariant. Similarly, when a model that assumed scalar invariance was fit to data
for all conditions (Table 7, Panel B), model fit declined, according to the chi-square
test. Furthermore, the chi-square test exhibited typically expected performance in that for data generated under equivalent conditions (e.g., full invariance), the test statistics were consistently higher when the number of groups was greater (i.e., 20 compared with 10). Similar to the five-item conditions, these findings suggest that the chi-
square has very high power to detect this sort of model misspecification. And, given
that in each condition the simulated data are commensurate with an assumption of
configural invariance,3 the chi-square test will generally not be well suited for evalu-
ating the hypothesis of configural invariance in this sort of situation.
Despite the finding that the chi-square test of model fit rejects all models under
the assumption of configural invariance, the fit indices, including the RMSEA, CFI,
TLI, and SRMR performed much better in all conditions and group sizes. In particu-
lar, the CFI and TLI were all larger than .98 and the SRMR was not larger than .02,
suggesting that these indices support the hypothesis of configural invariance. One exception to this trend was the RMSEA, which never averaged below the typically accepted cutoff of .05. Instead, under all conditions, this index varied between .058 and .079, indicating less-than-optimal model fit. And, as expected, further restrictions
on the models resulted in deteriorations in these indices. Furthermore, violations of
hypothesized levels of invariance were easily detected by all fit indices. As an exam-
ple, data that were simulated such that three slopes and three sets of thresholds varied
across the groups had associated overall fit indices that fell well outside any conven-
tionally accepted criteria in both the 10- and 20-group conditions when a model that
assumed metric invariance was fit to the data. In contrast, data that were commensu-
rate with a more stringent level of invariance than the level under consideration only
experienced slight deteriorations in model fit (e.g., a metric-invariant model fit to
fully invariant data). One notable finding here is that when only the number of groups
differed but all other parameters were equal across a pair of conditions, the SRMR
tended to be larger for the 20-group condition and this was particularly pronounced
for models that assumed scalar invariance. This finding is somewhat surprising given
that this index is not explicitly a function of sample size (Bentler, 1995).
Table 7. Overall Average Fit Values for Metric (Panel A) and Scalar (Panel B) Invariance in Six-Item Conditions.
Condition  Level  Nj  χ² (SD)  df  RMSEA (SD)  CFI (SD)  TLI (SD)  SRMR (SD)
11 2_S&T 10 26,491.31 (304.51) 181 .222 (.001) .824 (.002) .854 (.002) .541 (.013)
12 2_S&T 20 6,970.93 (478.74) 371 .238 (.001) .787 (.002) .828 (.001) .789 (.010)
13 3_S&T 10 32,195.82 (402.82) 181 .244 (.002) .778 (.003) .816 (.002) .277 (.004)
14 3_S&T 20 80,978.64 (563.38) 371 .256 (.001) .747 (.002) .795 (.002) .288 (.005)
Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index. Under the Level heading, 0, 2, and 3 correspond to the number of noninvariant items; N = none, T = thresholds, S = slopes, S&T = slopes and thresholds; Nj = number of groups.
Relative fit. Based on the results presented in Table 8, the chi-square difference test is
strongly statistically significant, suggesting rejection of both metric and scalar invar-
iance hypotheses, for all conditions, including for fully invariant data. And, as
expected, the test statistic values were larger when only group sizes differed across
two comparable conditions (e.g., when two slopes are metric noninvariant). Similar
to the findings of overall fit, the chi-square difference test is overly powerful under
the conditions examined and is likely not a useful measure for examining measure-
ment invariance under these conditions.
A number of interesting findings emerged with respect to the relative fit indices, which we discuss here. Based on the six-item conditions examined, the
DRMSEA generally performed very well at examining hypotheses of metric invar-
iance in the 10-group condition. In particular, fully invariant (Condition 1) and scalar
noninvariant data (Conditions 7 and 9) had associated DRMSEA well below the .010
criteria. And all conditions with noninvariant slopes had fit indices larger than .10
(more than 10 times typical cutoff values). In contrast, the DRMSEA values for 20-
group conditions where the data were commensurate with a hypothesis of equal
slopes (Conditions 2, 8, and 10) were all larger than .025, suggesting that a more lib-
eral criterion might be warranted with larger numbers of groups. Similarly, the DCFI
supported metric-invariant hypotheses of 10-group data that were commensurate
with the models (Conditions 1, 7, and 9), whereas all conditions with noninvariant
slopes had relative fit index values much greater in magnitude than the 2.010 cutoff.
And similar to the DRMSEA measure, the DCFI values for commensurate 20-group
data (Conditions 2, 8, and 10) were somewhat over the 2.010 cutoff. By condition,
average values were 2.013, 2.016, and 2.017, respectively. This again indicates
that a more liberal criterion is warranted for larger numbers of groups.
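The decision logic just described can be sketched as a simple screening rule. This is a hypothetical helper rather than software used in the study; the liberalized 20-group ΔRMSEA bound of .030 is an assumed illustrative value motivated by the .025+ averages we observed, not a validated cutoff.

```python
def metric_invariance_screen(cfi_metric, cfi_config,
                             rmsea_metric, rmsea_config, n_groups):
    """Screen a metric-invariance hypothesis with change-in-fit indices.

    Conventional cutoffs: reject if dCFI < -.010 or dRMSEA > .010.
    With many groups a more liberal dRMSEA bound is applied (the .030
    value is an assumption for illustration only).
    """
    delta_cfi = cfi_metric - cfi_config        # negative when fit worsens
    delta_rmsea = rmsea_metric - rmsea_config  # positive when fit worsens
    rmsea_bound = 0.030 if n_groups >= 20 else 0.010
    supported = (delta_cfi >= -0.010) and (delta_rmsea <= rmsea_bound)
    return delta_cfi, delta_rmsea, supported
```

With the (made-up) values `metric_invariance_screen(0.950, 0.958, 0.062, 0.057, 10)`, ΔCFI = −.008 and ΔRMSEA = .005, so the metric hypothesis is retained; grossly noninvariant slopes would drive both deltas past their bounds.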
Adding equality constraints on the model parameters to examine the hypothesis of scalar invariance resulted in fit indices that were largely effective at identifying fully invariant data and at discriminating between metric- and scalar-invariant data. Specifically, the average ΔRMSEA and ΔCFI values were well within typically accepted criteria for both 10- and 20-group fully invariant data, suggesting that the hypothesis of scalar invariance would be supported for both of these conditions, assuming that a metric invariance hypothesis was supported in a previous step. Furthermore, data generated to have only noninvariant thresholds had associated relative fit indices well outside generally accepted cutoff values for both group sizes (Conditions 7, 8, 9, and 10). As with the five-item conditions, these results offer some evidence that generally accepted cutoffs for both relative fit indices are reasonably well suited for retaining fully invariant data and for determining whether metric-invariant data are additionally scalar invariant. And although in some conditions additional constraints resulted in average relative fit index values that would generally provide evidence in favor of the scalar invariance hypothesis, this is of little consequence as long as evidence against metric invariance was heeded in a previous analytic step.
Examples of this include the average ΔCFI value for Condition 3, where two slopes are noninvariant (−.005), and the average ΔRMSEA value where three slopes are noninvariant (.006).

Table 8. Relative Fit Test Results for Six-Item Conditions.

| Condition | Invariance location | No. of noninvariant items | Groups | Metric χ² diff. (SD) | df | ΔRMSEA (SD) | ΔCFI (SD) | Scalar χ² diff. (SD) | df | ΔRMSEA (SD) | ΔCFI (SD) |
| 12 | Slopes and thresholds | 2 | 20 | 31,959.44 (39.57) | 95 | .118 (.002) | −.098 (.001) | 33,777.39 (407.23) | 95 | .040 (.001) | −.103 (.001) |
| 13 | Slopes and thresholds | 3 | 10 | 18,891.13 (279.75) | 45 | .159 (.003) | −.131 (.002) | 12,119.52 (239.89) | 45 | .022 (.001) | −.084 (.002) |
| 14 | Slopes and thresholds | 3 | 20 | 34,691.26 (421.90) | 95 | .128 (.002) | −.109 (.001) | 42,637.86 (435.46) | 95 | .052 (.001) | −.134 (.002) |

Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual.

We note one seemingly odd finding: the ΔRMSEA is negative (albeit small) for some conditions. Given that the RMSEA is a function of the
ratio of the chi-square statistic to the degrees of freedom for the model under consideration (Bentler, 1995; Hu & Bentler, 1999), this finding can be explained by the fact that this ratio was, on average, smaller for the scalar-invariant models due to a relatively large increase in the degrees of freedom obtained by imposing equality constraints on the intercepts (1975.61/181 = 5.38 compared with 1583.61/136 = 12.12, respectively). And given that the chi-square difference test and ΔCFI are in the expected directions, this finding should be interpreted as within the acceptable cutoff; it simply points to conditions under which this measure can be slightly negative.
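To see how the ΔRMSEA can dip below zero, recall the RMSEA point estimate as a function of the chi-square/df ratio. The sketch below uses the common single-sample formula with an assumed total sample size of 2,000 (multiple-group software may rescale by the number of groups) and chi-square/df pairs patterned on those cited above; it is illustrative, not a recomputation of the study's results.

```python
import math

def rmsea_estimate(chi2_stat, df, n):
    """RMSEA point estimate: sqrt(max(0, (chi2/df - 1) / (n - 1))).
    Common single-sample formula; multiple-group programs may
    additionally rescale by the number of groups."""
    return math.sqrt(max(0.0, (chi2_stat / df - 1.0) / (n - 1.0)))

# Adding intercept constraints raises df faster than the chi-square here,
# so the chi-square/df ratio falls and the RMSEA shrinks slightly.
# n = 2000 is an assumed value for illustration.
metric_rmsea = rmsea_estimate(1583.61, 136, 2000)
scalar_rmsea = rmsea_estimate(1975.61, 181, 2000)
delta_rmsea = scalar_rmsea - metric_rmsea   # slightly negative
```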
Discussion
In making cross-national comparisons, there is ambiguity regarding whether differences in scale score means can be attributed to authentic differences between countries or to cross-country measurement differences arising from cultural response biases, translation errors, or cultural differences in understanding the underlying construct. Thus, without evidence to support measurement equivalence, any claims or
conclusions regarding comparative differences are necessarily weak (Horn, 1991;
Vandenberg & Lance, 2000). One means for investigating the comparability of scale
scores is through investigation of measurement invariance; however, little is known
about the performance of this method and typically used fit measures when the num-
ber of groups is large and the sample sizes are large and varied. As such, the current
article provides some evidence regarding the performance of MG-CFA in a large-
scale assessment context.
To provide some insights into the degree to which typically accepted MG-CFA
model-to-data fit measures are applicable in contexts where the number of groups is
large and sample sizes are varied, we conducted a simulation study that mimics
operational procedures currently used by the OECD, an international organization
involved in educational surveys and evaluation. To that end, we generated multiple-
group measurement models that drew on an empirical analysis of two TALIS scales
to arrive at plausible item- and person-parameters. We examined several factors in
this study including two different scale lengths (five and six items), different sources
of noninvariance (in slopes, thresholds, and both slopes and thresholds), and two dif-
ferent numbers of groups (10 and 20). To further align our study with authentic con-
ditions, we simulated unidimensional ordered categorical data; however, in line with
operational OECD procedures, we applied normal models to these data. We then
evaluated several widely accepted measures of model fit in the multiple-group con-
text. We subsequently summarize several interesting findings along with implica-
tions for applied researchers and future areas of research.
Before proceeding to a more detailed discussion, we draw attention to one important point: we recognize that the generating models (ordered categorical) and the analytic models (normal) are inconsistent and that this choice is theoretically not best practice. However, given the current state of practice, this is a purposeful choice that reflects methods currently used by agencies that conduct investigations of measurement invariance, when they are conducted at all.
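The mismatch described above can be made concrete: the generating model produces ordered categories by cutting a normal latent response at thresholds, while the analytic model treats the resulting category codes as normal outcomes. Below is a minimal sketch of the generating side only; the loading and threshold values are placeholders, whereas in the study they came from the empirical TALIS calibration.

```python
import random

def ordinal_response(loading, thresholds, rng):
    """Generate one ordered-categorical item response from a
    unidimensional normal factor model: a continuous latent response
    y* = loading * eta + e is cut at ascending thresholds, yielding a
    category code in 0..len(thresholds)."""
    eta = rng.gauss(0.0, 1.0)                    # person's factor score
    y_star = loading * eta + rng.gauss(0.0, 1.0) # latent response + error
    return sum(y_star > t for t in thresholds)   # count thresholds exceeded

rng = random.Random(2015)
# Placeholder parameters, not the TALIS-derived values used in the study.
responses = [ordinal_response(0.8, [-1.0, 0.0, 1.0], rng) for _ in range(500)]
```

Fitting a normal-theory CFA to `responses` as if they were continuous reproduces the purposeful model-data disjuncture discussed above.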
Under the studied conditions, our results provide some evidence that the chi-square test is likely not useful as a test of overall model fit, as this test strongly suggested model misfit across all studied conditions. For the same reasons, the chi-square difference test was also not a suitable measure in this context. These findings are not surprising, given that the studied conditions included large numbers of groups and large sample sizes, and that we began with a theoretically misspecified model (one that assumes normality when the data are ordered categorical). Despite this result, we note that the chi-square test statistics (overall and relative) behaved as expected across all conditions, in that these values were generally larger for larger numbers of groups and as more constraints were placed on the models. Our results also support earlier findings (Cheung & Rensvold, 2002; Meade et al., 2008) that alternative fit indices are often preferable to chi-square tests of model fit in the multiple-group context.
Given the results of the chi-square test statistics, the following discussion focuses on the utility of the overall and relative fit indices in the studied conditions. Of the fit indices considered, our findings suggest that currently accepted cutoffs for the CFI, TLI, and SRMR are generally suitable. In particular, for overall fit evaluations of configural, metric, and scalar invariance, commensurate models fit to data were overwhelmingly supported by CFI greater than .95 and TLI greater than .95. In fact, under every condition studied, neither of these statistics was less than .97 when the model was consistent with the data, providing evidence that a slightly more stringent cutoff would be reasonable in this context. In most cases, an SRMR of less than .08 was also suitable for determining overall fit at different levels of invariance; however, a few exceptions existed. For example, in the 20-group, six-item, fully invariant data condition, a metric-invariant model resulted in an SRMR of .104, and a few other conditions produced similar results. On first inspection, this might suggest that a more liberal SRMR criterion should be considered (e.g., .10 or .11); however, in several instances, hypothesized models that did not correspond to the data also resulted in SRMR values of around .10. As such, we recommend that the SRMR not be used in isolation, if it is used at all. Rather, it should be used in conjunction with the CFI and TLI, and where inconsistencies among these measures arise, the analyst is better served by relying on the CFI and TLI. In contrast, our findings lead us to recommend a slightly more liberal criterion for the RMSEA, particularly when the number of groups is relatively large. In particular, we found several instances where a model-to-data match did not meet the minimum RMSEA criterion, a finding that was more pronounced in the 20-group conditions; the RMSEA was nonetheless highly sensitive to model-to-data mismatches. This leads us to suggest an RMSEA cutoff of around .10 when there are at least 10 groups.
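Collecting these recommendations, one could sketch an overall-fit screen as below. The function name and exact bounds are our illustrative choices (the .06 RMSEA bound for small numbers of groups is a conventional assumed value, not from this study), and the SRMR is deliberately treated as advisory only.

```python
def overall_fit_screen(cfi, tli, rmsea, srmr, n_groups):
    """Screen overall MG-CFA fit per the recommendations above:
    rely on CFI/TLI (> .95), allow RMSEA up to about .10 when there
    are at least 10 groups, and treat SRMR as advisory only."""
    rmsea_bound = 0.10 if n_groups >= 10 else 0.06  # .06: conventional value
    fit_ok = cfi > 0.95 and tli > 0.95 and rmsea < rmsea_bound
    srmr_warning = srmr > 0.08   # do not reject on the SRMR alone
    return fit_ok, srmr_warning
```

For instance, a 20-group model with CFI = .97, TLI = .97, RMSEA = .09, and SRMR = .104 would pass the core screen while flagging the SRMR, mirroring the fully invariant six-item condition discussed above.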
In terms of the relative fit indices (ΔCFI and ΔRMSEA), we note several recommendations based on our findings. Regardless of scale length, we observed a
traditional cutoffs are suitable for determining scalar invariance. We also recommend
that the overall SRMR be used in conjunction with other measures or not at all for
determining the overall plausibility of configural, metric, and scalar invariance.
Finally, we note that our article was concerned only with the detection, not the cause, of noninvariance. As such, we advocate for an approach whereby, once measurement noninvariance is detected at some level of the hierarchy, follow-up analyses are conducted to locate the source of noninvariance in the scale (e.g., in consultation with cultural studies experts and/or linguists to examine potential sources of variability).
Funding
The author(s) declared receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially funded by a contract from the Organisation for Economic Co-operation and Development.
Notes
1. This assumes the latent variable is scaled by fixing the first factor loading to 1, then
three loadings are free to differ across groups.
2. In the tradition of Jöreskog (1971), researchers might begin their measurement invariance investigation for G groups by first testing the most restrictive null hypothesis H0: Σ(1) = Σ(2) = ⋯ = Σ(G). Assuming this hypothesis is rejected, the researcher either proceeds by releasing constraints on this most restrictive model or by adding constraints to the least restrictive (configural invariance) model. In some situations (OECD, 2010a), the Jöreskog null hypothesis is not tested and the first test of invariance is instead configural.
3. Given that the data are generated as ordered categorical, it is more correct to say that
these data meet the criteria to serve as a baseline model (Millsap & Yun-Tein, 2004); how-
ever, to avoid confusion, we use the term configural invariance.
References
Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in
translated verbal items. Journal of Educational Measurement, 36, 185-198. doi:
10.1111/j.1745-3984.1999.tb00553.x
Bagozzi, R. P. (1977). Structural equation models in experimental research. Journal of
Marketing Research, 14, 209-226. doi:10.2307/3150471
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin,
107, 238-246. doi:10.1037/0033-2909.107.2.238
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate
Software.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606. doi:10.1037/0033-2909.88
.3.588
Bollen, K. (1989). Structural equations with latent variables. New York, NY: Wiley.
Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance.
Structural Equation Modeling, 14, 464-504. doi:10.1080/10705510701301834
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling, 9, 233-255. doi:
10.1207/S15328007SEM0902_5
De Beuckelaer, A., Lievens, F., & Swinnen, G. (2007). Measurement equivalence in the
conduct of a global organizational survey across countries in six cultural regions. Journal
of Occupational and Organizational Psychology, 80, 575-600.
Deshon, R. P. (2004). Measures are not invariant across groups without error variance
homogeneity. Psychology Science, 46, 137-149.
Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage
assessments. International Journal of Testing, 2, 199-215.
Ercikan, K. (2003). Are the English and French versions of the Third International
Mathematics and Science Study administered in Canada comparable? Effects of
adaptations. International Journal of Educational Policy, Research and Practice, 4, 55-76.
French, B. F., & Finch, W. H. (2006). Confirmatory factor analytic procedures for the
determination of measurement invariance. Structural Equation Modeling, 13, 378-402. doi:
10.1207/s15328007sem1303_3
Grisay, A., & Monseur, C. (2007). Measuring the equivalence of item difficulty in the various
versions of an international test. Studies in Educational Evaluation, 33, 69-86. doi:
10.1016/j.stueduc.2007.01.006
Hambleton, R. K. (2002). Adapting achievement tests into multiple languages for international
assessments. In A. C. Porter & A. Gamoran (Eds.), Methodological advances in cross-
national surveys of educational achievement (pp. 58-79). Washington, DC: National
Academies Press.
Hambleton, R. K., Merenda, P. F., & Spielberger, C. D. (Eds.). (2005). Adapting educational
and psychological tests for cross-cultural assessment. New York, NY: Psychology Press.
Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially biased test items: Comparison
of IRT area and Mantel-Haenszel methods. Applied Measurement in Education, 2, 313-334.
doi:10.1207/s15324818ame0204_4
Hancock, G. R. (1997). Structural equation modeling methods of hypothesis testing of latent
variable means. Measurement and Evaluation in Counseling and Development, 30, 91-105.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-
Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale,
NJ: Lawrence Erlbaum.
Horn, J. L. (1991). Comments on ‘‘Issues in factorial invariance.’’ In L. M. Collins & J. L.
Horn (Eds.), Best methods for the analysis of change (pp. 114-125). Washington, DC:
American Psychological Association.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement
invariance in aging research. Experimental Aging Research, 18, 117-144. doi:
10.1080/03610739208253916
Horn, J. L., McArdle, J., & Mason, R. (1983). When is invariance not invariant: A practical
scientist’s look at the ethereal concept of factor invariance. Southern Psychologist, 1,
179-188.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55. doi:
10.1080/10705519909540118
Jöreskog, K. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409-426. doi:10.1007/BF02291366
LeTendre, G. K. (2002). Advancements in conceptualizing and analyzing cultural effects in
cross-national studies of educational achievement. In A. C. Porter & A. Gamoran (Eds.),
Methodological advances in cross-national surveys of educational achievement (pp. 198-
228). Washington, DC: National Academies Press.
Little, T. D. (1997). Mean and Covariance Structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53-76. doi:10.1207/s15327906mbr3201_3
Lord, F. M. (1980). Applications of item response theory to practical testing problems.
Mahwah, NJ: Lawrence Erlbaum.
Lubke, G. H., & Dolan, C. V. (2003). Can unequal residual variances across groups mask differences in residual means in the common factor model? Structural Equation Modeling, 10, 175-192. doi:10.1207/S15328007SEM1002_1
Lubke, G. H., & Muthén, B. O. (2004). Applying multigroup confirmatory factor models for
continuous outcomes to Likert scale data complicates meaningful group comparisons.
Structural Equation Modeling, 11, 514-534. doi:10.1207/s15328007sem1104_2
Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit
indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568-592.
doi:10.1037/0021-9010.93.3.568
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and
confirmatory factor analytic methodologies for establishing measurement equivalence/
invariance. Organizational Research Methods, 7, 361-388. doi:10.1177/1094428104268027
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of
Educational Statistics, 7, 105-118.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance.
Psychometrika, 58, 525-543. doi:10.1007/BF02294825
Millsap, R. E., & Yun-Tein, J. (2004). Assessing factorial invariance in ordered-categorical
measures. Multivariate Behavioral Research, 39, 479-515. doi:10.1207/S15327906MBR3903_4
Muthén, L. K., & Muthén, B. O. (1998-2010). Mplus user’s guide (6th ed.). Los Angeles, CA:
Author. Retrieved from http://www.statmodel.com/download/usersguide/Mplus%20Users%
20Guide%20v6.pdf
Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale
comparability in international assessments. Psychological Test and Assessment Modeling,
53, 315-333.
Olson, J., Martin, M., & Mullis, I. V. S. (2008). TIMSS 2007 technical report. Chestnut Hill,
MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston
College.
Organisation for Economic Co-operation and Development. (2010a). TALIS technical report.
Paris, France: Author.
Organisation for Economic Co-operation and Development. (2010b). PISA 2009 technical
report. Paris, France: Author. Retrieved from http://www.oecd.org/edu/preschoolandschool/
programmeforinternationalstudentassessmentpisa/pisa2009technicalreport.htm
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and
implications. Human Resource Management Review, 18, 210-222. doi:
10.1016/j.hrmr.2008.03.003
Steiger, J. H., & Lind, J. C. (1980). Statistically based tests for the number of common factors.
Presented at the Meeting of the Psychometric Society, Iowa City, IA.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using
logistic regression procedures. Journal of Educational Measurement, 27, 361-370. doi:
10.1111/j.1745-3984.1990.tb00754.x
Thompson, M. S., & Green, S. B. (2006). Evaluating between-group differences in latent
variable means. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A
second course (1st ed., pp. 119-169). Charlotte, NC: Information Age.
Torsheim, T., Samdal, O., Rasmussen, M., Freeman, J., Griebler, R., & Dür, W. (2012). Cross-
national measurement invariance of the teacher and classmate support scale. Social
Indicators Research, 105, 145-160. doi:10.1007/s11205-010-9770-9
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement
invariance literature: Suggestions, practices, and recommendations for organizational
research. Organizational Research Methods, 3, 4-70. doi:10.1177/109442810031002