
Comprehensive Psychology
2015, Volume 4, Article 10
ISSN 2165-2228
DOI: 10.2466/03.CP.4.10
© Peter Prudon 2015, Attribution-NonCommercial-NoDerivs CC-BY-NC-ND
Received March 2, 2015; Accepted June 23, 2015; Published July 15, 2015

Confirmatory factor analysis as a tool in research using questionnaires: a critique¹,²

Peter Prudon
Independent Researcher in Psychology, Amsterdam, The Netherlands

Citation: Prudon, P. (2015) Confirmatory factor analysis as a tool in research using questionnaires: a critique. Comprehensive Psychology, 4, 10.

¹ Address correspondence to Peter Prudon, Sajetplein 31, 1091 DB Amsterdam, The Netherlands, or e-mail (p.prudon@hetnet.nl).
² The dissertation by Stuive (2007) stimulated and molded my interest in this topic. I received helpful comments on my thoughts about this issue from Lesley Hayduk on SEMNET and from Stanley Mulaik on and beyond SEMNET, June–July 2011. Herbert Marsh, Feinian Chen, and Ilse Stuive kindly granted me permission to reproduce and summarize some of their quantitative results. Thanks. I am grateful for the professional comments of peer reviewers, editors, and colleagues on forerunners of this paper. Among the latter were Paul Barrett, Steven Blinkhorn, and William Robert Nugent, who made stimulating comments, while Robert Brooks and Patrick Malone took the trouble of making useful critical comments.

Abstract
Predicting the factor structure of a test and comparing this with the factor structure empirically derived from the item scores is a powerful test of the content validity of the test items, the theory justifying the prediction, and the test's construct validity. For the last two decades, the preferred method for such testing has often been confirmatory factor analysis (CFA). CFA expresses the degree of discrepancy between predicted and empirical factor structure in χ2 and indices of "goodness of fit" (GOF), while primary factor loadings and modification indices provide some feedback on item level. However, the latter feedback is very limited, while χ2 and the GOF indices appear to be problematic. This will be demonstrated by a selective review of the literature on CFA.

For construct validation of psychopathology and personality questionnaires, researchers often make use of confirmatory factor analysis (CFA), especially when the tests are supposed to be multidimensional. For this, a covariance matrix is calculated over the scores of a number of subjects, and CFA is then applied to test whether a presumed factor structure or pattern is not contradicted by this matrix. CFA is executed by means of structural equation modeling (SEM), a very sophisticated statistical procedure for testing complex theoretical models on data; CFA is only used for the measurement part of the models. Since a computer program became available for SEM (LISREL; Jöreskog & Sörbom, 1974), this method has gained much in popularity. LISREL has been updated several times (Jöreskog & Sörbom, 1988), and there are several similar programs available now, e.g., Amos (Arbuckle, 2004; now included in SPSS), EQS (Bentler, 2000-2008), and Mplus (Muthén & Muthén, 1998-2010). All these can run on current personal computers, so the threshold to using CFA has become very low.

However, problems with the method have also increasingly been reported, especially since the turn of the century (e.g., Breivik & Olsson, 2001; Browne, MacCallum, Kim, Andersen, & Glaser, 2002; Tomarken & Waller, 2003). Current users of the method, working in the applied social sciences, may be less aware of these limitations of CFA than statisticians are and be over-optimistic about the reliability of the method when striving to validate questionnaires. The current paper is primarily meant for them. Researchers well initiated in statistics will probably find little new in it, but to see the problems recapitulated in a systematic way may motivate them to take note of the alternative approach to goodness-of-fit which I briefly sketch in the last section. Also, the two cautious explanations of puzzling findings, suggested in later sections, may invite them to dispute.

Test scales are devised to measure certain abilities or skills, whereas questionnaire scales are devised to measure, for instance, certain personality traits, diagnostic categories, or psychological conditions. So there is much difference between test items, which provide an objective correct–incorrect score, and questionnaire items, which provide a subjective rating of oneself or another person, often on a quasi-interval scale of 3 to 10 points.

Nevertheless, questionnaires are often referred to as tests in the literature. For reasons of convenience the term "test" will be used interchangeably with "questionnaire" in this paper as well, yet it should never be interpreted as meaning a measure of aptitudes and abilities. This paper is about questionnaire validation only.

Item Clustering as Feedback on a Test's Theoretical Basis and Validity

Because they are meant to tap an aspect of a certain construct, the items within a questionnaire are supposed to have at least modest inter-correlations and should cluster. If a questionnaire is supposed to measure several distinct qualities, then the items should show a clustering corresponding to these various subscales.

It follows that an empirically found item clustering that corresponds to the ideas that guided the construction of the questionnaire is strong support for these ideas, as well as for the content validity of the items and the construct validity of the questionnaire. If the predicted clustering differs vastly from the empirically found clustering, the theory behind it could be considered faulty, and/or the test scale a mistaken operationalization, provided a good sample has been drawn. However, if the difference is moderate the discrepancies could be used to refine the theory and/or for further improvement of the measuring instrument.

The discrepancy could involve a low or opposite correlation of an item with its predicted scale. This may imply that the item is either poorly formulated or a poor operationalization of the phenomenon it represents, or that there is a flaw in the theory. The discrepancy could also involve a high correlation between an item and a scale to which it had not been assigned. In that case, the theory probably needs modification. Other flaws are: a correlation between two scales is much higher than expected, even to the point that the two scales could be considered as forming a single one-dimensional scale; or it is much lower than expected. In both cases, the theory will need to be reformulated to some extent.

Methods For Testing a Predicted Item Clustering

How does one test and evaluate the empirical clustering in relation to an a priori cluster prediction? A very direct and controlled way would be to examine the correlations between the items and the predicted scales (the latter are mostly operationalized as the unweighted sum of the item scores). These correlations should be in line with the prediction, and to the extent that they are not they offer a basis for revising the scales in the direction of greater homogeneity and independence. This method is known as item analysis (along the lines of classical test theory, not of item response theory!); many test developers make use of it. If the clusters are indeed revised, the revision must be performed iteratively in a very gradual manner (Stouthard, 2006), because after each modification the entire picture of cluster-item correlations will change. Test devisers do not go to extremes with these modifications because it is not wise to mold one's well thought-through test to every whim of imperfect empirical data.
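To make the item-analysis procedure concrete, here is a minimal sketch (in Python, with hypothetical data and scale assignments, not taken from the paper) that computes the correlation of every item with the unweighted sum score of each predicted scale, removing the item from its own scale sum so that the correlation is not inflated:

```python
import numpy as np

def item_scale_correlations(items, scales):
    """items: (n_subjects, n_items) array of raw item scores.
    scales: dict mapping scale name -> list of item column indices.
    Returns {(item_index, scale_name): r}, where r is the correlation of the
    item with the unweighted sum of the scale's items (the item itself is
    dropped from its own scale sum, i.e., a corrected item-total correlation)."""
    result = {}
    for name, columns in scales.items():
        for j in range(items.shape[1]):
            members = [c for c in columns if c != j]
            scale_sum = items[:, members].sum(axis=1)
            result[(j, name)] = np.corrcoef(items[:, j], scale_sum)[0, 1]
    return result

# Hypothetical example: 200 respondents, 8 items, two predicted 4-item scales.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 8)).astype(float)
prediction = {"Scale A": [0, 1, 2, 3], "Scale B": [4, 5, 6, 7]}
correlations = item_scale_correlations(responses, prediction)
print(correlations[(0, "Scale A")], correlations[(0, "Scale B")])
```

Items that correlate weakly with their own predicted scale, or more strongly with a competing scale, would be the candidates for the kind of revision described above.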
Although this type of item analysis appears to be tailored for testing (and revising) cluster predictions, statisticians are uncomfortable about relying so totally on zero-order correlations between raw item scores and on equating cluster scores with the unweighted sum of the test scale items. Common factor analysis seems a better option because in this approach the variance per item is divided into a common part (common with the factor on which the item loads) and a unique part (item-specific variance plus error). Principal axis factor analysis is the most applied form of common factor analysis. It has partly replaced principal component analysis, which is based on the undivided variance of variables. In factor analysis all variables contribute—with a greater or smaller weight—to each factor.

However, these are examples of exploratory factor analysis (EFA). EFA is applied to data without an a priori model. It traces dimensions within a covariance or correlation matrix to the point at which enough variance has been explained. To further optimize these dimensions, rotation is performed: orthogonal, when independent scales are expected, and oblique, when some of the dimensions are expected to correlate moderately or highly (up to a ceiling). The resulting factors may be compared to the presumed test scales: are the items of Scale A the ones loading highly on Factor X, those of Scale B on Factor Y, etc.? And are they loading much lower on a different factor? Is the correlation between the factors in line with the expected scale correlations?

The disadvantage of this often applied approach may be that the empirical factor structure/pattern³ is too much affected by incidentally extreme item inter-correlations or by over- or under-representation of certain items, yielding factors that differ in number and content from the test scales. When a result is due to unrepresentative properties of the correlation/covariance matrix, such a result need not indicate a faulty prediction. However, adjusting the predicted structure/pattern to the data without unnecessarily giving up the prediction, on the one hand, and not violating the data, on the other, seems a less vulnerable procedure. CFA offers a way to achieve this, based on common factor analysis with its division in common and unique variance of the variables. How does this work?

³ Structure refers to the matrix of factor–item correlations (factor loadings) after both an orthogonal and an oblique rotation; pattern refers to the matrix of pattern coefficients (standardized beta weights), which is part of the output after an oblique rotation only.


Testing Predicted Factor Structure by Means of CFA

The prediction of the factor structure/pattern of a test involves the number of factors and the specification of the test items that define each factor (the so-called indicators), i.e., those which are expected to have high to moderately high loadings (or beta coefficients) on the factor. The investigator will most likely also have expectations about the correlation between the factors and perhaps about some cross-loadings. Unique variance of the observed variables is also part of the model. These covariances, as well as the variances of the variables involved, are the parameters of the measurement model.

To test this predicted factor structure/pattern with CFA, the following procedure is applied:

1. The measurement model is translated "back" into a crude covariance matrix over all measured variables.
2. This covariance matrix is then adjusted to the empirically found sample covariance matrix in a number of iterations, mostly by means of maximum likelihood estimation (MLE). This is done in such a way that the difference between the two is minimized without violating the data too much. The final result is the implied covariance matrix.
3. For this process to come to an end and produce a result that helps to evaluate the model, the prediction must have been detailed enough, or, in SEM jargon, the model has to be over-identified. A model is over-identified when the number of known elements (non-redundant variances plus covariances of the sample matrix⁴) exceeds the number of unknown parameters that have to be estimated by the iterative process.
4. Parameters that have to be estimated are: primary factor loadings of the indicators (mostly except one), factor variances and covariances (when not explicitly predicted to be very high or low), and error variances. The error covariances that are expected to correlate non-negligibly (an indicator inter-correlation beyond the result due to the common factor), if any, have to be added to this number; the others will not be estimated.
5. Parameters that are fixed at a certain value will not be estimated. Usually, one indicator per factor is fixed at 1 (non-standardized value) for the sake of adjusting the scaling of the factor to that of the indicators. Secondary factor loadings (non-indicators) are fixed at 0, except when a cross-loading is probable; then it has to be estimated. If the model prescribes an orthogonal factor structure, the factor covariances are fixed at 0; if the model prescribes a clearly oblique structure, they are fixed at 1; otherwise, they are to be estimated. Error covariances that are thought to be negligible have to be fixed at 0. (Fixing at 1 does not imply a correlation of 1.)
6. The fixed parameters strongly affect the adjustment procedure mentioned in point 2 by constraining it. The estimation of the non-fixed parameters is free, albeit steered by the model.
7. This procedure of iteratively adjusting the model matrix to the sample matrix replaces the orthogonal or oblique rotation of factors (maximizing high primary factor loadings and minimizing low cross-loadings), which is typical of EFA.
8. The resulting implied covariance matrix is compared with the sample covariance matrix. The difference is the residual covariance matrix. It may be standardized to facilitate interpretation.
9. This provides a basis for calculating the degree to which the predicted factor structure/pattern fits the data. If the residual covariances are, on average, small enough, then the model fits the data well.
10. These differences are expressed in χ2, with degrees of freedom (df) equaling the number of known parameters minus the number of unknown, non-fixed parameters, i.e., the parameters that will be estimated. See points 4 and 5 for more details.
11. This χ2 should be small enough—in relation to df—to be merely the result of chance deviations in the sample with respect to the population (when larger, it is ascribed to prediction errors).

The main source for the account above is Brown and Moore (2012; also see Brown, 2015).

⁴ The lower half of the matrix, variances included. It equals p(p + 1)/2, where p = number of variables.
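As a worked example of the bookkeeping in points 3–5 and footnote 4, the following sketch counts the known elements and freely estimated parameters of a simple measurement model under the common marker-variable convention (one loading per factor fixed at 1, correlated factors, no correlated errors unless added); it reproduces the df of 87 and 84 used in the Marsh, Hau, and Wen example discussed later. The function name and defaults are illustrative only.

```python
def cfa_df(n_items_per_factor, n_factors, n_cross_loadings=0, n_error_covs=0):
    """Degrees of freedom for a simple CFA measurement model in which one
    marker loading per factor is fixed at 1 and the factors may covary."""
    p = n_items_per_factor * n_factors
    known = p * (p + 1) // 2                       # non-redundant (co)variances
    loadings = p - n_factors                       # one fixed marker per factor
    factor_vars = n_factors
    factor_covs = n_factors * (n_factors - 1) // 2
    error_vars = p
    free = (loadings + factor_vars + factor_covs
            + error_vars + n_cross_loadings + n_error_covs)
    return known - free

# Three correlated factors with five indicators each (the "simple model"):
print(cfa_df(5, 3))                       # 120 known pieces of information - 33 = 87
print(cfa_df(5, 3, n_cross_loadings=3))   # the "complex model": 120 - 36 = 84
```

The model is over-identified as long as the returned value is positive; alternative identification choices (e.g., fixing the factor variances at 1 instead of a marker loading) lead to the same df.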
Complications With χ2

The residuals, and thereby χ2, will often reflect an imperfection of the model, especially in the primary and intermediate phases of a research project. When this is the case, this feedback should be used for a revision of the theory and the SEM model. However, in the final stage of the research project, when the model has become well thought-out, the residuals will probably, on average, still deviate from zero. Now, the difference may merely or mainly be due to sample imperfections and fluctuations with respect to the population. Cudeck and Henly (1991) called the model imperfection error the approximation discrepancy, and the sample imperfection error the estimation discrepancy. When applying the χ2 test, the assumption is that, on the population level, the approximation discrepancy is zero (the prediction is perfect), so the empirically found difference on the sample level has to be only due to the estimation discrepancy (sample fluctuations).


If the latter is small, then the null hypothesis, "the model does not hold for the population," can be rejected.

How suited is the χ2 test for demonstrating statistical significance of a predicted factor structure/pattern by means of CFA? Conventionally, χ2 is used to express a difference between an empirically found distribution and the distribution to be expected based on a null hypothesis. This difference should be robust enough to hold—within the boundaries of the confidence interval—for the population as well. However, in CFA the difference should be the opposite of robust; i.e., the difference should be small enough to accept both the predicted factor structure/pattern and its generalization to the population. Therefore, paradoxically, χ2 should be statistically non-significant to indicate a statistically significant fit.

Demanding this may be asking for trouble. Indeed, with large samples (and SEM demands large samples), even very small differences may be deemed significant by current χ2 tables, suggesting a poor fit, in spite of the greater representativeness of a large sample. Many experts in multivariate analysis have thought this to be a problem. An exception is Hayduk. He argues that we should profit from χ2's sensitivity to model error and take the rejection of a model as an invitation for further investigation and improvement of the model (see Hayduk, Cummings, Boadu, Pazderka-Robinson, & Boulianne, 2007). On SEMNET, 3 June 2005, Hayduk notes that "χ2 locates more problems when N is larger, so that some people blame chi-square (the messenger) rather than the culprit (probably the model)."

The other side of the coin is that models can never be perfect, as MacCallum (2003) contended, because they are simplifications of reality (see also Rasch, 1980, p. 92). Therefore, models unavoidably contain minor error. In line with this argument, there are always a few factors in exploratory factor analysis that still contribute to the total explained variance but so little that it makes no sense to take them into account. Thus, these factors should be ignored in CFA as well. Nevertheless, if the measurements are reliable and the sample is very large, such minor model error may yield a significant χ2 value, urging the rejection of a model that cannot further be improved.

What about expressing the approximation discrepancy? Is χ2 suited for that job? No, its absolute value is not interpretable: it must always be evaluated with respect to df and N. χ2 can only be used to determine the statistical significance of an empirically found value, the estimation discrepancy. Therefore, some statisticians also report χ2/df, because both χ2 and df increase as a function of the number of variables. This quotient is somewhat easier to interpret, but there is no consensus among SEM experts about which value represents what degree of fit.
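To make the mechanics concrete: under standard ML estimation the test statistic is usually computed as T = (N − 1) · F_ML, where F_ML is the minimized discrepancy value returned by the estimation routine (some programs use N instead of N − 1), and T is referred to a χ2 distribution with the model's df. A small sketch with made-up numbers:

```python
from scipy.stats import chi2

def chi_square_test(f_ml, n, df):
    """T = (N - 1) * F_ML, referred to a chi-square distribution with df."""
    t = (n - 1) * f_ml
    p_value = chi2.sf(t, df)   # probability of a larger T if the model were exact
    return t, t / df, p_value

# Hypothetical minimized discrepancy for a model with df = 87 and N = 400.
t, t_per_df, p = chi_square_test(f_ml=0.30, n=400, df=87)
print(f"chi2 = {t:.1f}, chi2/df = {t_per_df:.2f}, p = {p:.3f}")
```

Because T grows roughly linearly with N for a fixed F_ML, the same modest discrepancy that is non-significant in a small sample becomes significant in a large one, which is exactly the sensitivity discussed above.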


Indices of Goodness of Fit

To cope with these complications and this problem, SEM experts have tried to devise other indices of "goodness of fit" or "approximate fit." These should express the degree of approximation plus estimation discrepancy, and provide an additional basis for the acceptance or rejection of a model. All but one of these goodness-of-fit indices are based on χ2 and df, and some also include N in the formula. The remaining one (SRMR) is based directly on the residuals. Several suggestions have been made regarding their critical cutoff values (determining acceptance or rejection of a model), among which those of Hu and Bentler (1998, 1999) have been very influential.

Over the years, these indices have been investigated in numerous studies using empirical data and, more often, simulated data. Time and again they have been shown to be unsatisfactory in some respect; thus, adapted and new ones have been devised. Now, many of them are available. Only four of them will be mentioned below because they are often reported in CFA studies and they suffice to make my point. The formulas are derived from Kenny (2012), who, it should be noted, briefly makes several critical remarks in his discussion of the indices.

Standardized Root Mean Square Residual (SRMR; Jöreskog & Sörbom, 1988)

The most direct way of measuring discrepancy between model and data is averaging the residuals of the residual correlation matrix. This is what is done in SRMR: the residuals (S_ij − I_ij) are squared and then summed. This sum is divided by the number of residuals, q. (The residuals include the diagonal with communalities, so q = p(p + 1)/2, where p is the number of variables.) Then, the square root of this mean is taken. (In the formula below, S denotes the sample correlation matrix, and I stands for the implied correlation matrix.)

\[ \mathrm{SRMR} = \sqrt{\frac{\sum_{i \le j} (S_{ij} - I_{ij})^2}{q}} \]

A value of 0 indicates perfect fit. Hu and Bentler (1998, 1999) suggest a cutoff value of ≤ .08 for a good fit. Notice that χ2 is not used to calculate SRMR.
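A direct translation of this definition into code (a sketch, assuming both matrices are correlation matrices of the same order and averaging the lower triangle including the diagonal, so q = p(p + 1)/2):

```python
import numpy as np

def srmr(sample_corr, implied_corr):
    """Standardized root mean square residual over the lower triangle
    (diagonal included), following the definition given above."""
    s = np.asarray(sample_corr, dtype=float)
    i = np.asarray(implied_corr, dtype=float)
    rows, cols = np.tril_indices(s.shape[0])    # q = p(p + 1)/2 elements
    residuals = s[rows, cols] - i[rows, cols]
    return np.sqrt(np.mean(residuals ** 2))

# Tiny illustration with invented 2 x 2 matrices:
S = np.array([[1.0, 0.42], [0.42, 1.0]])
I = np.array([[1.0, 0.50], [0.50, 1.0]])
print(round(srmr(S, I), 3))   # 0.046
```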
Root Mean Square Error of Approximation (RMSEA; Steiger, 1990)

Root mean square error of approximation (RMSEA) has a much more indirect relation with the residuals because it is based on χ2, df, and N. Its formula is

\[ \mathrm{RMSEA} = \sqrt{\frac{\chi^2/df - 1}{N - 1}}, \]

which could also be expressed as

\[ \mathrm{RMSEA} = \sqrt{\frac{\chi^2 - df}{df\,(N - 1)}}. \]

By dividing by df, RMSEA penalizes free parameters. It also rewards a large sample size because N is in the denominator. A value of 0 indicates perfect fit. Hu and Bentler (1998, 1999) suggested ≤ .06 as a cutoff value for a good fit.

Tucker-Lewis Index (TLI; Tucker & Lewis, 1973)

The Tucker-Lewis Index, also known as the non-normed fit index (NNFI), belongs to the class of comparative fit indices, which are all based on a comparison of the χ2 of the implied matrix with that of a null model (the most typical being that all observed variables are uncorrelated). Those indices that do not belong to this class, such as RMSEA and SRMR, are called absolute fit indices. Comparative fit indices have an even more indirect relation with the residuals than RMSEA. The formula of TLI is

\[ \mathrm{TLI} = \frac{\chi^2_{null}/df_{null} - \chi^2_{implied}/df_{implied}}{\chi^2_{null}/df_{null} - 1}. \]

Dividing by df penalizes free parameters to some degree. A value of 1 indicates perfect fit. TLI is called non-normed because it may assume values < 0 and > 1. Hu and Bentler (1998, 1999) proposed ≥ .95 as a cutoff value for a good fit. It is similar to the next index.

Comparative Fit Index (CFI; Bentler, 1990)

Here, subtracting df from χ2 provides some penalty for free parameters. The formula for CFI is

\[ \mathrm{CFI} = 1 - \frac{\chi^2_{implied} - df_{implied}}{\chi^2_{null} - df_{null}}. \]

Values > 1 are truncated to 1, and values < 0 are raised to 0. Without this "normalization," this fit index is the one devised by McDonald and Marsh (1990), the Relative Non-centrality Index (RNI). Hu and Bentler (1998, 1999) suggested CFI ≥ .95 as a cutoff value for a good fit. Marsh, Hau, and Grayson (2005, p. 295) warned that CFI has a slight downward bias, due to the truncation of values greater than 1.0.

Kenny (2012) warned that CFI and TLI are artificially increased (suggesting better fit) when the correlations between the variables are generally high. The reason is that the customary null model (all variables are uncorrelated) has a large discrepancy with the empirical correlation matrix in the case of high correlations between the variables within the clusters, which will give rise to a much larger χ2 than the implied correlation matrix will. This affects the fractions in CFI and TLI, moving the quotient in the direction of 1. Rigdon (1996) was the first to raise this argument; later he advised using a different null model in which all variables have an equal correlation above zero (Rigdon, 1998).
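Once χ2, df, and N are available for the target model and for the null model, the three χ2-based indices above are only a few lines of arithmetic. The sketch below simply restates the formulas (clipping a negative value under the RMSEA root to 0 and truncating CFI to the 0–1 range, as described); the numeric values are invented for illustration.

```python
import math

def rmsea(chi2, df, n):
    return math.sqrt(max((chi2 / df - 1) / (n - 1), 0.0))

def tli(chi2_m, df_m, chi2_null, df_null):
    return ((chi2_null / df_null - chi2_m / df_m)
            / (chi2_null / df_null - 1))

def cfi(chi2_m, df_m, chi2_null, df_null):
    value = 1 - (chi2_m - df_m) / (chi2_null - df_null)
    return min(max(value, 0.0), 1.0)   # truncation that distinguishes CFI from RNI

# Hypothetical values for a model with df = 87, N = 400, and its null model (df = 105):
print(round(rmsea(119.7, 87, 400), 3),
      round(tli(119.7, 87, 2150.0, 105), 3),
      round(cfi(119.7, 87, 2150.0, 105), 3))
```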


Determination of Cutoff Values by Simulation Studies

How do multivariate experts such as Hu and Bentler (1998, 1999) determine what values of goodness-of-fit indices represent the boundary between the acceptance and rejection of a model? They do so mainly on the basis of simulation studies. In such studies, the investigator generates data in agreement with a predefined factor structure/pattern, formulates correct and incorrect factor models, draws a great many samples of different sizes, and observes what the values of a number of fit indices of interest will do.

One of the conveniences of simulation studies is that the correct and incorrect models are known beforehand, which provides a basis, independently of the fit index values, for determining whether a predicted model should be rejected or accepted. To express the suitability of the selected cutoff value of a goodness-of-fit index, the percentage of rejected samples is reproduced, i.e., the samples for which the fit index value is on the "rejection side" of the cutoff value. The rejection rate should be very small for correct models and very large for incorrect models. What percentage is to be demanded as a basis for recommending a certain cutoff value is often not stated explicitly. It seems reasonable to demand a rate of ≤ 10% or even ≤ 5% for correct models and at least ≥ 90% or even ≥ 95% for incorrect models, considering that mere guessing would lead to a rate of 50% and p ≤ .05 is conventionally applied in significance testing. A limitation of most simulation studies is that there are very few indicators (e.g., 3 to 6) per factor; usually, the number of items per scale of the typical tests or questionnaires is larger, especially in the initial stage of its development.
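The data-generation half of such a simulation is easy to reproduce. The sketch below builds the population covariance matrix implied by a small two-factor model (Σ = ΛΦΛ′ + Θ) and draws repeated samples from it; fitting each sample and computing the fit indices would then be handed to whatever SEM package one prefers. The loadings, factor correlation, and sample sizes are arbitrary illustrations, not the values used by Hu and Bentler or Marsh, et al.

```python
import numpy as np

# Hypothetical population model: two correlated factors, four indicators each.
loadings = np.zeros((8, 2))
loadings[:4, 0] = 0.7
loadings[4:, 1] = 0.7
phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])                          # factor correlation matrix
theta = np.diag(1.0 - np.sum(loadings ** 2, axis=1))  # unique variances

sigma_pop = loadings @ phi @ loadings.T + theta       # implied population covariance

rng = np.random.default_rng(1)
n_per_sample, n_samples = 200, 500
sample_covariances = [
    np.cov(rng.multivariate_normal(np.zeros(8), sigma_pop, size=n_per_sample),
           rowvar=False)
    for _ in range(n_samples)
]
# Each sample covariance matrix would now be fed to the CFA routine of an SEM
# package, and the resulting fit index values compared against candidate cutoffs
# to obtain rejection rates for correct and deliberately misspecified models.
```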
The study performed by Marsh, Hau, and Wen (2004), who replicated the studies of Hu and Bentler (1998, 1999), may serve as an example of this method for determining cutoff values.

Hu and Bentler had set up a population that corresponded to the following models: three correlated factors with five indicators each, with either (1) no cross-loadings, the simple model, or (2) three cross-loadings, the complex model. So the simple model had 33 parameters to estimate (3 × 4 factor loadings + 3 × 5 indicator variances + 3 factor covariances + 3 factor variances), whereas the complex model had 36 parameters to estimate (these 33, plus 3 additional factor loadings). With 15 variables, there were 15 × 16 / 2 = 120 known parameters. So the df for the true simple model was 120 − 33 = 87, and for the true complex model 120 − 36 = 84.

The misspecification in the simple model involved one or two factor correlations misspecified to be zero (orthogonal instead of oblique), whereas in the complex model one or two cross-loadings were overlooked (in other words: incorrectly held to be zero). So the df for the false simple models were 88 and 89, respectively, whereas those for the false complex models were 85 and 86, respectively.

The population of Marsh, et al. (2004) involved 500,000 cases. Samples of 150, 250, 500, 1,000, and 5,000 cases were drawn. (The number of samples was not mentioned.) MLE was applied. The dependent variable was the rejection rate per goodness-of-fit index with the cutoff value advised by Hu and Bentler (1999). Unlike Hu and Bentler (1999), Marsh, et al. (2004) calculated the population values of χ2 and the indices, which allows for a direct comparison of sample and population.

To provide the less initiated reader an idea of the results of such studies, a small portion of Tables 1a and 1b of Marsh, et al. (2004) is reproduced⁵ in Table 1.

⁵ With the kind permission of Herbert Marsh (May, 2013).

TABLE 1
Cutoff Values and Related Percentage Rejection for RMSEA and SRMR (Marsh, et al., 2004)

Simple model (factor correlations wrong):
Index (cutoff)    Misspecification  Popul. value  M (N=150)  M (N=1,000)  Reject % (N=150)  Reject % (N=1,000)
RMSEA (≤ .06)     None              0.001         0.01       0.001          1                 0
                  Smallest          0.05          0.04       0.05          11                 0
                  Severest          0.05          0.05       0.05          19                 0
SRMR (≤ .08)      None              0.001         0.05       0.02           0                 0
                  Smallest          0.14          0.14       0.14         100               100
                  Severest          0.17          0.17       0.17         100               100
RNI (≥ .95)*      None              1.00          0.99       1.00           1                 0
                  Smallest          0.97          0.97       0.97          18                 0
                  Severest          0.96          0.96       0.96          36                 2
χ2/df (α = .05)   None              —             1.05       1.03          14.5              11.5
                  Smallest          —             1.40       3.39          73.5             100
                  Severest          —             1.52       4.21          92.0             100

Complex model (factor loadings wrong):
Index (cutoff)    Misspecification  Popul. value  M (N=150)  M (N=1,000)  Reject % (N=150)  Reject % (N=1,000)
RMSEA (≤ .06)     None              0.00          0.01       0.001          0                 0
                  Smallest          0.07          0.07       0.07          67                99
                  Severest          0.09          0.09       0.09         100               100
SRMR (≤ .08)      None              0.001         0.04       0.02           0                 0
                  Smallest          0.06          0.07       0.06          19                 0
                  Severest          0.07          0.08       0.07          62                 0
RNI (≥ .95)*      None              1.00          0.99       1.00           0                 0
                  Smallest          0.95          0.95       0.95          53                39
                  Severest          0.91          0.91       0.91          98               100
χ2/df (α = .05)   None              —             1.05       1.03          11.0               8.0
                  Smallest          —             1.76       5.87          98.5             100
                  Severest          —             2.34       9.90         100               100

Note. Popul. value: population value (N = 500,000; not reported for χ2/df); Reject: percentage of rejected models for this sample size.
*RNI replaces CFI here; see the section "Determination of Cutoff Values by Simulation Studies."

(a) Table 1 shows that the mean sample values of the first two fit indices are closer to the population values for N = 1,000 than those for N = 150. This result is observed because smaller samples have greater sample fluctuations.
(b) It is further demonstrated that the population values (and mean sample values) of RMSEA for the simple model are below the advised cutoff values of Hu and Bentler (1999). In line with this observation, the rejection rate for N = 1,000 is 0%, whereas it should be ≥ 90%.
(c) For the complex model, the population value (and mean sample values if N = 1,000) of SRMR is below the advised cutoff value; thus, the rejection rate is 0%, whereas it should be ≥ 90%.
(d) For RNI (comparable with CFI), the population values are on the acceptance side of the cutoff values in the two misspecified simple models and in the least misspecified complex model, and the rejection rates are accordingly low.
(e) Table 1 further shows that, ironically, χ2 performs better with respect to both models simultaneously than RMSEA and RNI, although it leads to more than 10% rejection of the correct models in three out of four cases.

This replication by Marsh, et al. (2004), therefore, did not confirm the advised cutoff values of Hu and Bentler (1999). Several of the misspecified models scored on the acceptance side of the cutoff values of the goodness-of-fit indices. Thus, what should be a cutoff value depends, first, on the type of misspecification one is interested in and, secondly, on the degree of misspecification one is willing to tolerate for each type (Marsh, et al., 2004).

It can further be concluded from Table 1 that smaller samples may be problematic when assessing the correctness of a model: for N = 150, the mean RMSEA values for incorrect models are somewhat lower than the population values (decreasing the rejection rates), whereas the mean SRMR values for both correct and incorrect models are higher than the population values, increasing rejection rates for incorrect models (albeit insufficiently in this replication study). χ2, too, leads to an insufficient rejection rate for incorrect simple models in the case of the smallest misspecification.


SRMR seems to do very well in the case of the simple model (one or two factor correlations are zero instead of moderately positive), but Fan and Sivo (2005) showed that this was due to the fact that these zero factor correlations produce many zero variable correlations in the implied matrix, leading to large residuals and, consequently, a high SRMR because of its very direct relation to the residuals, more so than the other fit indices. If the factor correlations were misspecified to be 1 instead of merely high (meaning that the factors had to be fused), then SRMR no longer showed a special sensitivity to misspecified factor loadings.

Finally, note that only two kinds of misspecifications were investigated. Incorrectly assigned indicators and correlated error terms, for instance, were not modeled. This limits the degree to which the cutoff values may be generalized even more. Note further that even when the fit index values are beyond their cutoff criterion, they still hold some utility as measures of the approximation discrepancy.

Why Should One Draw Samples at All?

As indicated, Marsh, et al. (2004) calculated the population values of the fit indices, and when the values appeared to be rather far removed from an advised cutoff value, the rejection rates were close to 0% or 100%, depending on which side of the cutoff value each population value was. Chen, Curran, Bollen, Kirby, and Paxton (2008) reported similar experiences in their simulation study, which was designed to test the cutoff values for RMSEA (see the next section for more details regarding that study). In Table 2, the population values for the three correct and misspecified models are reproduced. Six out of nine population values for the misspecified models were below a cutoff value of 0.06 or even 0.05.

TABLE 2
Population Values in the Study of Chen, et al. (2008), Derived From Their Table 1

Misspecification   Population value of RMSEA
                   Model 1   Model 2   Model 3
None               0.00      0.00      0.00
Smallest           0.03      0.02      0.05
Moderate           0.04      0.03      0.08
Severest           0.06      0.04      0.10

Note. Misspecification: overlooking cross-loadings and/or correlations with exogenous variables.

The information provided by the population values places one in a position to determine whether a selected cutoff value is too liberal, even before having drawn any sample. Thus, what is the use of drawing all those samples? Miles and Shevlin (2007) and Saris, Satorra, and van der Veld (2009) refrained from drawing any real sample and contented themselves with an imaginary sample of N = 500 and N = 400, respectively, perfectly "mirroring" the population values, saving themselves much calculation time [see Sections "Saris, Satorra, and van der Veld (2009)" and "Large Unique Variance Promotes an Illusory Good Fit" in the present article]. Sample drawing is only useful when the cutoff value is rather close to the population value, because then the rejection rates, especially in the case of the smaller samples, are not obvious.

Reconciling Avoidance of Both Type I Error and Type II Error

In addition, cutoff values should be such that two types of error are avoided at the same time. They should be strict enough to avoid accepting a model when in fact it is incorrect, which represents a Type I error (a false positive). Alternatively, the cutoff value should be lenient enough to avoid rejecting a model that is correct, which represents a Type II error (a false negative). Can these two opposing interests always be reconciled, for any sample size, in CFA? Table 1 shows that the answer depends on the combination of model, sample size, and severity of misspecification for RMSEA, RNI, and SRMR. χ2 does not appear to be suited to reach 95% acceptance of correct models. In this case, the goodness-of-fit indices do better.

The study by Chen, et al. (2008), referred to in the previous section, offered an excellent opportunity for determining the degree to which both errors can be avoided at the same time, at any rate in the case of RMSEA. Chen, et al. (2008) investigated three models: (1) in Model 1, three factors with three indicators each plus one cross-loading indicator each; (2) in Model 2, three factors with five indicators each plus one cross-loading indicator each; and (3) like Model 1 but with four inter-correlating exogenous variables correlated with one factor, of which two variables also correlate with the other two factors. In Models 1 and 2, the misspecification involved omitting one, two, or three of the cross-loading indicators from the prediction. In Model 3, the smallest misspecification was omitting all three cross-loadings from the prediction, the moderate misspecification was omitting the four correlations of exogenous variables with the factors, and the largest misspecification combined the small and moderate misspecifications. The sample sizes were 50, 75, 100, 200, 400, 800, and 1,000. The investigators generated 800 samples for each of the 84 experimental conditions.


The authors investigated the rejection rates for RMSEA cutoff points ranging from 0 to 0.15 with increments of 0.005. From their results it was clear that avoiding Type I and Type II error at the same time was often not possible for N ≤ 200. As an illustration of their findings, Table 3 shows what is inferred from their Figs. 4–15 for N = 200 and N = 1,000 only (this table is not a literal reproduction of one of their tables or a part of these, but is inferred from their Figs. 4–15).⁶

⁶ Findings reproduced with the kind permission of Feinian Chen (May, 2013).

TABLE 3
Cutoff Values of RMSEA Required to Reach Good Rejection Rates (Chen, et al., 2008, Inferred From Figs. 4–15)

To effect ≥ 90% acceptance of correct models, the computed RMSEA cutoff must be:
Misspecification   Model 1 (N=200)  Model 2 (N=200)  Model 3 (N=200)  Model 1 (N=1,000)  Model 2 (N=1,000)  Model 3 (N=1,000)
None               ≥ 0.05           ≥ 0.04           ≥ 0.04           ≥ 0.02             ≥ 0.02             ≥ 0.02

To effect ≥ 90% rejection of incorrect models, the computed RMSEA cutoff is restricted to be:
Misspecification   Model 1 (N=200)  Model 2 (N=200)  Model 3 (N=200)  Model 1 (N=1,000)  Model 2 (N=1,000)  Model 3 (N=1,000)
Smallest           fails, ± 64%     fails, ± 77%     ≤ 0.03, no       ≤ 0.01, no         ≤ 0.02, dubious    ≤ 0.04, yes
Moderate           fails, ± 80%     ≤ 0.01, no       ≤ 0.07, yes      ≤ 0.03, yes        ≤ 0.03, yes        ≤ 0.08, yes
Severest           ≤ 0.04, no       ≤ 0.03, no       ≤ 0.08, yes      ≤ 0.06, yes        ≤ 0.04, yes        ≤ 0.09, yes

Note. "Fails" indicates that even a cutoff value of 0 is not suited to effect 90% rejection of incorrect models; the actual rejection rate observed for RMSEA = 0 is printed in these cases. "No" indicates that the cutoff value for 90% rejection of incorrect models is irreconcilable with that for 90% acceptance of correct models; "yes" indicates that it is reconcilable.

If N = 200, only the moderate and severest misspecification in Model 3 allow for cutoff values that are suited for both accepting 90% or more correct models and rejecting 90% or more of incorrect models. In the seven other cases, no such cutoff values can be found. In three cases, even RMSEA = 0 is not strict enough to arrive at 90% rejection! For N = 1,000, the results are much better: seven out of nine cases survived. However, the cutoff value that seems advisable based on this study would be 0.025 for N = 1,000, which is most likely too strict for many other studies, especially for those with real data in which minor model error is unavoidable (MacCallum, 2003; see also Section "Three Degrees of Unique Variance and Their Effect on Goodness of Fit"). Moreover, a sample size of N = 1,000 is often out of reach in empirical studies with real data, especially in the field of psychopathology.

High Reliability (Small Unique Variance) Spoils the Fit

The variance in values for a variable within a set of variables is divided into common variance and unique variance. Common variance is a function of one or more features of the variables they have in common, and unique variance is a result of the influence of specific factors and measuring error. High common variance implies high correlations between the variables. If there is much error, i.e., low reliability, then the correlations between the linked variables are attenuated. In psychological research, making use of questionnaires or tests, there is often moderate reliability at best. However, in biological or physical research very reliable and homogeneous measurements are within reach. One would assume that this would be ideal for model testing.
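The link between reliability and the size of the inter-item correlations can be made explicit with a standard result of the common factor model (not specific to the study discussed next): for standardized indicators of a single factor,

\[ r_{ij} = \lambda_i \lambda_j, \qquad \theta_i = 1 - \lambda_i^2, \]

so loadings of about .90 imply inter-item correlations near .81 and unique variances near .19, whereas loadings of about .50 imply inter-item correlations of only .25 and unique variances of .75.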
Paradoxical Effect of a High Reliability

However, an article by Browne, et al. (2002) seemed to demonstrate the opposite. The researchers discussed data obtained from a clinical trial of the efficacy of a psychological intervention in reducing stress and improving health behaviors for women with breast cancer (Andersen, Farrar, Golden-Kreutz, Katz, MacCallum, Courtney, et al., 1998). One focus was on two types of biological responses of the immune system to the intervention. For each response there were four replicates. These 2 × 4 replicates were treated as indicators of two corresponding and related but distinct factors. Because the measures were biological and replicates, one may expect a high reliability and homogeneity among them, which implies a small unique variance. Indeed, on the face of it, the correlation matrix showed two distinct clusters of highly inter-correlating variables: 1–4 (M inter-correlation = .85) and 5–8 (M inter-correlation = .96). An MLE CFA was carried out hypothesizing these two factors. In spite of the clear picture (the residual matrix only contained values very close to zero) and the small sample (N = 72), χ2 was significant, urging rejection of the model, as did RMSEA and three other absolute GOF indices. The comparative fit indices RNI and NFI showed better performance but were nevertheless below the advised cutoff values. Only SRMR, with a value of 0.02, indicated acceptance of the model unambiguously, in line with the fact that this index is based on the residuals alone.

The researchers went on to frame a correlation matrix, consisting again of two clusters, but with much lower inter-correlations between the variables.


Within the first "cluster" in particular the mean variable inter-correlation (its cohesion) was only .14, and the cohesion of the second cluster was .61, which is quite high but much lower than that of the original cluster. MLE CFA with a model for two related factors should yield the same residual matrix as above, and therefore the same SRMR, as indeed it did. The result: χ2 was no longer significant, thus indicating a good fit, in spite of the first cluster's unconvincing cohesion. The same held for the goodness-of-fit indices.

Browne, et al. (2002) concluded that χ2 and the fit indices based upon it measure detectability of misfit rather than misfit directly. In other words, the high statistical power of a test may easily lead to the rejection of a model, whereas a mediocre statistical power leads to the acceptance of the model (see also Steiger, 2000). In social science research, one is rarely confronted with this undesirable phenomenon because most measures are of moderate quality. Nevertheless, they may vary from rather reliable and homogeneous to mediocre. It would be ironic if only the latter would show good fit index values. Browne, et al. (2002) reasoned that this discrepancy is caused by the fact that, due to the procedure in CFA, χ2 is affected not only by the residual matrix but also by the sample matrix. Thus, it can be influenced by the degree of unique variance in the observed variables. Because all GOF indices but SRMR are based on χ2, they will indicate a poor fit as well.

Hayduk, Pazderka-Robinson, Cummings, Levers, and Beres (2005) Against Browne, et al. (2002)

The data gathered in the study by Browne, et al. (2002) enjoyed a remarkable reanalysis by Hayduk, et al. (2005). These investigators had reason to believe that the two-factor model of Browne, et al. had not been correct and that two "progressively interfering" factors should have been included, which were thought to affect the respective two clusters of measurements. This alternative model appeared to fit the data very well, such that χ2 was far from significant in spite of the minute unique variance. Eliminating one of the progressively interfering factors from the model spoiled the results. The same held for a few other variations of the model.

From the section "A Spurious Influence of the Number of Factors and Variables," we can learn that a greater number of factors in proportion to the number of observed variables promotes the acceptance of incorrect models. So improving fit by introducing two additional factors into the model does, in itself, not prove the superiority of the χ2 test. The added factors should first make sense theoretically. I am not the one to judge that in the present case.

Mulaik (2010) proposed a further refinement of the model, which involved a correlation of .40 between each of the progressively interfering factors and the clusters of measures they were supposed to affect. This refinement did not improve the already good fit but was theoretically more plausible.

Saris, Satorra, and van der Veld (2009)

Saris, et al. (2009) reported a similar study: they devised population data for a one-factor model. Strictly speaking, the data were more precisely described by a two-factor model in which the correlation between the two factors was .95. This difference they considered trivial, so the one-factor model should have been deemed acceptable. The authors drew an imaginary, perfectly representative sample (N = 400) from this population and calculated χ2, RMSEA, CFI, and SRMR. If the factor loadings were .85 or .90—which presupposes high reliability and low specific variance—then the values of χ2 and RMSEA led to the rejection of the one-factor model. Only CFI (in line with the expectations of Miles & Shevlin, 2007; see Section "Large Unique Variance Promotes an Illusory Good Fit") and SRMR (not being χ2 based) had values in favor of acceptance of the model. If the loadings were decreased to ≤ .80, then χ2 and RMSEA also indicated acceptability of the one-factor model.

Conclusion

The lesson to be learned from the foregoing is that one should not be too eager to resist indications of ill fit simply because the unique variance is particularly small. A serious search for an alternative model should always be undertaken (Hayduk, Cummings, Pazderka-Robinson, & Boulianne, 2007; McIntosh, 2007). That does not mean, however, that indications of ill fit should never be considered trivial. However, the latter cannot automatically be concluded from a very small unique variance; it must be supported by substantive arguments.

Large Unique Variance Promotes an Illusory Good Fit

A much more important lesson to be learned from Browne, et al. (2002), however, is that a non-significant χ2 and favorable goodness-of-fit index values cannot be trusted uncritically, because these may merely be promoted by low inter-variable correlations as a result of a large unique variance. Their findings do not stand alone. Some simulation studies have detected similar problems.

Miles and Shevlin (2007) reasoned that comparative fit indices might be more robust against the paradoxical influence of reliability because the former depend on the comparison of two χ2 values that should be approximately equally affected by the degree of unique variance in the observed variables. To test this, they devised a study in which they compared χ2, RMSEA, SRMR, and a number of comparative fit indices.


The authors set up a population corresponding to two related factors (correlation 0.3) with four high-loading indicators each and two minor factors, the one loading very low on two indicators of the one major factor and the other loading very low on two indicators of the other major factor. Miles and Shevlin (2007) "drew" an imaginary sample of N = 500, perfectly mirroring the population values. The tested model was a two-factor structure omitting the two minor factors. With a perfect reliability of 1.0, the disturbance by the minor factors was enough to have the model rejected by χ2. RMSEA indicated a doubtful fit, but the other fit indices indicated a good fit, as they were predicted to do. However, decreasing the reliability to a modest 0.8 resulted in χ2 and RMSEA both indicating a good fit, whereas the comparative indices and SRMR continued to indicate a good fit.

Miles and Shevlin (2007) performed a second study using this model, but left out the two minor factors while the correlation between the two factors was increased to 0.5. This time, the model tested was severely misspecified: a one-factor model. If the reliability was 0.8, χ2 indicated misfit, as it should, and so did all of the fit indices. However, when the reliability was decreased to a meager 0.5, χ2 suddenly indicated a good fit and so did RMSEA and—in spite of not being χ2 based—SRMR. It was hoped that the comparative fit indices would be robust against the spurious influence of reliability on χ2, but only two of them were—the normed fit index (NFI; Bentler & Bonett, 1980) and the relative fit index (RFI; Bollen, 1986). Three of them did not indicate misfit: CFI, TLI, and the incremental fit index (IFI; Bollen, 1989).

Three Degrees of Unique Variance and Their Effect on Goodness of Fit

Stuive (2007; see also Stuive, Kiers, Timmermans, & ten Berge, 2008) investigated unique variance systematically as an independent variable.

Stuive's study was performed with continuous data, simulating a 12-item questionnaire with three subtests of four items each. The misspecification here was an incorrect assignment of one or more items to subtests (three levels; note that this is a much more serious prediction error than misspecified cross-loadings). In addition, 10% "minor model error" was introduced in one-third of the data and 20% in another one-third of the data, following the argument of MacCallum (2003) that 0% model error can never be realized in studies with real data. This "minor model error" consisted of the effect of nine unmodeled factors.

The independent variables were: (1) unique variance (25, 49, or 81%), (2) amount of minor model error, and (3) correlations between the three factors (.0, .3, or .7). Variables 1 and 3 were parameters to be estimated, not fixed; Variable 2 was not part of the model. The sample size was another independent variable: 50, 100, 200, 400, and 1,000 cases. For each combination of conditions, 50 samples were drawn.

The dependent variables were the percentages of (a) accepted correct assignments and (b) rejected incorrect assignments. Acceptance of the assignments depended on the p value (> .05) of the MLE χ2, as well as on different cutoff values for three fit indices and combinations thereof.

Figure 5.2 in Stuive (2007) shows that introducing 10% model error had a very strong effect on χ2 (p > .05), resulting in a decrease in the acceptance rate of correct models from approximately 92% to approximately 58%, over all unique variance conditions. Introducing 20% model error resulted in an acceptance rate of only 40%. If one finds 10% and 20% model error too large to be considered "minor," then these results are in favor of using χ2 (p > .05), not against it. In spite of these strong effects, Stuive and her team joined together the cases with 0, 10, and 20% model error in order to study what the three degrees of unique variance did to the statistical power of CFA.

Rejection of Correct Models (Type II Error)

The samples N = 50 and N = 100 will be ignored in this section and in Section "Acceptance of Incorrect Models (Type I Error)" because they caused too many erroneous judgments in the cases of moderate and high unique variance. What was the effect of the degree of unique variance on the rejection of correct models (Fig. 5.1 in Stuive, et al., 2008)?

(a) If the unique variance was 25%, approximately 70% of the correct models were rejected⁷ based on χ2, p > .05. Of the fit indices, RMSEA and CFI needed their most lenient cutoff values (0.10 and 0.94, respectively) to reduce the rejection of correct models to approximately 10%. SRMR did not seem affected by a small unique variance: 0% of the correct models were rejected for N ≥ 200, even with the strict SRMR ≤ .06.

(b) In the case of 49% unique variance, the rejection rate of correct models based on χ2 dropped only slightly below 70% for N = 1,000. For N = 400 it decreased to approximately 50%, a score not above chance level. RMSEA and CFI, on the contrary, demanded little or no rejection of correct models, provided N ≥ 400. SRMR did well for N ≥ 200.


In all, a higher unique variance promoted the acceptance of correct models (decreasing Type II error). But did it also promote the acceptance of incorrect models (increasing Type I error)?

Acceptance of Incorrect Models (Type I Error)
The results are inferred from Stuive (2007, Fig. 5.4) and Stuive, et al. (2008, Fig. 2).

(a) With 25% unique variance, the rejection rate was approximately 100%—justly so—for χ2, RMSEA, SRMR, and CFI, even with the most lenient cutoff values.

(b) With 49% unique variance, the cutoff value for RMSEA had to be somewhat stricter (reject the model if RMSEA > .08) to reach approximately 90% rejection with N = 400 and N = 1,000. The same held for SRMR (reject the model if SRMR > .08). CFI and χ2 were still performing perfectly.

(c) However, with 81% unique variance, only the strictest cutoff value of RMSEA (reject the model if RMSEA > .03) led to an acceptable rejection rate (95% for N = 1,000 and 85% for N = 400). Remember, using this strict value led to the rejection of “correct” models as well in the case of unique variances of 25 and 49%. SRMR performed even worse: even the strictest cutoff value (reject the model if SRMR > .06) led to acceptance of (almost) all incorrect models, except for the small sample sizes (sic). CFI, on the contrary, continued to show good rejection rates, at least for N = 400 and especially N = 1,000; this is in line with its greater robustness against a high unique variance, as assumed by Miles and Shevlin (2007). As to χ2, for N = 1,000 it accomplished 100% rejection, as it should have; for N = 400 it yielded 85% rejection, while for N ≤ 200 it rejected only 50% or less of the incorrect models.

To summarize: in line with the findings and reasoning reviewed in the sections “High Reliability (Small Unique Variance) Spoils the Fit” and “Large Unique Variance Promotes an Illusory Good Fit,” MLE χ2 (p > .05), in the case of small unique variance, demanded too much rejection of correct models (only if the data contained minor model error), whereas in the case of large unique variance it allowed too much acceptance of incorrect models. Being χ2-based in a very direct manner, the same held for RMSEA, albeit to a much more attenuated degree. CFI was rather robust against the undesirable influence of unique variance. SRMR, however, appeared to lose its power to detect incorrect models as a consequence of high unique variance (81%, not 49%), in spite of not being χ2-based, even more so than RMSEA. This also put into perspective the favorable acceptance rates of SRMR regarding correct models, mentioned in the study of Browne, et al. (2002).

Large Unique Variance Combined With High Factor Correlation Promotes Type I Error
The results of Stuive (2007) with respect to incorrect models were reasonable when the unique variance was 49%, especially for CFI. However, this only held when the factors correlated 0.3 or 0.0, not when the factors correlated 0.7. Then, a large unique variance caused the fit indices, including CFI, to lose their power to reject incorrect models altogether (see Fig. 5.5 in Stuive, 2007). Let us have a more detailed look.

(a) With 25% unique variance, χ2 and RMSEA continued to reject incorrect models perfectly. SRMR rejected incorrect models perfectly only with the strictest cutoff value (reject the model if SRMR > .06), but was below the chance level with more lenient cutoff values. CFI, too, required the strictest cutoff value (reject the model if CFI < .96) to reject enough cases (approximately 90% with N = 400 and N = 1,000).

(b) With 49% unique variance, only χ2 for N ≥ 200 led to (almost) perfect rejection of incorrect models. RMSEA then required the stricter cutoff value of .06, in combination with N ≥ 400, to reach 95–100% rejection (when the factor correlations were 0.0 or 0.3, the cutoff value of .08 sufficed). The rejection rates for SRMR and CFI, however, were below chance.

(c) With 81% unique variance, χ2 with N = 1,000 sank to 85% rejection, and fell below chance level for smaller sample sizes. RMSEA, SRMR, and CFI were no longer suited to reject incorrect models (almost 0% for the larger sample sizes). Remember, on top of the incorrect item assignments there was the minor model error in 67% of the models, which affected χ2 and RMSEA considerably if the factor correlation was still ≤ .3.

In sum, from Stuive's (2007) simulation study (the sections “Three Degrees of Unique Variance and Their Effect on Goodness of Fit” and “Large Unique Variance Combined With High Factor Correlation Promotes Type I Error”) and from the studies discussed in the section “Large Unique Variance Promotes an Illusory Good Fit,” it can be concluded that the more “messy” the empirical factor structure/pattern is, the better the fit suggested by SRMR and CFI, and to a lesser degree by χ2 and RMSEA, even for the rather severely misspecified models that were typical of her study. Goodness of fit is increasingly over-estimated with higher unique variance, especially in combination with an unmodeled high factor correlation, thereby promoting Type I error (false positives).
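The decision rules referred to in these results can be written out explicitly. The following sketch is my own illustration (the function name, the dictionary keys, and the default cutoff values are assumptions made for the example, not prescriptions taken from the studies reviewed here):

    # Illustrative only: apply simple reject/retain rules to a dict of fit results.
    def evaluate_fit(fit, rmsea_cutoff=0.08, srmr_cutoff=0.08,
                     cfi_cutoff=0.96, alpha=0.05):
        return {
            "chi2":  "reject" if fit["chi2_p"] < alpha        else "retain",
            "rmsea": "reject" if fit["rmsea"]  > rmsea_cutoff else "retain",
            "srmr":  "reject" if fit["srmr"]   > srmr_cutoff  else "retain",
            "cfi":   "reject" if fit["cfi"]    < cfi_cutoff   else "retain",
        }

    # A model that the GOF indices would retain but the chi-square test rejects:
    print(evaluate_fit({"chi2_p": 0.01, "rmsea": 0.05, "srmr": 0.04, "cfi": 0.97}))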


The utmost consequence of this undesirable relation between unique variance and goodness of fit was drawn by Marsh, Hau, and Wen (2004), and repeated in Marsh, et al. (2005, p. 318), where the authors stated, “… assume a population-generating model in which all measured variables were nearly uncorrelated. Almost any hypothesized model would be able to fit these data because most of the variance is in the measured variable uniqueness terms, and there is almost no co-variation to explain. In a nonsensical sense, a priori models positing one, two, three, or more factors would all be able to ‘explain the data’ (as, indeed, would a ‘null’ model with no factors). The problem with… this apparently good fit would be obvious in an inspection of… all factor loadings [which, then, all turned out to be] close to zero.”

Explanatory Suggestion
An explanation of the model-accepting effect of a “messy” factor structure/pattern is beyond the scope of this paper. However, I wonder whether the problems might be due to the element of comparing two complete correlation matrices. (a) If all correlations are relatively small (large unique variance), then the residuals will also tend to be small. This will artificially decrease SRMR, and to a lesser degree χ2 and RMSEA, obscuring real misfit. (b) If the correlations between the indicators per factor are high, then their residuals will also tend to be high, obscuring real fit. For CFI and TLI, however, the opposite may hold because of a larger distance of the tested model from the null model (see the argument of Kenny, 2012, and Rigdon, 1996, mentioned in Section 5, last paragraph). (c) For item clusters, the following holds: if the clusters are almost independent, then the items within each cluster correlate much more highly with each other than with the items from the other clusters. On the other hand, if the clusters correlate highly, then the correlations of the items within a cluster will not be much higher than those of the items between clusters. Consequently, the residuals of incorrectly assigned items would be smaller in the case of highly correlating factors than in the case of independent factors. That would explain why a high factor correlation spoils the detection of misspecified models, as reported in this section (and promotes the acceptance of correct models: the other side of the coin!). For the reproduced correlation matrices in factor analysis of oblique structures/patterns this may hold to an attenuated degree, but the effect will not completely be flattened out.
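Suggestion (a) can be illustrated numerically. The following toy calculation is my own (the correlation values are made up and the srmr function is a simplified version for correlation matrices): two 2-item clusters are modeled as independent while the data contain cross-cluster correlations; when all correlations are scaled down, the residuals, and with them SRMR, shrink even though the kind of misfit is the same.

    import numpy as np

    def srmr(sample_corr, implied_corr):
        """Simplified SRMR for correlation matrices: root mean square of the
        residuals over the lower triangle (diagonal included)."""
        idx = np.tril_indices(sample_corr.shape[0])
        resid = sample_corr[idx] - implied_corr[idx]
        return float(np.sqrt(np.mean(resid ** 2)))

    def toy_matrices(within, cross):
        """Two 2-item clusters; the model ignores the cross-cluster correlations."""
        sample = np.array([[1.0,    within, cross,  cross],
                           [within, 1.0,    cross,  cross],
                           [cross,  cross,  1.0,    within],
                           [cross,  cross,  within, 1.0]])
        implied = sample.copy()
        implied[:2, 2:] = 0.0
        implied[2:, :2] = 0.0
        return sample, implied

    # The same kind of misfit, but with small correlations overall (large unique
    # variance) the residuals, and hence SRMR, shrink as well:
    print(round(srmr(*toy_matrices(0.6, 0.3)), 2))   # 0.19
    print(round(srmr(*toy_matrices(0.2, 0.1)), 2))   # 0.06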
A Spurious Influence of the Number of Factors and Variables
Breivik and Olsson (2001) reasoned and observed that RMSEA tends to favor models that include more variables and constructs over models that are simpler, due to the parsimony adjustment built in (dividing by df). These findings are in line with those of Meade (2008). Meade investigated the suitability of various fit indices to detect an unmodeled factor present in the data. This author observed that RMSEA ≤ .06 was strict enough to detect the unmodeled factor only if the factors consisted of four items. However, if the factors consisted of eight items each, ≤ .06 appeared too liberal; in other words, the calculated RMSEA had become smaller than .06 (Fig. 9 in his article).

The same problem seems to hold for SRMR (Anderson & Gerbing, 1984; Breivik & Olsson, 2001; Kenny & McCoach, 2003). Fan and Sivo (2007) observed that SRMR performed much better for smaller models; in the larger models, the value of SRMR (not reproduced by the authors) was too small to reject misspecified models (their Table 3 vs their Table 2).

Explanatory Suggestion
Why do more factors relative to the number of variables result in a smaller SRMR? A full explanation is beyond the scope of this paper, but again a cautious suggestion can be made. For my argument, have a look at the Appendix. Four correlation matrices of 12 variables each are printed. Each matrix is ordered in such a way that the clusters of correlations, corresponding to the factors, can easily be detected. If there are two clusters with six indicators each, there are 42 within-clusters inter-item correlations (communalities included) against 36 between-clusters inter-item correlations. If there are three clusters of four items each, as in Stuive's (2007) study, then there are 30 within-clusters inter-item correlations against 48 between-clusters inter-item correlations. If there are four clusters of three variables each, then there are 24 within-clusters inter-item correlations against 54 between-clusters inter-item correlations. Finally, if there are six clusters of two items each, then there are 18 within-clusters inter-item correlations against 60 between-clusters inter-item correlations.

So, the more item clusters in relation to the total number of variables (except in the case of two clusters), the more the between-clusters inter-item correlations outnumber the within-clusters inter-item correlations. In the case of an equal number of variables per cluster, this ratio follows the formula

w / b = (v + f) / [v · (f − 1)],

in which w = the number of within-clusters residuals, b = the number of between-clusters residuals, v = the number of variables, and f = the number of variable clusters (factors).
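The counts above, and the formula, can be checked by brute force for the four Appendix matrices. The short sketch below is added purely for illustration (the function name is mine):

    # Brute-force check of the w/b ratio for the Appendix matrices (v = 12).
    def within_between(v, f):
        k = v // f                        # variables per cluster (equal sizes assumed)
        w = f * k * (k + 1) // 2          # within-cluster entries, communalities included
        b = v * (v + 1) // 2 - w          # remaining entries of the lower triangle
        return w, b

    v = 12
    for f in (2, 3, 4, 6):
        w, b = within_between(v, f)
        print(f, w, b, round(w / b, 3), round((v + f) / (v * (f - 1)), 3))
    # 2 42 36 1.167 1.167
    # 3 30 48 0.625 0.625
    # 4 24 54 0.444 0.444
    # 6 18 60 0.3 0.3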


Unless the factors are severely misspecified, the absolute values of the correlations that are not part of the factors will generally be much lower than those of the within-factors correlations, even if the factors inter-correlate moderately high; this is so by definition. It holds for both the empirical and the implied matrix.

The difference between the empirical and the implied correlation will mostly be smaller for variable pairs with low correlations than for variable pairs with high correlations. So the residuals of the between-factors inter-item correlations will generally be smaller than the residuals of the within-factors inter-item correlations (unless the factor correlations have been fixed at 0; see Fan & Sivo, 2005, also discussed in the section “Determination of Cutoff Values by Simulation Studies”).

In the case of many factors with few indicators each, these small between-factors inter-item residuals will outnumber the somewhat higher within-factors residuals. The average of all residuals together will then be smaller too. And this will result in a smaller value for SRMR, and—with some attenuation—a smaller RMSEA and χ2 as well. This suggests a better fit, even if the model is more or less misspecified.

CFA and Its Limited Feedback on Test Item Level
A limitation of CFA of an altogether different kind has to do with the feedback on the level of individual items. CFA is meant as a test of the measuring instrument as a whole: if χ2 and/or the goodness-of-fit indices assume unfavorable values, then the prediction could be considered refuted. However, such an all-or-nothing decision is not the only or main thing in which the researcher is interested, especially not in the early and intermediate phases of his investigation. Then the preference is for more detailed feedback, which enables him to improve the theory or the test, or both (Anderson & Gerbing, 1988). Such feedback would be provided by a factor structure/pattern with factors corresponding in number and nature to the predicted factor structure, yet completely in accordance with the empirical correlation matrix. The output of CFA, however, contains a factor structure/pattern that is strongly affected by the theoretical model behind the test, quite different from the structure/pattern resulting from exploratory factor analysis with oblique rotation: secondary factor loadings (beta-weights) are lacking because they were fixed at zero in the prediction, except where cross-loadings had been predicted (which, necessarily, concerns only a few indicators). So the researcher can see which items have a suspiciously low primary loading (standardized beta-coefficient), but not whether they should have been reallocated to another factor (and, if so, which one), or whether they should be considered cross-loading.

The output also includes modification indices per parameter. These indices show to what extent χ2—and/or one or more of the goodness-of-fit indices—could adopt a better value (because of smaller residuals) by reallocating items to another factor, or by assuming cross-loadings or different factor correlations. The researcher could “free” the fixed parameters with the poorest values and have them estimated by the iterative procedure used before, or fix the free parameters with the poorest values. However, after such a modification, the whole picture of modification indices (and primary factor loadings of the items) will change, so the picture of modification indices in the initial stages cannot be relied upon to give accurate information on all items (parameters) simultaneously.

A further disadvantage is that CFA requires rather large samples to be reliable, especially in the case of a large number of variables; 5–10 observations per variable is often the advice. For the latter reason, scores of subtests or of item “parcels” are often preferred over single items to obtain reliable results. The individual items, however, are the very thing the cluster predictor or test deviser is interested in, and whether these can be combined into parcels has yet to be proven.8 In addition, especially in the field of psychopathology, large samples are rarely realizable. Most of the time one has to content oneself with “convenience samples” of moderate size, i.e., samples selected non-randomly from a “poorly-defined super-population” (Berk & Freedman, 2003).

8 According to Marsh, Lüdtke, Nagengast, Morin, & Von Davier (2013), this practice is undesirable in any case.

Factor analysis was invented to deal with a great number of correlating variables, producing massive covariance matrices. SEM was invented to test structural models with a limited number of theoretical constructs and a few observed variables, which, together, produce modest empirical and implied covariance matrices and whose residual matrices can be overviewed easily. For structural models, it makes sense to have a few measures that summarize the fit of both matrices. For tests with a great number of items, this makes much less sense. Perhaps it was not a good idea in the first place to use SEM for confirmatory factor analysis as well.

Discussion
Even if it would also make sense to have indices of goodness of fit in the case of a great number of observed variables, those that are produced by SEM are plagued by various problems:

1. χ2 may indicate misfit, especially in tests of high statistical power, simply because the sample is large. On the other hand, SEM requires large samples to do its job properly.

2. The goodness-of-fit indices do require large samples to discern good from poor predictions, in line with what SEM needs, but large samples are often out of reach in many psychological studies, especially in the field of clinical psychology.

3. When the reliability of a test is high and the specific variance of the variables is small (betraying itself in high correlations between the variables within a cluster), χ2 and RMSEA lead to the rejection of well-predicted models if there is minor model error. Minor model error, however, is unavoidable under real-world conditions. CFI is somewhat more robust against this effect.


4. On the other hand, when the reliability of a test is low, χ2 and RMSEA lead to the acceptance of poorly predicted models (the comparative fit indices somewhat less so).

5. When the unique variance of the test variables is high (high specific variance and/or low reliability), then χ2 does not detect incorrect models when N ≤ 200. RMSEA does not detect misfit under any N. SRMR can no longer discern good fit from misfit. CFI requires a strict cutoff value and N ≥ 400.

6. When the factors have a high correlation (.70) while the unique variance is 49%, SRMR and CFI are no longer suited to reject incorrect models. If the unique variance rises to 81%, none of the fit indices is suited to reject incorrect models, except χ2 with N = 1,000.

7. There are also indications that increasing the number of factors relative to the number of variables spoils the power of RMSEA and SRMR to detect incorrect models.

So, on the basis of this selective review, reinforced by the brief theoretical suggestions in the sections “Large Unique Variance Combined With High Factor Correlation Promotes Type I Error” and “A Spurious Influence of the Number of Factors and Variables,” it must be concluded that, for a significance test of a model, χ2 and the GOF indices are too unreliable, and that, for an estimation of the approximation discrepancy, the GOF indices are too inaccurate. As early as 2007, such problems made Barrett call for their abandonment: “I would recommend banning all such indices from ever appearing in any paper as indicative of ‘model acceptability’ or ‘degree of misfit.’” (Barrett, 2007, p. 821). In addition, for feedback on the level of individual items, CFA is not ideal.

Limitations
Part of the findings by Stuive and associates has to be put into perspective. It was interesting to see what happened to χ2 and the fit indices in the case of 81% unique variance, but such a high unique variance in a 3 × 4 item test implies a Cronbach's α of .50, as Stuive, Kiers, and Timmermans (2009) admit. Such tests will probably not be used in practice. The other extreme, a questionnaire with a unique variance of only 25%, will probably also be rare in psychology research; around 50% is more likely. But it is then that CFA yields its most practical results. So the skeptical conclusions above should not be taken to imply that all studies that have applied CFA are of questionable value. However, a critical attitude to such studies is certainly warranted.
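The value of α ≈ .50 can be checked with a back-of-the-envelope calculation (my own, under the assumption of equal standardized loadings, so that 19% common variance per item corresponds to an average within-subtest inter-item correlation of about .19):

    # Rough check: 81% unique variance per item leaves 19% common variance,
    # hence an inter-item correlation of roughly .19 under equal loadings.
    k, r = 4, 0.19                         # 4 items per subtest
    alpha = k * r / (1 + (k - 1) * r)      # standardized Cronbach's alpha from the mean r
    print(round(alpha, 2))                 # 0.48, close to the .50 mentioned above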
Another limitation is that some of the criticisms may pertain more to the way CFA is applied by many researchers than to CFA as such: in the first place, leaving too many parameters free to estimate (for then a good fit will be readily attained but does not say very much), or, conversely, fixing too many (e.g., not including cross-correlations or correlated error terms in the model). A more complete and precise measurement model prediction will probably yield more reliable and meaningful χ2 and goodness-of-fit index values.

The criticisms raised in this article hold especially for CFA of tests with many items, not for SEM of structural models with only a few—both valid and reliable—indicators per factor in the measurement part. Besides, the objection of limited feedback at the item level becomes less relevant in an advanced stage of a research project, when the test has gained good psychometric properties. Then, a global evaluation of approximate fit (GOF) and estimation fit (statistical significance) might be all the researcher needs.

Future Directions
What could be an alternative to, or an improvement of, CFA in the case of testing instruments with many variables, when one is still in the early and intermediate stages of a research project?

Exploratory Structural Equation Modeling
In 2009, Asparouhov and Muthén introduced a variant of CFA called exploratory structural equation modeling (ESEM). This approach was intended to overcome the earlier mentioned limitations of CFA. In this method, in addition to a CFA measurement model, an EFA measurement model with rotations can be used within a structural equation model. There is no longer a need to fix the loadings of the non-indicators per factor at 0. They can be freely estimated, and they are reproduced in the output. So there are secondary factor loadings.

The superiority of ESEM to CFA in cases of more complex measurement models has already been demonstrated in studies like Marsh, Lüdtke, Muthén, Asparouhov, Morin, Trautwein, et al. (2010), Furnham, Guenole, Levine, and Chamorro-Premuzic (2013), Booth and Hughes (2014), and Guay, Morin, Litalien, and Vallerand (2014). Also, ESEM could further be used for a data-driven model modification, with more realistic and informative results than CFA.

Data-driven Optimization of Predicted Clusters: A Fruitful Approach to Goodness of Fit
What would be the advantage of such a “data-driven model modification”?


That would result in factors that are congruent with the predicted ones on the one hand, but that are in good agreement with the empirical correlation or covariance matrix on the other: predicted indicators that appear to have an insufficient factor loading would no longer be considered indicators, while variables that have an unexpectedly high loading on a factor would now count as indicators. These modifications often imply that the clusters formed by the redistributed factor indicators have a better cohesion (average inter-variable correlation) than the corresponding predicted ones at the start.
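As a loose illustration of such a redistribution of items (this is not the author's own procedure, which is explained in the article referred to in Footnote 9; the threshold and the loadings below are invented), each item could be reassigned to the factor on which it loads highest and dropped when none of its loadings is substantial:

    import numpy as np

    def optimize_clusters(loadings, item_names, threshold=0.30):
        """loadings: items x factors array of (rotated) factor loadings."""
        clusters = {f: [] for f in range(loadings.shape[1])}
        for i, row in enumerate(np.abs(loadings)):
            best = int(np.argmax(row))
            if row[best] >= threshold:        # otherwise the item is not assigned at all
                clusters[best].append(item_names[i])
        return clusters

    # Hypothetical loadings of six items on two factors.
    L = np.array([[0.62, 0.10], [0.55, 0.21], [0.18, 0.12],
                  [0.08, 0.59], [0.25, 0.48], [0.44, 0.33]])
    print(optimize_clusters(L, [f"item{i+1}" for i in range(6)]))
    # {0: ['item1', 'item2', 'item6'], 1: ['item4', 'item5']}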
Such a result would be comparable to that of item analysis, as explained in the section “Methods for Testing a Predicted Item Clustering,” but this time continued to the point at which the clustering neatly fits the empirical correlation matrix. If performed with ESEM, the result will be based on common factor analysis, which may be preferred by statisticians. Model re-specification with CFA based on the modification indices is also a possibility (Stuive, et al., 2009), but ESEM will probably lead to better results.

Whatever the method applied, such a final factor structure/pattern (or cluster structure, as the case may be) will capitalize heavily on chance and other sources of error, as MacCallum, Roznowski, and Necowitz (1992) warned. However, that would only be a problem if the revised factor structure were adopted unconditionally as the better one, but that is not what it should be used for. Having a predicted factor structure on the one hand and an optimized factor structure on the other, the latter being continuous with the predicted one but in good agreement with the empirical correlation matrix at the same time, puts the researcher in a position for a detailed comparison of the two structures. And this generates detailed feedback at the item level as well as a basis for a new kind of goodness-of-fit index. This will be explained briefly.

Indicators that are shared by both the predicted factor and its corresponding optimized factor can be considered correct positives, in other words, hits (H items). Indicators that had to be removed from the predicted factor to arrive at its corresponding optimized version can be considered false positives (F items). Indicators that had been missed in the prediction of the factor corresponding with its optimized version can be considered false negatives (M items). The numbers of H items, F items, and M items can now be combined in a simple formula:

AP(it) = H / [(H + F) · (H + M)] = H / (Cp · Cf).

Here, AP(it) stands for accuracy of prediction in terms of the number of items, H for the number of hits, F for the number of F items, M for the number of M items, Cp for the size of the predicted cluster, and Cf for the size of the final (optimized) cluster. A correction for “correct prediction by chance” needs to be added to this formula. Then, it will yield values between 1 (perfect fit) and 0 (no fit at all), or even below 0 (fit is worse than chance level).
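For illustration, the bookkeeping behind this formula can be sketched as follows; the item sets are hypothetical, and the chance correction just mentioned is not included:

    # Illustrative only: H, F, M, Cp, Cf, and AP(it) for one factor, given a
    # hypothetical predicted cluster and its optimized counterpart.
    predicted = {"item1", "item2", "item3", "item4"}
    optimized = {"item1", "item2", "item3", "item7"}

    H = len(predicted & optimized)        # hits
    F = len(predicted - optimized)        # false positives (removed from the prediction)
    M = len(optimized - predicted)        # false negatives (missed in the prediction)
    Cp, Cf = H + F, H + M                 # sizes of the predicted and optimized clusters

    AP_it = H / (Cp * Cf)                 # the formula as given above, without chance correction
    print(H, F, M, AP_it)                 # 3 1 1 0.1875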
The procedure above implies that a weight of 1 is attached to each hit and to each F and M item. In the case of substantial cross-correlations, especially if these had been expected, such a procedure may generate goodness-of-fit values that are more mediocre than warranted. This is the case because the reassignment of an item in spite of its relatively high loading on its predicted factor (so an F item) should be considered a relatively small error. In the same vein, the reassignment of an item that has only a mediocre loading on its new factor (so an M item) is a smaller error than a high-loading M item. This imperfection could be partly corrected by attaching a weight < 1 to the errors, a weight depending on the factor loading of the item concerned.9

9 In November 2014, I submitted an article explaining this approach and the two formulas in detail (Prudon, submitted); it is still under review.

The proposed alternative approach to goodness of fit produces a fit value per factor, which can be combined into a fit value over all factors. Predicted factor correlations, if there were any, will not affect this goodness-of-fit index. To test such predictions would require a separate test and, if deemed useful, a separate index.

Next, the content of the F and M items should be examined in relation to the content of the H items, to see whether these errors warrant a minor or major revision of the theory, or whether the items could be considered poor representations of the phenomena they were supposed to cover. In the initial and intermediate stages of a research project, this will probably result in a revised factor prediction which is in better, but not complete, accordance with the optimized factor structure. Remember, the latter has capitalized on various sources of error. The criterion for modifying one's prediction is whether the changes make better sense theoretically than the original prediction does on second thought, not what the optimized factor structure tells in all detail. However, Steiger (1990, p. 175) has shown himself a believer in the researcher's ability to find whatever theoretical justification is needed, which is merely a flattering way of saying that he is skeptical of the value of such theoretical justifications; but then, there is always the scientific community to criticize sloppy arguments.

If not in complete accordance, then the optimizing program may be run again and will yield goodness-of-fit values better than those of the unrevised factors, but still below 1. Of course, it would need a new sample (preferably more than one) for a more conclusive test.


References
Andersen, B. L., Farrar, W. B., Golden-Kreutz, D., Katz, L. A., MacCallum, R. C., Courtney, M. E., & Glaser, R. (1998) Stress and immune response after surgical treatment for regional breast cancer. Journal of the National Cancer Institute, 90, 30-36.
Anderson, J. C., & Gerbing, D. W. (1984) The effect of sampling error on convergence, improper solutions, and goodness-of-fit indices for maximum likelihood confirmatory factor analysis. Psychometrika, 49, 155-173.
Anderson, J. C., & Gerbing, D. W. (1988) Structural equation modeling in practice: a review and recommended two-step approach. Psychological Bulletin, 103, 411-423.
Arbuckle, J. L. (2004) Amos 5.0. Chicago, IL: SPSS. [Computer software]
Barrett, P. (2007) Structural equation modeling: adjudging model fit. Personality and Individual Differences, 42(5), 815-824.
Bentler, P. M. (1990) Comparative fit indexes in structural models. Psychological Bulletin, 107, 238-246.
Bentler, P. M. (2000–2008) EQS 6 structural equations program manual. Encino, CA: Multivariate Software, Inc.
Bentler, P. M., & Bonett, D. G. (1980) Significance tests and goodness-of-fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-600.
Berk, R. A., & Freedman, D. A. (2003) Statistical assumptions as empirical commitments. In T. G. Blomberg & S. Cohen (Eds.), Law, punishment, and social control: essays in honor of Sheldon Messinger. (2nd ed.) Aldine de Gruyter. Pp. 235-254 (Chap. 10).
Bollen, K. A. (1986) Sample size and Bentler & Bonett's nonnormed fit index. Psychometrika, 51, 375-377.
Bollen, K. A. (1989) A new incremental fit index for general structural equation models. Sociological Methods & Research, 17, 303-316.
Booth, T., & Hughes, D. J. (2014) Exploratory structural equation modeling of personality data. Assessment, 21(3), 260-271.
Breivik, E., & Olsson, U. H. (2001) Adding variables to improve fit: the effect of model size on fit assessment in LISREL. In R. Cudeck, K. G. Jöreskog, S. H. C. du Toit, & D. Sörbom (Eds.), Structural equation modeling: present and future: a festschrift in honor of Karl Jöreskog. Chicago, IL: Scientific Software. Pp. 169-194.
Brown, T. A. (2015) Confirmatory factor analysis for applied research. (2nd rev. ed.) New York: Guilford Press.
Brown, T. A., & Moore, M. T. (2012) Confirmatory factor analysis. In R. H. Hoyle (Ed.), Handbook of structural equation modeling. New York: Guilford Press. Pp. 361-379.
Browne, M. W., MacCallum, R. C., Kim, C-T., Andersen, B. L., & Glaser, R. (2002) When fit indices and residuals are incompatible. Psychological Methods, 7, 403-421.
Chen, F., Curran, P. J., Bollen, K. A., Kirby, J., & Paxton, P. (2008) An empirical evaluation of the use of fixed cutoff points in RMSEA test statistic in structural equation models. Sociological Methods & Research, 36(4), 462-494.
Cudeck, R., & Henly, S. J. (1991) Model selection in covariance structure analysis and the “problem” of sample size: a clarification. Psychological Bulletin, 109, 512-519.
Fan, X., & Sivo, S. A. (2005) Sensitivity of fit indices to misspecified structural or measurement model components: rationale of two-index strategy revisited. Structural Equation Modeling, 12(3), 343-367.
Fan, X., & Sivo, S. A. (2007) Sensitivity of fit indices to model misspecification and model types. Multivariate Behavioral Research, 42(3), 509-529.
Furnham, A., Guenole, N., Levine, S. Z., & Chamorro-Premuzic, T. (2013) The NEO Personality Inventory–Revised: factor structure and gender invariance from exploratory structural equation modeling analyses in a high-stakes setting. Assessment, 20(1), 14-23.
Guay, F., Morin, A. J. S., Litalien, D., Valois, P., & Vallerand, R. J. (2014) Application of exploratory structural equation modeling to evaluate the Academic Motivation Scale. The Journal of Experimental Education, 83(1), 51-82. DOI: 10.1080/00220973.2013.876231
Hayduk, L. A., Cummings, G. C., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007) Testing! Testing! One, two, three: testing the theory in structural equation models! Personality and Individual Differences, 42(5), 841-850.
Hayduk, L. A., Pazderka-Robinson, H., Cummings, G. C., Levers, M.-J. D., & Beres, M. A. (2005) Structural equation model testing and the quality of natural killer cell activity measurements. BMC Medical Research Methodology, 5, 1.
Hu, L. T., & Bentler, P. M. (1998) Fit indices in covariance structure modeling: sensitivity to under-parameterized model misspecification. Psychological Methods, 3, 424-453.
Hu, L. T., & Bentler, P. M. (1999) Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Jöreskog, K. G., & Sörbom, D. (1974) LISREL III. Chicago, IL: Scientific Software International. [Computer software]
Jöreskog, K. G., & Sörbom, D. (1988) LISREL 7: guide to the program and applications. (2nd ed.) Chicago, IL: SPSS.
Kenny, D. A. (2012) Measuring model fit. Retrieved from http://www.davidakenny.net/cm/fit.htm.
Kenny, D. A., & McCoach, D. (2003) Effect of the number of variables on measures of fit in structural equation modeling. Structural Equation Modeling, 10, 333-351.
MacCallum, R. C. (2003) Working with imperfect models. Multivariate Behavioral Research, 38, 113-139.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992) Model modifications in covariance structure analysis: the problem of capitalization on chance. Psychological Bulletin, 111(3), 490-504.
Marsh, H. W., Hau, K-T., & Grayson, D. (2005) Goodness of fit in structural equation models. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics: a festschrift to Roderick P. McDonald. Mahwah, NJ: Erlbaum. Pp. 275-340 (Chap. 10).
Marsh, H. W., Hau, K-T., & Wen, Z. (2004) In search of golden rules: comment on hypothesis-testing approaches to setting cutoff values for fit indices and dangers in overgeneralizing Hu & Bentler's (1999) findings. Structural Equation Modeling, 11, 320-341.
Marsh, H. W., Lüdtke, O., Muthén, B., Asparouhov, T., Morin, A. J. S., Trautwein, U., & Nagengast, B. (2010) A new look at the Big Five factor structure through exploratory structural equation modeling. Psychological Assessment, 22(3), 471-491.
Marsh, H. W., Lüdtke, O., Nagengast, B., Morin, A. J. S., & Von Davier, M. (2013) Why item parcels are (almost) never appropriate: two wrongs do not make a right—camouflaging misspecification with item parcels in CFA models. Psychological Methods, 18(3), 257-284.
McDonald, R. P., & Marsh, H. W. (1990) Choosing a multivariate model: noncentrality and goodness-of-fit. Psychological Bulletin, 107, 247-255.
McIntosh, C. N. (2007) Rethinking fit assessment in structural equation modelling: a commentary and elaboration on Barrett (2007). Personality and Individual Differences, 42(5), 859-867.
Meade, A. W. (2008) Power of AFI's to detect CFA model misfit. Paper presented at the 23rd Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA, April 10-12. Retrieved from http://www4.ncsu.edu/~awmeade/Links/Papers/AFI(SIOP08).pdf.
Miles, J., & Shevlin, M. (2007) A time and a place for incremental fit indices. Personality and Individual Differences, 42(5), 869-874.
Mulaik, S. (2010) Another look at a cancer research model: theory and indeterminacy in the BMC model by Hayduk, et al. Paper presented at the Annual Meeting of the Society for Multivariate Experimental Psychology, Atlanta, GA, October 2010.
Muthén, L. K., & Muthén, B. O. (1998-2010) Mplus user's guide. (6th ed.) Los Angeles, CA: Muthén & Muthén.


Rasch, G. (1980) Probabilistic models for some intelligence and attainment tests. Chicago, IL: Univer. of Chicago Press.
Rigdon, E. E. (1996) CFI versus RMSEA: a comparison of two fit indexes for structural equation modeling. Structural Equation Modeling, 3(4), 369-379.
Rigdon, E. E. (1998) The equal correlation baseline model for comparative fit assessment in structural equation modeling. Structural Equation Modeling, 5(1), 63-77.
Saris, W. E., Satorra, A., & van der Veld, W. (2009) Testing structural equation models or detection of misspecifications? Structural Equation Modeling, 16, 561-582.
Steiger, J. H. (1990) Structural model evaluation and modification: an interval estimation approach. Multivariate Behavioral Research, 25, 173-180.
Steiger, J. H. (2000) Point estimation, hypothesis testing, and interval estimation using the RMSEA: some comments and a reply to Hayduk and Glaser. Structural Equation Modeling, 7, 149-162.
Stouthard, M. E. A. (2006) Analyse van tests. In W. P. van den Brink & G. J. Mellenbergh (Eds.), Testleer en testconstructie. Amsterdam, The Netherlands: Boom. Pp. 341-376.
Stuive, I. (2007) A comparison of confirmatory factor analysis methods: oblique multiple group method versus confirmatory common factor method. Unpublished dissertation, Univer. of Groningen, Groningen, The Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/305281992.
Stuive, I., Kiers, H. A. L., & Timmermans, M. (2009) Comparison of methods for adjusting incorrect assignments of items to subtests: oblique multiple group method versus confirmatory common factor method. Educational and Psychological Measurement, 69(6), 948-965.
Stuive, I., Kiers, H. A. L., Timmermans, M., & ten Berge, J. M. F. (2008) The empirical verification of an assignment of items to subtests: the Oblique Multiple Group method versus the confirmatory common factor method. Educational and Psychological Measurement, 68(6), 923-939.
Tomarken, A. J., & Waller, N. G. (2003) Potential problems with “well fitting” models. Journal of Abnormal Psychology, 112(4), 578-598.
Tucker, L. R., & Lewis, C. (1973) The reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.

APPENDIX
Number of Factors vs Variables (Number of Within-clusters vs Number of Between-clusters Correlations), Visualized

Correlation Matrices of 2, 3, 4, and 6 Clusters Within 12 Variables (Diagonal Consists of Communalities)

2 Clusters of 6 Variables
(i) 1 2 3 4 5 6 7 8 9 10 11 12
1 c
2 0.5 c
3 0.5 0.5 c
4 0.5 0.5 0.5 c
5 0.5 0.5 0.5 0.5 c
6 0.5 0.5 0.5 0.5 0.5 c
7 0 0 0 0 0 0 c
8 0 0 0 0 0 0 0.5 c
9 0 0 0 0 0 0 0.5 0.5 c
10 0 0 0 0 0 0 0.5 0.5 0.5 c
11 0 0 0 0 0 0 0.5 0.5 0.5 0.5 c
12 0 0 0 0 0 0 0.5 0.5 0.5 0.5 0.5 c
Note  2 clusters of 6 variables: 1 to 6, 7 to 12. 2 × 21 = 42 within-factors correlations (including communalities); 1 × 6² = 36 between-factors correlations. 42/36 = 1.17.


3 Clusters of 4 Variables
(ii) 1 2 3 4 5 6 7 8 9 10 11 12
1 c
2 0.5 c
3 0.5 0.5 c
4 0.5 0.5 0.5 c
5 0 0 0 0 c
6 0 0 0 0 0.5 c
7 0 0 0 0 0.5 0.5 c
8 0 0 0 0 0.5 0.5 0.5 c
9 0 0 0 0 0 0 0 0 c
10 0 0 0 0 0 0 0 0 0.5 c
11 0 0 0 0 0 0 0 0 0.5 0.5 c
12 0 0 0 0 0 0 0 0 0.5 0.5 0.5 c
Note  3 clusters of 4 variables: 1 to 4, 5 to 8, 9 to 12. 3 × 10 = 30 within-factors correlations (including communalities). 3 × 4² = 48 between-factors correlations. 30/48 = 0.63.

4 Clusters of 3 Variables
(iii) 1 2 3 4 5 6 7 8 9 10 11 12
1 c
2 0.5 c
3 0.5 0.5 c
4 0 0 0 c
5 0 0 0 0.5 c
6 0 0 0 0.5 0.5 c
7 0 0 0 0 0 0 c
8 0 0 0 0 0 0 0.5 c
9 0 0 0 0 0 0 0.5 0.5 c
10 0 0 0 0 0 0 0 0 0 c
11 0 0 0 0 0 0 0 0 0 0.5 c
12 0 0 0 0 0 0 0 0 0 0.5 0.5 c
Note  4 clusters of 3 variables: 1-2-3, 4-5-6, 7-8-9, 10-11-12. 4 × 6 = 24 within-factors correlations (including communalities). 6 × 3² = 54 between-factors correlations. 24/54 = 0.444.

6 Clusters of 2 Variables
(iv) 1 2 3 4 5 6 7 8 9 10 11 12
1 c
2 0.5 c
3 0 0 c
4 0 0 0.5 c
5 0 0 0 0 c
6 0 0 0 0 0.5 c
7 0 0 0 0 0 0 c
8 0 0 0 0 0 0 0.5 c
9 0 0 0 0 0 0 0 0 c
10 0 0 0 0 0 0 0 0 0.5 c
11 0 0 0 0 0 0 0 0 0 0 c
12 0 0 0 0 0 0 0 0 0 0 0.5 c
Note  6 clusters of 2 variables: 1-2, 3-4, 5-6, 7-8, 9-10, 11-12. 6 × 3 = 18 within-factors correlations (including communalities). 15 × 2² = 60 between-factors correlations. 18/60 = 0.30.
