
Personality and Individual Differences 42 (2007) 883–891


There is a place for approximate fit in structural equation modelling

S. Mulaik

Georgia Institute of Technology, School of Psychology, Atlanta, GA 30332, United States
E-mail address: stanmulaik@mindspring.com

Available online 19 January 2007

1. Introduction

I join with Barrett (2007) in urging researchers not to ignore the χ² test in favor of approximate fit indices in structural equation modeling. Nevertheless, my position is that both χ² tests of exact fit and approximate fit indices provide useful information for the researcher and complement one another. Both concern different features of a model, and both have significant limitations that must be understood. Thresholds of acceptable approximation also have a place in our methods, but these thresholds are not justified by empirical Monte Carlo studies. They are not set by probabilistic considerations but by considerations of acceptable relative distance or similarity.
Where I differ with Barrett begins with his assertion that ". . . four recent papers have cast doubt upon the continued utility of using indicative thresholds for approximate fit indices, essentially removing the notion that a single threshold-value can be applied to any particular approximate fit index under all measurement and data conditions (Beauducel & Wittman, 2005; Fan & Sivo, 2005; Marsh, Kit-Tai, & Wen, 2004; Yuan, 2005). Each paper demonstrated empirically that, under varying data conditions using known a priori model structures, single-valued indicative thresholds for approximate fit indices were impossible to set without some models being incorrectly identified as fitting 'acceptably' when in fact they were misspecified to some degree. Indeed, the main theme running through these papers was that fixed thresholds for approximate fit indices were simply not plausible" (p. 817).
My reading of these papers is a bit more nuanced.
The Marsh et al. (2004) paper was provoked by a Monte Carlo study by Hu and Bentler (1999) that investigated Type I and Type II error rates in applying approximate fit indices.

Hu and Bentler (1999) considered models (limited to three-factor confirmatory factor analysis models) that were "true" (consistent with the model used to generate the variance/covariance matrices) and models that were slightly misspecified relative to the model generating the data. They also looked at varying conditions, such as whether the models were simple (each indicator had free loadings on only one factor) or complex (a few indicators could have free loadings on more than one factor). They varied the form of the probability distribution generating the data and examined five different sample sizes from N = 150 to 5000. On the basis of their results they recommended ". . . that practitioners use a cutoff value close to .95 for TLI (BL89, RNI, CFI, or Gamma Hat) in combination with a cutoff value close to .09 for SRMR to evaluate model fit" (Hu & Bentler, 1999, p. 27). Models with values outside these cutoff values were to be deemed not to have acceptable approximate fit. For the RMSEA index they recommended rejecting models when "RMSEA > .06 and SRMR > .09 (or .10)" (Hu & Bentler, 1999, p. 28). They said these rules worked best when sample N ≥ 250.
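For readers who want to see the mechanics, the rule just summarized amounts to a simple joint decision on two or three indices. The sketch below is only an illustration of that combination rule as quoted here, not code from Hu and Bentler (1999); the function name and the example index values are my own.

```python
def hu_bentler_combination_rule(cfi, srmr, rmsea):
    """Illustrative decision rule following the cutoffs quoted above.

    Accept approximate fit only if an incremental index (here the CFI, standing
    in for TLI, BL89, RNI, or Gamma Hat) is near .95 or better with SRMR near
    .09 or less, and reject whenever RMSEA > .06 together with SRMR > .09.
    """
    incremental_ok = cfi >= 0.95 and srmr <= 0.09
    rmsea_reject = rmsea > 0.06 and srmr > 0.09
    return incremental_ok and not rmsea_reject


# Hypothetical index values for a fitted model:
print(hu_bentler_combination_rule(cfi=0.96, srmr=0.05, rmsea=0.04))  # True
```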
This summary of their work, I admit, oversimplifies their findings. But the summary here is what Marsh et al. (2004) believed most researchers extracted from the Hu and Bentler (1999) paper. The thrust of the Marsh et al. (2004) paper was to challenge treating the rules of Hu and Bentler (1999) as "golden rules or even recommended guidelines—of acceptable levels of fit that are a necessary basis for valid interpretations" (p. 324).
One of the problems with the study of Hu and Bentler (1999), according to Marsh et al. (2004), is that they did not have population values of the GOF (goodness of fit) indices for their models, meaning they had no objective criterion against which to compare the sample distributions. Furthermore, their studies of Type I and Type II error concerned acceptance and rejection of the true model generating the data for a given hypothesized model. But indices of approximation are just that, not indices of exact fit. One of the misspecified models, M1, that Hu and Bentler (1999) studied approached an acceptance rate of 100% in large samples, which they suggested was a limitation of the index. Marsh et al. (2004) argued that this is as it should have been, because when they used a sample of 500,000 cases to establish an 'estimate' of the population value of the GOF index for that model, they found that it fell well within the acceptance range. For both the RNI and the SRMR, they showed figures in which the sampling distributions of the GOF values for each model tended to be bell-shaped, with decreasing variances as the sample size increased. The RNI distribution seemed to be unbiased in being centered over the population RNI values while having decreasing variances in larger samples. In the case of the SRMR, the distribution modes seemed to shift toward zero while shrinking in variance in the much larger samples, with the 'true model' distribution shifting much more toward zero than those of the misspecified models. In the case of the RNI, the distribution of Model M0, the true model, was squeezed up next to 1.00, because this is the upper limit of the index.
Marsh et al. (2004) found it problematic that the cutoff criterion of .95 would reject models with population RNI values just at or slightly above the threshold of .95 (e.g., .951) nearly 50% of the time (analogous to a Type I error), with the probability of rejection dropping in increasingly larger samples. At the same time, some models that were just slightly below .95 in their population RNI would be accepted with almost the same probability as a model with a population RNI of exactly .95 (analogous to a Type II error). This "Type II" error is not remarkable, since it would take a very large sample to achieve sufficient power to reject a model with a population RNI near, but below, .95.
I find these results not unusual, considering that for some researchers the .95 value need not be set with sampling error in mind. See Carlson and Mulaik (1993), where, after rejecting their
models with the χ² test, they used .95 as a cutoff criterion for the RNI, simply because it seemed more rigorous and still easy to attain in their models, after judiciously using a theory-guided specification search in which a few fixed parameters were freed, implying only small losses in degrees of freedom. They also had parsimony ratios that were quite high.
Nevertheless, a remedy to 50% rejection rates for models whose population RNI is exactly .95 would be to set up a one-tailed 5% rejection region on the sampling distribution of the RNI index centered on the value of .95, so that one would accept the model 95% of the time if the sample RNI value were above the critical 5% value. But this would have the effect of increasing Type II errors to near 95% for models with population RNIs very close to .95. So, there is a trade-off here based on where you place the cutoff.
However, finding a one-tailed 5% rejection region may be easier said than done, since analytic theory is not currently available to provide such intervals. Still, it should be possible to use bootstrap computations to obtain variances of the sampling distributions of the RNI and from there compute a cutoff that is 1.64 standard deviations below .95 and get an approximate interval.
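A minimal sketch of such a bootstrap calculation appears below. It assumes the researcher already has a routine that fits the hypothesized model to a data matrix and returns the sample RNI; the function `fit_rni` is therefore a hypothetical placeholder, and the stand-in used in the example does not estimate anything.

```python
import numpy as np

def bootstrap_rni_cutoff(data, fit_rni, n_boot=500, target=0.95, z=1.64, seed=0):
    """Approximate a one-tailed 5% cutoff below `target` for the sample RNI.

    `fit_rni` is a user-supplied callable (hypothetical here) that refits the
    hypothesized model to a data matrix and returns the RNI. The cutoff is
    `target` minus 1.64 bootstrap standard deviations, as suggested above.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    rni_values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample cases with replacement
        rni_values.append(fit_rni(data[idx]))   # refit the model, record the RNI
    sd = np.std(rni_values, ddof=1)             # bootstrap SD of the RNI
    return target - z * sd

# Stand-in for a real SEM fit, used only so the sketch runs end to end:
rng = np.random.default_rng(1)
fake_fit = lambda d: float(np.clip(0.95 + 0.01 * rng.standard_normal(), 0.0, 1.0))
print(bootstrap_rni_cutoff(rng.normal(size=(300, 9)), fake_fit))
```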
Another approach is to look at the problem differently. For the RNI, the distributions appear to be bell-shaped, centered over their population values, and to have nearly homogeneous variances at a given sample size and degrees of freedom (Marsh et al., 2004). This suggests that we may look on the cutoff as having a characteristic curve, as in IRT modeling. I illustrate in Fig. 1, just for argument's sake, what a characteristic curve would look like in this case. (These are not real characteristic curves for the RNI, just illustrative ones.) I show characteristic curves for two cases, (a) low discrimination and (b) high discrimination. Small samples create lower discrimination; large samples create higher discrimination. So, in the region of the cutoff there is greater uncertainty, especially in smaller samples. This suggests that the cutoff of .95 is a fuzzy criterion, allowing some models with population RNIs slightly less than .95 to be accepted with nearly the same probability as models with population RNIs slightly more than .95. But any approximationist can accommodate that, since the sharp conceptual boundary is somewhat arbitrary. In very large samples the discrimination at the cutoff is quite good. The cutoff is also conservative.
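Under the assumption, consistent with Marsh et al. (2004), that the sample RNI is roughly bell-shaped around its population value with a standard deviation that shrinks as N grows, characteristic curves like those in Fig. 1 can be sketched directly. The standard deviations below are invented purely for illustration; they are not real RNI sampling theory.

```python
import numpy as np
from scipy.stats import norm

def acceptance_probability(pop_rni, cutoff=0.95, sd=0.01):
    """P(sample RNI > cutoff) when the sample RNI is treated as approximately
    normal with mean pop_rni and standard deviation sd (an assumed value that
    decreases with sample size)."""
    return norm.cdf((np.asarray(pop_rni) - cutoff) / sd)

grid = np.linspace(0.90, 1.00, 11)
low_discrimination = acceptance_probability(grid, sd=0.02)     # smaller N: flatter curve
high_discrimination = acceptance_probability(grid, sd=0.005)   # larger N: steeper curve
print(np.round(low_discrimination, 2))
print(np.round(high_discrimination, 2))
```

At a population RNI of exactly .95 both curves give an acceptance probability of .50, which is the 50% rejection rate discussed above.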
But I believe that these problems raised by Marsh et al. (2004) are solvable problems, and that in principle having these problems does not vitiate the desirability of using the RNI or other GOF indices and cutoff values for "good approximations".
Marsh et al. (2004) also showed that the χ² statistic (when the data were generated with multivariate normal distributions and maximum likelihood estimation was used) "consistently outperformed all of the seven GOF indexes in terms of correctly rejecting misspecified models" (p. 336), while accepting 'true' models. They suggested, rhetorically, I think, ". . . that either we should discard all GOF indexes and focus on χ² test statistics or that the hypothesis-testing paradigm used to evaluate GOF indexes was inappropriate." Perhaps Barrett was unduly influenced by this sentence. But I think it improperly suggests that we must choose between χ² tests and goodness of fit indices. On the contrary, they are complementary, not mutually exclusive. GOF indices are not test statistics of exact fit but measures of degree of approximation, which presumes, by definition, that a model does not have exact fit. So, they do not test exact fit. (They can "test" approximate fit.) But Marsh et al. (2004) also indicate that the GOF indices are very good at differentiating between nested models, which suggests a possibility for something like hypothesis testing.
Fig. 1. Probability of accepting a model as having at least a population RNI of .95, given its population RNI, in (a) the low discrimination case and (b) the high discrimination case. The curves assume that the distributions of the RNI for each model have the same variances. High discrimination is created, among other things, by large samples. (Curves are illustrative only.)

The Beauducel and Wittman (2005) paper was not designed to evaluate cutoff values and had a quite restricted objective. Their motivation was to study the behavior of several goodness of fit indices in connection with what they believed to be models typical of personality research. They noted that the approximate fit indices varied in their performance across the varying sizes of loadings in the 'true' data-generating models. However, most of their findings are explicable in terms of the findings of Marsh et al. (2004) as interpreted here. In summary, I do not think the Beauducel and Wittman (2005) paper casts doubt on the idea of using approximate fit indices with cutoff scores, since the problems raised are rectifiable.
I am puzzled why Barrett (2007) cites Fan and Sivo (2005) in support of his doubts in principle about using approximate fit indices with cutoff scores. This paper had a very narrow focus: to show that Hu and Bentler's (1999) assertion that one needed to use two kinds of fit indices (an incremental fit index and the SRMR index) to evaluate model fit was false. They succeeded in showing that Hu and Bentler's conclusion was based on an artifact. Perhaps in Barrett's mind this undermines the 'golden rules' of Hu and Bentler (1999), but it does not undermine the general idea of approximate fit indices and cutoff values, since other, improved rules, as I have already suggested here, may be developed as we gain a better understanding of the mathematics and the empirical situation typically dealt with.
Of all the papers cited by Barrett (2007), the paper by Ke-Hai Yuan (2005) raises the most serious questions about current uncritical uses of approximate fit indices. In my view this paper raises the analysis of approximate fit indices to a new level and will be a point of departure for future research in this field. Yuan notes in his abstract that most fit indices are based on certain test statistics that one assumes follow a central χ² distribution or a noncentral χ² distribution. "But," he says, "few statistics in practice follow a χ² distribution . . ." (p. 115). This raises doubts about the use of the noncentrality parameter of a noncentral χ² distribution. But indices like the CFI and the RMSEA index are based on the noncentrality parameter of the noncentral χ² distribution as a measure of model discrepancy. Of course, for a mathematical psychologist like Yuan, small empirical deviations from a theoretical distribution may be sufficient reason to abandon applying the distributional theory involved. But even he seems willing to examine ways of using the approximate fit indices by finding how to apply them when the classical theory does not apply.
Briefly, what he does is find sample statistics T that, even when the data are not normally distributed, behave as approximate χ² statistics with expected values equal to the degrees of freedom. Then one can use T − df to obtain an estimate of a 'noncentrality parameter', which can be applied in the formula for any one of the incremental fit indices or the RMSEA. He reports two such statistics, T_AR and T_CRADF, which should be used instead of the maximum likelihood χ² for T in the fit indices based on the noncentrality parameter. T_CRADF is currently available in the program EQS. At present, exact confidence intervals for fit indices derived with these statistics are not analytically available.
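To make the arithmetic concrete, the sketch below plugs a statistic T with expected value df into standard noncentrality-based formulas for the RMSEA and a CFI/RNI-type incremental index. The numerical values are hypothetical; a Yuan-style robust statistic such as T_AR or T_CRADF would simply replace the maximum likelihood χ² as the value of T.

```python
import math

def rmsea_from_T(T, df, n):
    """RMSEA built from the estimated noncentrality max(T - df, 0).

    Uses the common (n - 1) convention in the denominator; some programs use n.
    """
    ncp = max(T - df, 0.0)
    return math.sqrt(ncp / (df * (n - 1)))

def cfi_from_T(T_model, df_model, T_baseline, df_baseline):
    """CFI-type incremental index from the same noncentrality estimates,
    with the baseline misfit floored at the model misfit."""
    d_model = max(T_model - df_model, 0.0)
    d_base = max(T_baseline - df_baseline, d_model)
    return 1.0 if d_base == 0.0 else 1.0 - d_model / d_base

# Hypothetical statistics for a target model and its independence baseline:
print(rmsea_from_T(T=85.3, df=48, n=400))
print(cfi_from_T(T_model=85.3, df_model=48, T_baseline=1450.0, df_baseline=66))
```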
For us, it is important to cite the following sentences from the conclusion of Yuan's (2005) paper: "Although most of our results are on the negative side of current practice, fit indices are still meaningful as relative measures of model fit/misfit. . . . We want to emphasize that the purpose of the article is not to abandon cutoff values but to point out the misunderstanding of fit indices and their cutoff values. We hope the result of the article can lead to better efforts to establish a scientific norm on the application of fit indices" (pp. 142–143).
So, the current statistical problems associated with indices of approximation and cutoff values
do not in principle imply abandoning the ideas of approximate fit or cutoff values.

2. What happened to the logic of model testing?

Space limitations do not allow me to treat in detail every one of Barrett's other assertions against approximate fit indices, which to me are not well thought out and amount to innuendo and exaggeration. But let us deal with the question of the logic of model testing. First, I assume a cognitive science stance toward science (see Mulaik, 1995, 2001a, 2004), which holds that science is heavily metaphoric in its abstract concepts (Browne, 2003). Central to science is the metaphoric schema expressed as "Science is knowledge of objects" (Mulaik, 1995, 2001a, 2004). Our ordinary immediate perceptual experience of objects is the basis for objectivity in science. The perceptual schema of object perception is that objects are invariants in perception, independent of the actions and motions of the observer and of the objects, as observer and objects move with respect to each other in the world (Gibson, 1966, 1979, 1982; Mace, 1986). With scientist as observer, this schema drives
scientists to seek to establish "objective concepts" that represent invariants that indirectly integrate percepts across sizeable periods of time and distances in space (their "objective concepts" are not perceived immediately or directly) as they and other scientists gather observations from experiments with different methods, in different laboratories and countries. Mathematics is like a tool box of metaphoric schemas for representing objects and their relationships, and provides the material for the construction of models by which scientists think about things in the world (Lakoff & Nuñez, 2000).
Hypothesis testing concerns assertions of invariance about properties of and relations between
objects given in thought. That objects are invariants independent of the actions and points of view
of their observers leads to the requirement that hypotheses be tested with data not used in their
formulation, so that the hypothesized invariant and the observations used to test it are logically
independent. A test has to have a logical possibility of failure as well as success.
In structural equation modeling (SEM), the hypothesized models are complex in having many, sometimes hundreds of, parameters. Ideally a high proportion of these parameters are prespecified by theory. Furthermore, the models are studied in certain experimental and observational settings in which all relevant influences on the outcomes of the observed variables are not always fully known, especially those whose influences are invisible and small in effect and so have not been detected. Most SEM researchers also do not now concern themselves with quantitative measurement, which involves establishing additive conjoint measurement (Andrich, 2004), and failure to achieve this can cause some misfit. They may not be fully aware of causal heterogeneity among their experimental subjects, where stimuli have different kinds of effects on different subjects, or words in test items have different idiosyncratic meanings for different subjects. Causal relations may also be nonlinear but monotonic in their effects, while the models treat them only with linear relations. There are numerous background conditions that may have an effect on results, and all of these may not be controlled or measured and included in the model (James, Mulaik, & Brett, 1982; Mulaik & James, 1995).
In formulating their hypotheses, researchers nevertheless try to identify tentatively the principal, obvious causes of their chosen effect variables, along with any other causes. They also attempt to have indicators of these causal variables. They specify hypotheses about the relations of the causes to the various effect variables by fixing and constraining certain parameters in their models. Often these are expressed as zero coefficients, specifying which variables are not causes of other variables, but it is quite possible even to specify non-zero values, if prior experience or theory is sufficient to provide values. So, the logic of the hypothesis test goes like this:
If H1 & . . . & Hp & B1 & . . . & Bm & D1 & . . . & Dk are true, then T < χ²_df(.05).
H1, . . . , Hp are hypotheses expressed by fixing or constraining certain parameters in the model. Free parameters are not part of the hypothesis, but filler in the model. They are free because the researcher has no knowledge by which to specify their values. They are to be estimated in such a way as to minimize lack of fit of the model conditional on the hypothesized constraints, so that if there is any lack of fit, it will be attributed to the fixed and constrained parameters. B1, . . . , Bm are background conditions assumed to be the case for the experiment. D1, . . . , Dk are probabilistic distributional assumptions made for performing statistical analysis with the data. T is the χ² test statistic referred to a critical value of the χ² distribution with df degrees of freedom at the .05 level of significance. If all the hypotheses, background assumptions, and distributional assumptions are true, then T should be less than the critical value of χ².
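Stated as a computation, the test simply compares T with the upper 5% point of the central χ² distribution on df degrees of freedom. The sketch below shows the comparison; the statistic and degrees of freedom in the example are hypothetical.

```python
from scipy.stats import chi2

def exact_fit_test(T, df, alpha=0.05):
    """Return (retain?, critical value, p-value) for the chi-square test of exact fit.

    Under the joint truth of the hypothesized constraints, background
    conditions, and distributional assumptions, T should fall below the
    critical value about 100 * (1 - alpha) percent of the time.
    """
    critical = chi2.ppf(1.0 - alpha, df)
    p_value = chi2.sf(T, df)
    return T < critical, critical, p_value

print(exact_fit_test(T=85.3, df=48))   # hypothetical values
```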

Now, if it is observed that it is false that T < χ²_df(.05), then this falsifies the joint condition on the left-hand side of the expression, but it does not mean that every one of the assertions on the left-hand side is false. There may be only one condition that is false, say a hypothesis about a parameter value, an assumption about a background condition, or a statistical assumption, and then the whole joint expression on the left is negated. Or it could be any number of hypotheses, background assumptions, and statistical assumptions that are false. The χ² test does not indicate what is false, only that something, somewhere is probably false (Mulaik & James, 1995). The results deviate from the expected χ² value by more than would be expected by chance, say, 5% of the time, if the assumed hypothesis were true.
So, what is a researcher to do if he or she gets a significant χ²? Well, the first thing, of course, is to recognize that the results deviate from the ideal χ² distribution more than would be expected by chance, and report it. You should then look at the diagnostics provided by the program. (However, a word of caution: the diagnostics often concern differences between the model's covariances and the observed covariances, and the differences may be misleading if the model is seriously structurally misspecified.) Are there reasons to question the distributional assumptions? What do the residuals tell you? Are there omitted latent variables that are common to some of the manifest variables? Do the Lagrange multiplier tests or modification indices give you information about constrained parameters that produce the ill fit? Many of the conclusions from such an investigation will be hypotheses to be tested in future studies. For example, by including indicators of hypothesized omitted variables in a new study with the present variables, one may be able to control for the lack of fit. If you free the constrained parameters most associated with ill fit, you do not test a totally new hypothesis, since you only test a subset of the original constrained parameters. You may get a new model with fewer degrees of freedom that fits better but tests less. There are other things to do, but in some ways, if you pursue exclusively the omitted causes, you abandon the original causes and effects of your model. What is their status?
Most researchers, I believe, will think the effort to find the causes of ill fit to be worth it if they already have evidence that the hypothesized causal model fits the data to a high degree of approximation with a parsimony ratio df/[p(p + 1)/2] (Carlson & Mulaik, 1993) of .85 or more, where df is the degrees of freedom of the model and p(p + 1)/2 is the maximum possible number of degrees of freedom, given by the number of distinct elements of the covariance matrix fitted by the model, with p the number of variables. This is a highly tested model, because degrees of freedom correspond to the number of dimensions in which a model is free to differ from the data it is fitted to, when the model has estimated parameters (Mulaik, 2001b). The number of data points gives the maximum number of potential degrees of freedom as possible points from which the model can deviate. You lose a degree of freedom for each parameter estimated. So, models with more degrees of freedom are to be preferred to models with fewer, if the fit is very good, because the model is tested in more dimensions. This is different from simply relying on fit alone.
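The parsimony ratio itself is a one-line calculation, sketched below; the model in the example, with nine observed variables and 21 degrees of freedom, is hypothetical.

```python
def parsimony_ratio(df_model, p):
    """Parsimony ratio of Carlson and Mulaik (1993): df / [p(p + 1)/2],
    where p(p + 1)/2 counts the distinct elements of the covariance matrix."""
    max_df = p * (p + 1) / 2
    return df_model / max_df

print(parsimony_ratio(df_model=21, p=9))   # 21 / 45 = 0.47, well below the .85 heuristic
```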
So, what degree of fit should justify further work to improve the fit of an already highly tested
model? There is no golden rule, but there is a heuristic. The fit should be quite high, approaching,
but not quite attaining perfect fit (for then it is no longer an index of approximation). For indices
like the RNI this can be, say, .95, which is 95% of the scale from 0 to 1. The rationale for this is
that if the scientist did his or her preparatory work well in developing the hypothesized model, it
will include most of the major observable and detected causes that account for most of the variance in the variables studied. Then if there is any lack of fit, it should be small, because if these causes are correctly specified the small lack of fit will likely correspond to minor effects that were difficult to detect or anticipate. These would then become the focus of further research to identify the omitted effects or other reasons for the minor lack of fit. This is a heuristic rationale. There is no guarantee that it always works to identify a correct model. But much of our personal experience with our own bodies testifies that acts that produce approximate results usually can be modified slightly to get exact results. So, approximations give us plausible grounds for pursuing a concept's applicability and usefulness further.
Limited space prevents me from pursuing adequately the theme that science itself has historically developed via a succession of approximations in every field. The push has been to improve the degree of approximation, up to fit within sampling error, as measurement became more and more accurate. The approximations have also been extended to wider and wider syntheses of experience. Take, for example, Wegener's theory of continental drift. The approximate fit of the coastline of South America to that of Africa was plausible grounds for Wegener and others to pursue the idea that the two continents had "drifted apart" in some way. Copernicus initially had planets circling the sun in circles. Kepler and Newton later determined that planets had elliptical orbits. Later astronomers modified Newtonian orbits to account for the attraction of other large bodies near them. There are countless stories in the history of science that show that good initial approximations were clues to scientists that they had stumbled on ideas worth pursuing further. Thus I think it is ridiculous to reject the use of approximate fit in our science as a matter of principle. The only kinds of criticism that I feel are appropriate about approximate fit indices concern individual indices, their limitations, and misunderstandings of their interpretation.
Some final points on Barrett's essay: He says, "The problem is that no one actually knows what 'approximation' means in terms of 'approximation to causality'" (Barrett, 2007). I do not know what it means either. The approximations in SEM are between a modeled covariance matrix and an observed one. Causality enters only indirectly, as causal relations between variables determine covariances between variables in the model. A linear function standing in for a monotonic, nonlinear function can be an approximation and produce approximations to empirical covariances. Omission of small common causal influences can produce discrepancies between model and observed covariances. And contrary to his opinion, some measures of approximation can and do represent predictive accuracy, insofar as the causal structure of the model predicts the covariance structure of the model and, in turn, the observed covariances. As for quibbles over what different degrees of approximation mean, as between a CFI of .90 and a CFI of .95, this is no different from quibbling over what a .05 versus a .01 level of significance means.

3. Recommendations

I agree with Barrett’s recommendations that the v2 test be reported and its implications for the
model explored and discussed. I agree that SEM studies with samples of 200 or less should not
ordinarily be published and power analyses against meaningful alternative models should be con-
sidered when feasible. I accept most of his recommendations of what to do when the v2 test fails. I
S. Mulaik / Personality and Individual Differences 42 (2007) 883–891 891

reject his position against the use of indices of approximate fit in SEM research. His reasons
against are based on misinterpretations of recent literature on this subject. I would add that
researchers should report the parsimony ratio (Carlson & Mulaik, 1993; Mulaik, 1988) along with
indices of approximate fit so that one can evaluate the degree of fit in the light of the degree to
which the model was tested by the data. Passing a v2 test does not mean the model is necessarily
correct, especially when the parsimony ratio is low. Equivalent models are possible in this case
and should be considered for elimination in further research.

References

Andrich, D. (2004). Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care, 42, I-7–I-16.
Barrett, P. (2007). Structural equation modelling: adjudging model fit. Personality and Individual Differences, 42(5),
815–824. doi:10.1016/j.paid.2006.09.018.
Beauducel, A., & Wittman, W. (2005). Simulation study on fit indices in confirmatory factor analysis based on data
with slightly distorted simple structure. Structural Equation Modeling, 12, 41–75.
Browne, T. L. (2003). Making truth: metaphor in science. Urbana and Chicago, IL: University of Illinois Press.
Carlson, M., & Mulaik, S. A. (1993). Trait ratings from descriptions of behavior as mediated by components of
meaning. Multivariate Behavioral Research, 28, 111–159.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indices to misspecified structural or measurement model components:
rationale of two-index strategy revisited. Structural Equation Modeling, 12, 343–367.
Gibson, J. J. (1966). The senses considered as perceptual systems. London: Allen & Unwin.
Gibson, J. J. (1979). The ecological approach to visual perception. Boston: Houghton-Mifflin.
Gibson, J. J. (1982). In E. Reed & R. Jones (Eds.), Reasons for realism: Selected essays of James J. Gibson. Hillsdale,
NJ: Lawrence Erlbaum.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: assumptions, models, and data. Beverly Hills, CA:
Sage Publications.
Lakoff, G., & Nuñez, R. E. (2000). Where mathematics comes from: How the embodied mind brings mathematics into
being. New York: Basic Books.
Mace, W. M. (1986). J.J. Gibson’s ecological theory of information pickup: cognition from the ground up. In T. J.
Knapp & L. C. Robertson (Eds.), Approaches to cognition: contrasts and controversies (pp. 137–157). Hillsdale, NJ:
Lawrence Erlbaum.
Marsh, H. W., Kit-Tai, Hau, & Wen, Z. (2004). In search of golden rules: comment on hypothesis testing approaches to
setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Structural
Equation Modeling, 11, 320–341.
Mulaik, S. A. (1988). Confirmatory factor analysis. In R. B. Cattell & J. R. Nesselroade (Eds.), Handbook of
multivariate experimental psychology (pp. 259–288). New York: Plenum.
Mulaik, S. A. (1995). The metaphoric origins of objectivity, subjectivity and consciousness in the direct perception of
reality. Philosophy of Science, 62, 283–303.
Mulaik, S. A. (2001a). Objectivity and other metaphors of structural equation modeling. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: present and future (pp. 59–78).
Mulaik, S. A. (2001b). The curve-fitting problem: An objectivist view. Philosophy of Science, 68, 218–303.
Mulaik, S. A. (2004). Objectivity in science and structural equation modeling. In D. Kaplan (Ed.), The sage handbook of
quantitative methodology for the social sciences (pp. 425–446). Thousand Oaks, CA: Sage Publications.
Mulaik, S. A., & James, L. R. (1995). Objectivity and reasoning in science and structural equations modelling. In R. H.
Hoyle (Ed.), Structural equation modeling: issues and applications (pp. 118–137). Beverly Hills, CA: Sage
Publications.
Yuan, K. H. (2005). Fit indices versus test statistics. Multivariate Behavioral Research, 40, 115–148.
