
J. of the Acad. Mark. Sci.

(2012) 40:8–34
DOI 10.1007/s11747-011-0278-x

Specification, evaluation, and interpretation of structural equation models
Richard P. Bagozzi & Youjae Yi

Received: 15 July 2011 / Accepted: 27 July 2011 / Published online: 21 August 2011
© Academy of Marketing Science 2011

Abstract We provide a comprehensive and user-friendly compendium of standards for the use and interpretation of structural equation models (SEMs). To both read about and do research that employs SEMs, it is necessary to master the art and science of the statistical procedures underpinning SEMs in an integrative way with the substantive concepts, theories, and hypotheses that researchers desire to examine. Our aim is to remove some of the mystery and uncertainty of the use of SEMs, while conveying the spirit of their possibilities.

Keywords Structural equation models . Confirmatory factor analysis . Construct validity . Reliability . Goodness-of-fit

What we observe is not nature itself, but nature exposed to our method of questioning.

Werner Heisenberg

Introduction

Structural equation models are statistical procedures for testing measurement, functional, predictive, and causal hypotheses. Complementing multiple regression and ANOVA methods, among others, these multivariate statistical tools are essential to master if one is to understand many bodies of research and to conduct basic or applied research in the behavioral, managerial, health, and social sciences.

In this article, we focus on providing researchers with knowledge needed to specify, evaluate, and interpret structural equation models (SEMs). Structural equation modeling can be complex to implement, and the appraisal of findings can be difficult as well. Surprisingly, treatments of the evaluation and interpretation of SEMs are few in number, sparse in coverage, and in need of updating (e.g., Bagozzi 1980, 2010; Bagozzi and Baumgartner 1994; Bagozzi and Yi 1988; Barrett 2007; Gefen et al. 2000; Iacobucci 2009, 2010; Markland 2007; McDonald and Ho 2002). A need exists to consider the art and practice of SEM specification, evaluation, and interpretation without sacrificing too much technical detail. We attempt to strike a balance that is especially mindful of the everyday user of SEMs and the general reader of SEM research.

R. P. Bagozzi (*)
Ross School of Business, University of Michigan,
Ann Arbor, MI, USA
e-mail: bagozzi@umich.edu

Y. Yi
College of Business Administration, Seoul National University,
Seoul, Korea
e-mail: youjae@snu.ac.kr

Some preliminaries

Researchers disagree on how to present SEMs, with some (e.g., Bentler 2010) favoring the use of equations and simple diagrams, and others (e.g., Iacobucci 2009) preferring the use of matrix algebra and diagrams with Greek letters. The choice is largely a matter of taste and previous exposure to one approach or another. The former is less intimidating than the latter, but because many published articles contain common Greek letter conventions and the level of matrix algebra is typically very elementary, it might be worthwhile to become familiar with both the use of equations and matrixes and Greek letters. In any case, all the leading programs—AMOS (Arbuckle 2009), EQS

(Bentler 2008), LISREL (Jöreskog and Sörbom 1996a, b), and Mplus (Muthén and Muthén 2010)—permit some combination of matrix algebra, equation, and/or graphical implementations. Our personal experience and comments from students and colleagues over the years suggest that SEM users achieve deeper insight and avoid certain model misspecification errors when they include matrix conventions in their skill sets. But again, the choice of approach and indeed the choice of program involve heavy elements of convention and familiarity gained from taking an introductory course or workshop.

Figure 1 presents an SEM for illustrative purposes; it is typical of what one finds across the various literatures. Here we take as an example the technology acceptance model (TAM), one of the most highly cited models in the literature (two articles alone have received over 15,000 Google citations to date: Davis 1989; Davis et al. 1989). Although TAM was originally tested with multiple regression, SEMs are routinely used nowadays.

The TAM attempts to explain the adoption of new technologies ("usage" in Fig. 1) and posits that usage is a direct function of one's decision or intention to adopt the technology. Intention to adopt, in turn, is determined by one's attitude toward adoption and felt normative pressure to adopt. Two basic categories of reasons for adoption—perceived usefulness and ease of use—ground attitudes and subjective norms.

The standard graphical SEM conventions have been used in Fig. 1. The central variables in TAM are shown as ellipses or circles, with arrows used to connect these variables, so as to conform to hypotheses. For example, the hypothesis that usage will be a function of intentions is conveyed by the arrow going from intention to usage. The central variables in a theory go by various names: latent variables, theoretical variables, constructs or theoretical constructs, unobservable variables, factors, traits, or simply, concepts. They signify the notion that the variables in our theories and hypotheses are typically framed as abstractions, ideal types, or ideas conceived conceptually without measurement error necessarily in mind. Each latent variable in Fig. 1 is connected to one or more rectangles or boxes, which designate measurements of the latent variables. The measurements are also referred to variously as manifest variables, empirical variables, observed variables, observations, indicators, or simply, measures. The connections of latent variables to manifest variables are also represented by arrows.
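Each arrow in Fig. 1, as just described, is a directional hypothesis. As an informal illustration (the snake_case names below are ours, not notation from any SEM program), the hypothesized paths in Table 1's Bentler-Weeks equations can be written down as a small directed graph and checked for reachability, for example that perceived ease of use influences usage only through the chain running to intention:

```python
# Informal sketch of Fig. 1's directional hypotheses as a directed graph.
# Names are descriptive labels, not syntax from any SEM program.
tam_paths = {
    "ease_of_use": ["usefulness", "attitude", "subjective_norm"],
    "usefulness": ["attitude", "subjective_norm"],
    "attitude": ["intention"],
    "subjective_norm": ["intention"],
    "intention": ["usage"],
    "usage": [],
}

def reaches(graph, start, target):
    """Depth-first search: is there a chain of hypothesized paths from start to target?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

print(reaches(tam_paths, "ease_of_use", "usage"))  # → True: effects flow through intention
```

Because usage's only parent is intention, every upstream variable's effect on usage is mediated by intention, which the reachability check makes explicit.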

[Figure 1 appears here: a path diagram with the latent variables (ellipses) Perceived ease of use (F1), Perceived usefulness (F2), Attitude (F3), Subjective norm (F4), Intention (F5), and Usage (F6); their indicators V1–V12 (boxes); the latent-variable disturbances D1–D5; and the measurement errors E1–E11.]
Fig. 1 Illustrative structural equation model for the technology acceptance model (partial rendition, for discussion purposes)

Two alternative conventions are followed in the literature for designating information in a model graphically. The Bentler-Weeks (B-W) tradition uses equations to specify a model (Bentler and Weeks 1980). Table 1 presents the B-W equations for the TAM in Fig. 1. Notice that 17 equations capture the variables and system of relationships in Fig. 1. There are 6 latent variables (F1–F6), 12 manifest variables (V1–V12), 5 latent variable disturbances (D1–D5), and 12 manifest variable disturbances (E1–E11, plus E12 = 0). We have loosely followed the practice in the EQS program for depicting relationships between latent and manifest variables by use of asterisks, where the asterisk stands for coefficients to be estimated (except for one arrow per latent-variable-to-manifest-variable relationship, which is fixed to 1.0 for scaling purposes).

The other tradition frequently employed is LISREL, which relies on equations and matrix notation (see Table 1). Here explicit differentiation is made between latent exogenous and endogenous variables, and between their respective indicators. As we see in Fig. 1 and Table 1, latent exogenous variables (i.e., those functioning as independent variables) are labeled with Greek ξ, latent endogenous variables (those functioning as only dependent variables or as both independent and dependent variables) with η, indicators of ξ with x, and indicators of η with y. Error terms for indicators are δ for x and ε for y. Error terms for latent endogenous variables are drawn as ζ. Depending on custom, some researchers call error terms disturbances or residuals. The δ's and ε's are also termed errors in variables or measurement error, whereas the ζ's are termed errors in equations or theoretical error.

Relationships between latent variables and their respective indicators are presented as λxij for ξ and x, and λyij for η and y. The λ's are coefficients expressing the strength of correspondence between latent variables and their indicators. Often they are referred to as factor loadings. All measures shown in Fig. 1 "reflect" their hypothesized latent variables and are often called reflective indicators. Notice that the equation relating an indicator to its corresponding factor can be written as follows: y = λy η + ε. This might be interpreted as claiming that variation in measure, y, is produced by both the factor and error. The factor and error term are assumed to be uncorrelated by convention, an assumption needed to make models tractable. Later we will consider an alternative measurement model for equations linking latent variables to their measures: a formative model positing that latent variables are functions of their indicators.

Finally, relationships among latent variables in the LISREL tradition are of two types. For relationships showing the effect of an exogenous latent variable on an endogenous latent variable, such as the impact of perceived ease of use on perceived usefulness (see Fig. 1), the symbol is γ. For relationships showing the effects of one endogenous latent variable on another endogenous latent variable, such as the effect of attitude on intention in Fig. 1, the symbol is β. The γ's and β's are regression coefficients, but unlike seemingly similar coefficients in traditional multiple regression, where the independent and dependent variables are observed variables and the regression coefficients can be contaminated by measurement error, the γ's and β's are corrected for the unreliability of the indicators.

With experience, the numerous conventions mentioned above will become second nature. A number of our former students, who are now professors, have gone out of their way to tell us that although some learned earlier to graphically program SEMs before taking our classes, they now recommend that one not initially rely on graphical techniques because doing so is "too easy." They claim that not only can reliance on graphical techniques lead to specification errors, but also one does not gain the depth of learning and intuition that accrues when one internalizes the equations and matrixes underlying SEMs.

Benefits of SEMs

In pursuit of knowledge, every day something is acquired; in pursuit of wisdom, every day something is dropped.

Lao Tzu, 6th century BC

To motivate the study and use of SEMs, we offer the following comments. SEMs are not new in the sense that they are part of the existing family of multivariate statistical techniques. Indeed, most parametric statistics can be accomplished by use of SEMs, so one benefit is that SEMs are generic tools and provide a broad, integrative function conveying the synergy and complementarity among many different statistical methods. These notions occasionally get lost in the oft-made distinction between so-called first-generation statistical methods (e.g., correlation analysis, exploratory factor analysis, multiple regression, ANOVA, canonical correlation analysis) and second-generation methods (i.e., SEMs: confirmatory factor analysis and structural equation models). The former are special cases of the latter, and common SEM programs can be used to perform most classical statistical analyses if desired.

Yet there are some differences to note. The use of SEMs yields benefits not possible with first-generation statistical methods. One important benefit is that it is possible to take into account types of error confounding first-generation procedures. For example, random or measurement error in indicators of latent variables can be modeled and estimated explicitly. Systematic or method error can also be represented. The result is that focal parameters corresponding to hypotheses are purged of particular kinds of bias, and certain errors in inference avoided. For instance, the downward biases in coefficients in multiple regressions are often corrected for measurement error in SEM analyses,

Table 1 Equation and matrix representation for technology acceptance model in Fig. 1

Measurement model:                    Structural model:

Bentler-Weeks representation
V1 = 1F1 + E1                         F2 = *F1 + D1
V2 = *F1 + E2                         F3 = *F1 + *F2 + D2
V3 = 1F2 + E3                         F4 = *F1 + *F2 + D3
V4 = *F2 + E4                         F5 = *F3 + *F4 + D4
V5 = *F2 + E5                         F6 = *F5 + D5
V6 = 1F3 + E6
V7 = *F3 + E7
V8 = 1F4 + E8
V9 = *F4 + E9
V10 = 1F5 + E10
V11 = *F5 + E11
V12 = 1F6 + 0

LISREL representation
x1 = ξ + δ1                           η1 = γ11 ξ + ζ1
x2 = λx21 ξ + δ2                      η2 = β21 η1 + γ21 ξ + ζ2
                                      η3 = β31 η1 + γ31 ξ + ζ3
y1 = η1 + ε1                          η4 = β42 η2 + β43 η3 + ζ4
y2 = λy21 η1 + ε2                     η5 = β54 η4 + ζ5
y3 = λy31 η1 + ε3
y4 = η2 + ε4
y5 = λy52 η2 + ε5
y6 = η3 + ε6
y7 = λy73 η3 + ε7
y8 = η4 + ε8
y9 = λy94 η4 + ε9
y10 = η5 + 0

Matrixes
x = Λx ξ + δ        y = Λy η + ε        η = Bη + Γξ + ζ

[x1]   [1   ]       [δ1]
[x2] = [λx21] ξ  +  [δ2]

[y1 ]   [1    0    0    0    0  ]          [ε1]
[y2 ]   [λy21 0    0    0    0  ]          [ε2]
[y3 ]   [λy31 0    0    0    0  ]  [η1]    [ε3]
[y4 ]   [0    1    0    0    0  ]  [η2]    [ε4]
[y5 ] = [0    λy52 0    0    0  ]  [η3] +  [ε5]
[y6 ]   [0    0    1    0    0  ]  [η4]    [ε6]
[y7 ]   [0    0    λy73 0    0  ]  [η5]    [ε7]
[y8 ]   [0    0    0    1    0  ]          [ε8]
[y9 ]   [0    0    0    λy94 0  ]          [ε9]
[y10]   [0    0    0    0    1  ]          [0 ]

    [0   0   0   0   0]        [ψ11                ]
    [β21 0   0   0   0]        [0   ψ22            ]
B = [β31 0   0   0   0]    Ψ = [0   ψ32 ψ33        ]  (symmetric)
    [0   β42 β43 0   0]        [0   0   0   ψ44    ]
    [0   0   0   β54 0]        [0   0   0   0   ψ55]

Γ = [γ11 γ21 γ31 0 0]′
ϕ = ϕ11;   Θδ = [θδ11 θδ22] diag
Θε = [θε11 θε22 θε33 θε44 θε55 θε66 θε77 θε88 θε99 0] diag
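The structural portion of the LISREL representation can be solved in reduced form: because B is strictly lower triangular (the model is recursive), (I − B) is invertible, and η = (I − B)⁻¹(Γξ + ζ). The sketch below checks this numerically; the coefficient values are arbitrary placeholders of ours, not estimates from the article.

```python
import numpy as np

# Structural matrices following Table 1's pattern; the numeric values
# are arbitrary placeholders, not estimates reported in the article.
b21, b31, b42, b43, b54 = 0.5, 0.4, 0.3, 0.6, 0.7
g11, g21, g31 = 0.8, 0.2, 0.3

B = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [b21, 0.0, 0.0, 0.0, 0.0],
    [b31, 0.0, 0.0, 0.0, 0.0],
    [0.0, b42, b43, 0.0, 0.0],
    [0.0, 0.0, 0.0, b54, 0.0],
])
Gamma = np.array([[g11], [g21], [g31], [0.0], [0.0]])

# B is strictly lower triangular (a recursive model), so (I - B) is
# invertible. With zeta set to 0, the reduced form gives the total
# effect of xi on each eta, accumulated over all directed paths.
total_effects = np.linalg.inv(np.eye(5) - B) @ Gamma
print(total_effects.ravel())
```

The last entry is the total effect of ξ on η5 (usage), which sums the products of coefficients along every path from ξ to η5.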

such that some failures to detect true effects under the former will not occur under the latter.¹ Other advantages of SEMs over first-generation methods include provision of more straightforward tests of mediation, methods to assess construct validity in broader and deeper ways than possible with traditional correlation analyses, and ways to correct for systematic bias in tests of substantive hypotheses.

By way of summary, we provide a list of benefits that SEM use may offer:

1. Provides integrative function (a single umbrella of methods under leading programs).
2. Helps researchers to be more precise in their specification of hypotheses and operationalizations of constructs.
3. Takes into account reliability of measures in tests of hypotheses in ways going beyond the averaging of multi-measures of constructs.
4. Guides exploratory and confirmatory research in a manner combining self-insight and modeling skills with theory. Works well under the philosophy of discovery or the philosophy of confirmation.
5. Often suggests novel hypotheses originally not considered and opens up new avenues for research.
6. Is useful in experimental or survey research, cross-sectional or longitudinal studies, measurement or hypothesis testing endeavors, within or across groups and institutional or cultural contexts.
7. Is easy to use.
8. Is fun.

Hereafter we consider many practical and technical issues in the use and interpretation of SEMs.

Philosophical foundations

Truth, Existence, Knowledge, Causality, Identity, Goodness: these are the principal notions which philosophers examine. Intelligent persons normally have thoughtful and useful lives without pausing to look into these notions and into the connections between them. Once one starts to look into them, it is difficult to stop.

Stuart Hampshire

A primary objective of virtually all research is to make sense of some aspect of the world of experience. SEMs provide a useful forum for sense-making and in so doing link philosophy of science criteria to theoretical and empirical research. A useful way to construe sense-making is as a holistic process for discerning meaning from any research enterprise. In this section, we discuss three types of meaning which infuse sense-making: theoretical meaning, empirical meaning, and spurious meaning. While it is possible to discuss each type of meaning in isolation, it is important to incorporate all senses of meaning into any research effort, with an aim to being as balanced and inclusive as possible. The different senses of meaning function dialectically to constrain and empower each other.

The following presentation is by necessity brief and abstract. The reader may wish to skip this section for now and come back to it later. But because the holistic construal sketched below can be thought of as the raison d'être or Shakespearean "be-all and end-all" for SEMs, it may prove useful to master the mindset undergirding the presentation herein. That is, the holistic construal can serve to bind theory and method and in so doing transcend what happens in the world and our ability to comprehend it (for more elaboration on the ideas see Bagozzi 1980, 1984, 2011a, b; Bagozzi and Phillips 1982).

Figure 2 summarizes the ideas behind the holistic construal. To simplify the discussion, consider that a researcher proposes that a focal construct, F, is a mediator of the relationship of antecedent, A, to consequence, C. The researcher's goal may be to formulate a conceptual model and explain the mediation process. This would normally be done verbally, perhaps augmented with symbols of variables and equations, to provide a theoretical development of hypotheses for study.

The theoretical meaning of the variables and proposed linkages comes about in the following way. Each concept in the theory (A, F, and C) and its conceptualization attains part of its interpretation through specification of its characterization and a theoretical definition (depicted as CS and triangles in Fig. 2). Sentences specifying the meanings of A, F, and C might designate attributes or characteristics of the concepts, a structure organizing the attributes, and dispositions (e.g., powers or liabilities) of the concepts as a whole or of their attributes. The meaning of any focal construct, such as F, resides also in (1) the antecedents, determinants, or causes of F, (2) the consequences, implications, or results of F, and (3) the associative (i.e., nonfunctional, noncausal) links to F (the last are not shown in Fig. 2). That is, theoretical meaning accrues via the connections that each theoretical variable has with other theoretical variables in a nomological network and is expressed through indication of the content of the hypotheses linking a concept to other concepts and the rationales for hypotheses (see H and R in Fig. 2). Notice that γ accompanies the A to F path and β alludes to the F to C path. The γ and β symbolize parameter estimates and thus are inferred from estimation procedures with data. They are not the theoretical relations in question but rather empirical measures or implications of H and R.

¹ Although parameter estimates in simple regression are always biased downward to the extent of measurement error, no general statements in this regard can be made with respect to multiple regression.
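The attenuation noted in the footnote is easy to demonstrate by simulation. In the sketch below (our illustration; all numbers are invented), y is regressed on a noisy indicator x of the latent ξ; the slope is biased downward by the indicator's reliability, and dividing by that reliability recovers the structural coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# True structural model: y = 0.6 * xi + zeta, but only a noisy
# indicator x = xi + delta is observed (all values illustrative).
xi = rng.normal(0.0, 1.0, n)
y = 0.6 * xi + rng.normal(0.0, 1.0, n)
x = xi + rng.normal(0.0, 1.0, n)

naive_slope = np.cov(x, y)[0, 1] / np.var(x)

# Reliability of x = var(xi) / var(x) = 1 / (1 + 1) = 0.5 here, so the
# naive slope is attenuated to roughly 0.6 * 0.5 = 0.3.
reliability = 0.5
corrected_slope = naive_slope / reliability

print(naive_slope, corrected_slope)  # attenuated toward 0.3; corrected back toward 0.6
```

This is exactly the kind of correction an SEM performs implicitly when it estimates structural coefficients among latent variables rather than among fallible observed scores.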

Fig. 2 A framework for thinking about theoretical, empirical, and spurious meaning. Source: Bagozzi (2011a)
Empirical meaning refers to the observational content affiliated with theoretical constructs after spurious meaning (discussed below), if any, has been removed. This is done formally through specification of correspondence rules joining theoretical concepts to observational concepts. Observational concepts are abstract definitions or interpretations of indicators or measures. Correspondence rules are not part of either the theoretical meaning of a theoretical concept or an observation, per se, but rather they constitute auxiliary hypotheses concerning additional theoretical mechanisms (e.g., the theory of instrumentation or method used to operationalize a theoretical construct), empirical criteria (e.g., observational procedures or descriptive conventions), and a rule connecting mechanisms and criteria (see Bagozzi 2011a).

In addition to correspondence rules, factor loadings are shown connecting latent variables to manifest variables in Fig. 2. A factor loading is an inferred parameter estimated from empirical associations among observed variables. It is in a sense an imperfect implication of the meaning of a correspondence rule, and even when measurement error in a SEM is negligible, a factor loading may not capture fully the meaning of the correspondence rule, because some aspects of empirical meaning reside also in the theoretical content of the rule. In this sense, there is some surplus meaning not entirely captured in a factor loading. All this suggests that one should incorporate formal rules and conceptualization procedures when specifying measures of latent variables. For example, such criteria should be met as logical deducibility of observations from the conceptual definition of a theoretical construct, and consistency and comparability of levels of abstraction for multiple measures of a construct (see Bagozzi and Edwards 1998, pp. 79–82).

Spurious meaning refers to contamination of empirical meaning and resides in one or more of three sources: random error, systematic error, and measure specificity. Returning to our general measurement equation, where x = λx ξ + δ or y = λy η + ε, we can summarize the three sources of spurious meaning with the following equation:

δ or ε = e + s + m

where e is a random component, s is a component specific to each measure, and m is a component specific to systematic error (e.g., method bias). Researchers using multiple regression or ANOVA often assume that s and m are small in comparison to e and therefore can be ignored, but this is an assumption seldom met, and in any case it would be best to estimate the magnitudes of the sources of

error, which is possible in many SEMs (e.g., Bagozzi et al. 1991a, 1999).
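The roles of s and m can be made concrete with a small simulation (ours; the variance components are invented for illustration). Two indicators of the same latent variable share a method component m, which inflates their observed correlation beyond what the common trait produces:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two reflective indicators of one latent variable xi, each with error
# delta = e + s + m. Here e + s is indicator-specific noise (variance 0.5,
# uncorrelated across items) and m is a method component shared by both
# indicators (variance 0.5). All variances are illustrative assumptions.
xi = rng.normal(0.0, 1.0, n)
m = rng.normal(0.0, np.sqrt(0.5), n)                  # shared method error
x1 = xi + rng.normal(0.0, np.sqrt(0.5), n) + m        # e + s for item 1
x2 = xi + rng.normal(0.0, np.sqrt(0.5), n) + m        # e + s for item 2

r_observed = np.corrcoef(x1, x2)[0, 1]

# Each indicator's variance is 1 + 0.5 + 0.5 = 2. The trait alone would
# produce r = 1/2 = .50; the shared method adds another 0.5/2 = .25.
print(r_observed)  # close to .75, inflated by the common method factor
```

An SEM with a method factor can separate these components; a simple correlation cannot.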
Why is it important to consider theoretical meaning and
differentiate it from other kinds of meaning? One reason is
that it emphasizes that which is to be explained and ultimately
measured and tested. In particular it focuses on the content of
a conceptualization and its theoretical integrity, and it does so
in a way avoiding confounding with empirical issues and
contamination in the measurement or testing process. At the
same time, scrutiny is placed on observational procedures and
ways to circumvent biases and threats to validity. Finally, the
three senses of meaning are interdependent and synergistic. A
dialectic tension is built into the holistic construal such that
theoretical, empirical, and spurious meanings constrain and
inform each other. The result is that theory development and
empirical aspects of the research enterprise are brought
together in a way that promotes speculation but is guided by
appraisals and scientific standards, thereby making research
an ongoing process of interconnected steps of inquiry,
confirmation, refutation, and (temporary) synthesis.

Fig. 3 Example single-factor confirmatory factor analysis model (reflective model)

Measurement

Many research aims can be achieved by use of SEMs strictly for measurement purposes. Much of what we say below, however, applies also to other uses of SEMs such as prediction and explanation.

Confirmatory factor analysis model

Consider the issue of how to measure a theoretical variable in a study. The confirmatory factor analysis (CFA) model is useful in this regard. Figure 3 depicts a single-factor CFA where we show four measures or indicators for purposes of discussion. We begin with the reflective model case and have omitted designating each indicator with a Vi, relying on only xi's for simplicity.

In traditional analyses such as multiple regression and ANOVA, researchers typically average all measures of a variable or scale. Justification for such practices rests on showing first that all items load on a single factor by use of exploratory factor analysis and then achieving satisfactory reliability, such as is demonstrated by Cronbach alpha values greater than .70.

A more rigorous approach is to perform a CFA, where whether a set of indicators shares enough common variance to be considered measures of a single factor is formally tested. A failure to reject such a model on the basis of a χ2-test or other goodness-of-fit indexes (discussed below) establishes the unidimensionality of the indicators. Reliability of the indicators can be computed by use of the factor loadings and error variances according to a formula we consider below. The CFA model can be used to test many interesting hypotheses. One use is to establish or verify the dimensionality of scales. Many scales are hypothesized to be unidimensional; the single-factor CFA can be used to test this hypothesis.

To grasp the ideas behind the CFA model and its benefits and implications, consider what happens when we apply the model of Fig. 3 to the data in Table 2. The correlations below the diagonal are for the measures of attitude toward giving blood taken by 7-point semantic differential items: unpleasant-pleasant, sad-happy, bad-good, and unfavorable-favorable. The data were collected from 155 respondents who were emotionally aroused experimentally, and the hypothesis is that all four measures should load satisfactorily on one factor. The left-hand panel of Table 3 presents the findings. Based on the goodness-of-fit indexes, we cannot reject the hypothesis of a single factor underlying the data: χ2(2)=2.22, p=.33, RMSEA=.03, NNFI=1.00, CFI=1.00, and SRMR=.014 (we will discuss goodness-of-fit measures and their interpretation later in the article). Factor loadings are high (.76–.87) and statistically significant, and error variances are relatively low (.24–.42).

Table 2 Correlations among attitudinal measures used to illustrate one-factor and two-factor models (see Figs. 3 and 4)

Pleasant-unpleasant      1.000   .851   .540   .530
Happy-sad                 .655  1.000   .499   .513
Good-bad                  .650   .769  1.000   .892
Favorable-unfavorable     .640   .677   .687  1.000

Below diagonal for N=155 respondents exposed to emotional arousal; above diagonal for N=174 respondents not exposed to emotional arousal
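Under a single-factor model with standardized loadings, the implied correlation between indicators i and j is λiλj. A quick plausibility check of our own (not a substitute for the fitted model) compares the correlations implied by the estimated loadings for the aroused sample (.76, .87, .87, .79) with the observed correlations below the diagonal of Table 2:

```python
import numpy as np

# Standardized loadings estimated for the aroused sample (Table 3).
loadings = np.array([0.76, 0.87, 0.87, 0.79])

# A one-factor model implies corr(x_i, x_j) = lambda_i * lambda_j.
implied = np.outer(loadings, loadings)

# Observed correlations, aroused sample (below the diagonal in Table 2).
observed = np.array([
    [1.000, 0.655, 0.650, 0.640],
    [0.655, 1.000, 0.769, 0.677],
    [0.650, 0.769, 1.000, 0.687],
    [0.640, 0.677, 0.687, 1.000],
])

off_diagonal = ~np.eye(4, dtype=bool)
max_gap = np.abs(implied - observed)[off_diagonal].max()
print(round(max_gap, 3))  # → 0.04
```

The largest discrepancy, about .04 for the pleasant/favorable pair, is consistent with the model's good fit.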

Table 3 Summary of findings for single-factor and two-factor models applied to the data shown in Table 2 (see Figs. 3 and 4)

          Aroused respondents                  Unaroused respondents
          Single-factor    Two-factor          Single-factor     Two-factor

λ11       .76(.07)    λ11  .76(.07)       λ11  .93(.06)     λ11  .95(.06)
λ21       .87(.07)    λ21  .86(.07)       λ21  .90(.06)     λ21  .90(.07)
λ31       .87(.07)    λ32  .87(.07)       λ31  .62(.07)     λ32  .95(.06)
λ41       .79(.07)    λ42  .79(.07)       λ41  .62(.07)     λ42  .94(.06)
θδ11      .42(.06)    θδ11 .42(.06)       θδ11 .14(.04)     θδ11 .10(.06)
θδ22      .25(.04)    θδ22 .26(.05)       θδ22 .19(.04)     θδ22 .20(.06)
θδ33      .24(.04)    θδ33 .25(.05)       θδ33 .62(.07)     θδ33 .10(.05)
θδ44      .37(.05)    θδ44 .37(.05)       θδ44 .62(.07)     θδ44 .11(.05)
ϕ21                   ϕ21  1.01(.03)                        ϕ21  .60(.06)
χ2(df)    2.22(2), p=.33   2.03(1), p=.15   207.71(2), p=.00   1.55(1), p=.21
RMSEA     .03              .08              .62               .06
NNFI      1.00             .99              .39               .99
CFI       1.00             1.00             .54               1.00
SRMR      .014             .013             .16               .004

Standard errors in parentheses; ϕ is standardized
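Two quantities connected with Table 3 can be reproduced by hand. For the reliability computation mentioned above, a common form is composite reliability, ρ = (Σλ)² / [(Σλ)² + Σθ] (Bagozzi and Yi 1988); we apply it here to the aroused single-factor estimates. The 95% confidence interval for ϕ21 in the unaroused two-factor solution is the usual estimate ± 1.96 standard errors:

```python
# (1) Composite reliability for the single-factor model, aroused sample:
#     rho = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances).
loadings = [0.76, 0.87, 0.87, 0.79]
error_vars = [0.42, 0.25, 0.24, 0.37]
total_loading = sum(loadings)
rho = total_loading**2 / (total_loading**2 + sum(error_vars))

# (2) 95% confidence interval for phi_21 = .60 (s.e. = .06), two-factor
#     model, unaroused sample: estimate +/- 1.96 standard errors.
phi, se = 0.60, 0.06
ci = (round(phi - 1.96 * se, 2), round(phi + 1.96 * se, 2))

print(round(rho, 2), ci)  # → 0.89 (0.48, 0.72)
```

Both checks use only the point estimates and standard errors printed in Table 3.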

The correlations above the diagonal in Table 2 are for a different sample of 174 respondents who were not emotionally aroused experimentally. The hypothesis is that two distinct factors should result: one corresponding to affective attitude (unpleasant-pleasant, sad-happy), and one conforming to evaluative attitude (bad-good, unfavorable-favorable). Figure 4 illustrates the two-factor model. See Bagozzi (1994, 1996) for a rationale explaining one-factor and two-factor representations in a similar study. The last column in Table 3 shows the findings for this model, where we see that the two-factor model fits well: χ2(1)=1.55, p=.21, RMSEA=.06, NNFI=.99, CFI=1.00, and SRMR=.004. The factor loadings are high, error variance is low, and the factors correlate ϕ21=.60 (s.e.=.06). Because one expects affective and evaluative reactions to a single target (in this case, "giving blood") to be positively and at least moderately highly correlated, ϕ=.60 seems reasonable. A confidence interval around the value, .48≤ϕ21≤.72, suggests that the factors correlate much below 1.00 and therefore are distinct (one could alternatively do a χ2-difference test to ascertain this).

What would have happened if we tested a one-factor model on the data presented above the diagonal in Table 2? The findings are presented in the third column of Table 3, where it can be seen that the model fits poorly: χ2(2)=207.71, p=.00, RMSEA=.62, NNFI=.39, CFI=.54, and SRMR=.16. Thus, the data support a two-factor model but fail to support a one-factor model. The differences in goodness-of-fit indexes and the χ2 test allow us to make a choice between the two models in this case.

In a parallel fashion, we might ask: what if a two-factor model were fit to the data presented below the diagonal in Table 2? Column two in Table 3 shows the results of this model. Notice that this model fits the data well, although the RMSEA is somewhat higher than desirable: χ2(1)=2.03, p=.15, RMSEA=.08, NNFI=.99, CFI=1.00, and SRMR=.013. The choice between the one-factor and two-factor models for the data presented below the diagonal in

Fig. 4 Two-factor confirmatory factor analysis model (reflective model)

Table 2 is not easy to make based on the goodness-of-fit and χ2 tests, and it may seem that the one-factor and two-factor models are undifferentiated in this instance. However, it is possible to make a choice between the one-factor and two-factor models on the basis of an inspection of the findings. Notice in Table 3, column 2, that ϕ21=1.01 with s.e.=.03, which suggests that the proposed factors are in effect perfectly correlated. In other words, both the one-factor and two-factor model results support the case for a single factor.

An issue to consider is whether a model and its parameters are identified. A model is identified if all its freely estimated parameters are identified in the sense that it is not possible for two or more distinct sets of parameter estimates to reproduce the same population variance-covariance matrix. A model where one or more parameters are not identified is termed underidentified, and programs cannot provide reliable parameter estimates for all parameters. A model where parameter estimates can be solved for uniquely in terms of observed variances and covariances is called identified. When there are exactly as many unique ways to solve for parameters as there are parameters, the model is just identified. When there are redundant restrictions in a model that permit solving for all parameters, such that one or more can be solved for in multiple ways, we term the model overidentified. Underidentified, just identified, and overidentified models have negative, zero, and positive degrees of freedom, respectively. Only overidentified models can be used to test the overall fit of models (e.g., by use of χ2 tests). For single-factor models with normal restrictions, at least four indicators are needed to achieve overidentification; three-indicator single-factor models are just identified. Some scales are multidimensional, and CFAs with multiple factors can be used to test these models. The simplest overidentified two-factor model has two indicators per factor and 1 degree of freedom (e.g., see Fig. 4).

An important issue in specifying CFA models and causal models is to determine the number of indicators per factor.

… hand, the more indicators per factor, the more parameters to be estimated and the greater the sample size required to test models. Likewise, the more indicators per factor, the more effort needed in measurement, such as adding more items to questionnaires, which may not be feasible in some studies. Therefore, there are practical limitations to adding more indicators. Moreover, some latent variables will have only a limited number of measures available by design or other constraints. As a consequence, in practice one sees a great variety in the number of indicators employed per factor in various investigations. Note, too, that before the advent of SEMs, researchers using multiple regression, ANOVA, and other first-generation statistical methods were forced to use single indicators per variable, so even with a few indicators in SEMs, we still achieve advantages over first-generation procedures. The latter, in fact, fail to take into account measurement error in tests of hypotheses.

A frequent practical problem is that some scales or latent variables have too many indicators. For example, certain personality scales, believed to be unidimensional, have 20 or more items. It may be unreasonable to expect that all such items measure a factor in a way that sustains unidimensionality. Further, sample sizes larger than practically obtainable may be needed to estimate all the parameters associated with models that have many items per factor. Thus researchers sometimes aggregate (e.g., average or sum) items into subsets or parcels to use as indicators of factors. For example, imagine that one variable in a larger theory to be tested has 12 items, and it is not feasible to have each item load on that factor in the larger model. If an exploratory factor analysis, or better yet a confirmatory factor analysis, shows that a one-factor model fits the 12 items well, then some basis exists for treating the 12 items as parallel measures, and the 12 items might be treated as equivalent. Given this, it might be useful to form three indicators of the factor in question, say, by selecting four items each per indicator and averaging the respective items. This practice reduces the number of parameters to be estimated in
Of course, for testing an existing scale, the number of the larger model from 24 to 6, requires a smaller sample size to
indicators is given by the number of items in the scale. But test the model, and may smooth out some of the error in items
when a researcher designs his/her own scale, adds items to as well. For discussions and examples of this practice, see
or subtracts items from an existing scale, or proposes Bagozzi and Heatherton (1994), Bagozzi and Edwards
measures of a factor, he/she must decide how many (1998), and Little et al. (2002).
indicators to specify. A number of trade-offs must be taken
into account in this regard. Too few indicators per factor Reliability
may produce unstable solutions and lead to failures of
programs to converge, especially in complex models with Cronbach’s alpha and other formulae for computing the
many latent variables and paths. For this reason, some reliability of measures were derived to ascertain the internal
researchers advocate using at least three indicators per consistency of items. It has become commonplace to report
factor. Another reason in support of using more indicators the reliabilities of measures in any study, even when SEMs are
per factor is that with a greater number of indicators per used. In fact, the use of SEMs makes such a practice
factor, it is more difficult to achieve convergent and unnecessary or redundant, because the information provided
discriminant validity, and thus we obtain a tougher test of in factor loadings and error variances incorporates reliability
our hypotheses with more indicators per factor. On the other so to speak. Nevertheless, because researchers often choose to
J. of the Acad. Mark. Sci. (2012) 40:8–34 17

present explicit evidence of reliability in a traditional sense, and many journals require this, we will briefly draw linkages between reliability and SEMs.
The findings from a CFA can be used to compute indexes of reliability for a measure of a factor. The reliability of an individual item of a factor can be computed as

ρi = λi² var(factor) / (λi² var(factor) + θii)

where λi is the factor loading connecting an indicator to its hypothesized factor and θii is the variance of the error term corresponding to the indicator. For instance, the reliability of the third measure of the single factor for the analysis of the sample of aroused respondents summarized in Table 3 can be computed as ρx3 = .87²(1) / (.87²(1) + .24) = .76, where we have made use of the fact that the factor was standardized in the analyses. Some of the SEM programs provide estimates of item reliability, calling them squared multiple correlations.
Analogously, the reliability of all the items of a factor (sometimes termed composite reliability) is given by

ρcomposite = (Σλij)² var(factor) / ((Σλij)² var(factor) + Σθii)

where λij refers to factor loading i on factor j. For example, the composite reliability for the two indicators of the first factor of the two-factor solution of the data on unaroused respondents summarized in Table 3 can be computed as ρx1+x2 = (.95 + .90)²(1) / ((.95 + .90)²(1) + (.10 + .20)) = .92.
There are no universally accepted standards as to what minimally acceptable indicator and composite reliabilities should be. Because individual indicator reliabilities can be relatively low at times, yet the factor to which they correspond might perform satisfactorily in a larger model, somewhat more emphasis might be placed on composite reliability. Here we might use the oft-cited classic reliability standard of .70 or greater for a satisfactory composite reliability, although acceptable reliabilities somewhat below this value may be obtained when an overall causal or CFA model fits satisfactorily. For individual indicator reliability, standardized loadings of .70 or greater, so as to achieve a reliability of at least .50, seem ideal, with the logic being that one attains about 50% explained variance in the respective measure as a function of its factor (and avoids having an indicator with more than 50% error). However, in practice, for large models with many latent variables and indicators, one will occasionally find loadings as low as .50 that still occur within the context of a satisfactorily fitting overall model. Looking at only reliability, loadings of .70, .70, and .50 on a factor still yield a composite reliability of .67, for instance. So cut-off values for indicator and composite reliability might be taken with some leeway in mind. In any case, we feel that old standards for Cronbach's alpha and other formulae for reliability should not be applied rigidly to SEMs, and indeed focus should be placed more on the hypotheses under test in, and goodness-of-fit of, any SEM.

Reflective and formative indicators

The choice of whether to use reflective or formative indicators of latent variables is a controversial issue. The models shown in Figs. 1, 3, and 4 use reflective indicators; indicators are functions of latent variables plus error, which is close in spirit to classical test-score theory (Lord and Novick 1968) and the common factor model (McDonald 1999). Figure 5 shows a model with formative indicators, where we have taken the special case of what is known as the MIMIC (multiple-indicator, multiple-cause) model (e.g., Bagozzi et al. 1981). A latent variable might be socio-economic status, which is measured by education, income, and occupational prestige. In a formative model, a latent variable is a function of its proposed measures. Unlike the reflective measurement case, where a factor and its indicators can stand alone in the sense of being subject to tests of goodness-of-fit and yielding parameter estimates, the formative measurement case requires that a latent variable with formative indicators also include either reflective indicators or additional latent variables that have reflective indicators, which are predicted by the latent variable with formative indicators, in order for a model to be tested against data and parameters estimated. A limiting case of the formative model is principal components analysis.
A brief summary of the controversy follows. Some researchers believe that formative indicator models are fundamentally flawed and never should be used, and thus they recommend use only of reflective indicators (e.g., Howell et al. 2007a, b). The argument is that the regression parameters relating formative measures to a factor are not only functions of the intended connections but also depend on information contained in measures of other latent variables, which are predicted by the latent variable alleged to be measured formatively. As a result, such models are both ambiguous and indeterminant with regard to replicability to a large extent. Other researchers believe that some latent variables are by nature related to their measures formatively in an a priori conceptual sense, and therefore formative measurement is not only permitted but in a sense called for (e.g., Diamantopoulos and Siguaw 2006; Podsakoff et al. 2003).
While agreeing with Howell et al. (2007a) that formative measurement parameters are dependent on other measures in an SEM, which introduces ambiguity, Bagozzi (2007) argues that the special case of the MIMIC model can be meaningful, if one is interested primarily in the prediction of a linear combination of a set of dependent variables by a linear combination of a set of independent variables. The choice between formative and reflective indicators seems also

to depend on the metaphysical or ontological assumptions one is willing to make (Bagozzi 2011a; Diamantopoulos and Winklhofer 2001; Petter et al. 2007). Discussion of these esoteric issues is beyond the scope of this article, and the reader is referred to the aforementioned articles. But it should be recognized that formative indicator models do not yield meaningful measures of reliability and pose problems in terms of doing cross-validations, generalizations, and testing for construct validity (Bagozzi 2011a). For a presentation and discussion of alternative representations of formative models, see Bagozzi (2011a).
To illustrate the MIMIC model shown in Fig. 5, we applied the model to the data shown in Table 4. As summarized in Table 5, the model fits well: χ2(6)=13.78, p=.032, RMSEA=.066, NNFI=.98, CFI=.99, and SRMR=.028. Notice that education, income, and occupation are each significantly related to quality of diet, extent of exercise, and medication compliance through the mathematical transformation function performed by the latent variable, η. In estimating this model, we followed the normalization restriction fixing the variance of ζ to 1 suggested by Jöreskog and Goldberger (1975, p. 632).

Construct validity

Do not confuse the finger pointing to the moon with the moon.
Ancient Chinese proverb

Construct validity is the extent to which indicators of a construct measure what they are purported to measure. Unlike reliability, which is limited to the degree of agreement among a set of measures of a single construct, construct validity addresses both the degree of agreement of indicators hypothesized to measure a construct and the distinction between those indicators and indicators of a different construct(s). The notion is that we obtain a reasonable sense of the validity of indicators of a construct when the measures converge in the proper way and yet do not relate too highly with measures of something else. Implied in the idea of convergence is the logical implication that multiple measures of the same phenomenon should be highly correlated and correlated relatively uniformly in the sense of supporting a single factor, but not two or more factors. At the same time, correlations of measures of a proposed factor with measures of other factors should not only be significantly lower than the level of correlations establishing convergence, but also the pattern of correlations of measures across factors should be proportional in the sense of being relatively similar in magnitude. Usually, when considering reliability of indicators, the indicators come from the administration of a single procedure or method (e.g., a Likert disagree-agree scale of multiple items). This practice is not very demanding and risks common method biases. That is, convergence of measures may be as much a function of the singular method applied as it is of both the nature of the construct under scrutiny and the ability of the indicators to measure that construct. But when a single measurement procedure is used, it is not possible to disentangle true convergence from method bias. Construct validity procedures strive to overcome these drawbacks.
Construct validity methods were developed to consider the degree of convergence for a set of measures of a hypothesized construct and of discrimination between those measures and measures of a different construct. In the fullest implementation of a construct validation study, measures of a construct under scrutiny and measures of other constructs are obtained by multiple methods such that two or more methods are maximally similar and two or more are maximally different. Such an approach provides the most demanding test of validity. It is easier (perhaps too easy) to obtain convergence when two or more methods are similar (e.g., when a Likert scale and a semantic differential scale are used) than when they differ (e.g., use of a Likert scale of self-reports combined with peer or expert ratings of the persons responding to the Likert scale items). Likewise, it is more difficult to verify discrimination when two or more methods are similar than when different. Further, discrimination is harder to demonstrate when two or more constructs should be highly correlated, yet distinct, according to theory. As a consequence, the most informed construct validity analyses employ maximally similar and maximally dissimilar methods and include closely related constructs to yield a rigorous test of construct validity.
As an illustration, we applied the additive trait-method-error construct validity model shown in Fig. 6 to the data summarized in Table 6. The findings for this model can be seen in the second column of Table 7. This model fits very well on the basis of the χ2 statistic and four indexes of goodness-of-fit: χ2(12)=19.55, p=.08, RMSEA=.054,

Fig. 5 Formative factor analysis model
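The degrees of freedom reported for this MIMIC model (χ2 on 6 df) can be verified by counting sample moments against free parameters, as in the identification discussion earlier. A minimal sketch; the parameter inventory below is our reading of Fig. 5 and Table 5, with var(ζ) fixed to 1 as noted in the text:

```python
# Degrees of freedom for the MIMIC model of Fig. 5:
# 6 observed variables (education, income, occupation,
# quality of diet, extent of exercise, medication compliance).
p = 6
moments = p * (p + 1) // 2  # unique variances and covariances = 21

free_params = (
    6    # variances/covariances of the 3 exogenous x's (3 + 3)
    + 3  # gamma paths from the x's to the latent eta
    + 3  # lambda loadings from eta to the 3 reflective y's
    + 3  # theta-epsilon error variances of the y's
)        # var(zeta) is fixed to 1, so it is not counted

df = moments - free_params
print(df)  # 6, matching the reported chi-square(6)
```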

Table 4 Correlations for illustration of formative indicators under a MIMIC model (see Fig. 5)

Education 1.00
Income .59 1.00
Occupation .52 .47 1.00
Quality of diet .38 .39 .35 1.00
Extent of exercise .58 .62 .56 .69 1.00
Medication compliance .52 .57 .49 .64 .79 1.00

N=124
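The RMSEA values reported throughout are produced by the SEM programs themselves; for readers who want the underlying arithmetic, a commonly used formula (Steiger's, shown here with toy numbers only; this is an assumption about the computation, and programs differ slightly in the exact variant they implement) is:

```python
import math

def rmsea(chisq, df, n):
    """Steiger's RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

# Toy numbers: chi2 = 30 on df = 20 with n = 201 gives RMSEA = .05,
# the oft-cited "close fit" benchmark.
print(round(rmsea(30, 20, 201), 2))  # 0.05
```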

NNFI=.99, CFI=1.00, and SRMR=.04. Notice that all nine factor loadings relating indicators to the hypothesized latent variables of attitude, desire, and intention are high in value and statistically significant (.74–.96), and all nine corresponding error variances (θε1–θε9) are relatively low in value (−.01–.44). This means that most of the variation in measures of hypothesized factors is due to the phenomena being measured and little is due to measurement error. At the same time, because the factor loadings corresponding to the three methods of measurement—semantic differential, Likert, peer evaluation—are low to moderate in value (.08–.59), method bias is on average relatively low. All in all, one may conclude that the measures of attitude, desire, and intention achieve satisfactory convergent validity. Inspection of the standardized correlations among the three constructs (ψ31, ψ21, ψ32) reveals that we can reject the hypothesis that the factors are perfectly correlated: ψ31 = .28 (s.e.=.07), ψ21 = .37 (s.e.=.07), and ψ32 = .55 (s.e.=.06). Indeed, the correlations are far below 1.00, as confidence intervals around the estimated values confirm. Yet according to theoretical expectations, people's attitudes, desires, and intentions toward a target action are positively correlated. Notice, too, consistent with expectations, that self-report semantic differential and Likert methods are positively and highly correlated (ψ54 = .66 (s.e.=.13)), whereas peer ratings are correlated with self-reports at a low level, and in this case not significantly associated (ψ64 = −.16 (s.e.=.44), ψ65 = −1.11 (s.e.=1.15)).
To place the test of construct validity in perspective, we ran a model that hypothesizes that variation in indicators is due only to latent constructs plus measurement error. This model is identical to the model shown in Fig. 6, except now there are no method factors and corresponding factor loadings. If the indicators load on their hypothesized factors well and not too highly on the other factors, and thus achieve convergent and discriminant validity, plus are free from method bias, the model should fit well. The first column of Table 7 presents the findings, where it can be seen that the model, in fact, fits poorly: χ2(24)=241.64, p=.00, RMSEA=.19, NNFI=.81, CFI=.87, and SRMR=.05. This is consistent with the results of the test of the full model shown in Fig. 6, which demonstrated the presence of significant levels of method bias.
The advantage of employing multiple methods in a single study is that method bias, if any, can be explicitly estimated and taken into account. Most published studies use a single method, which makes it impossible to disentangle true variation from measurement error and method bias. When using only a single method in a study, it is possible, even common, to achieve satisfactory model fit. Nevertheless, it is impossible to discern whether the fit is a consequence of the hypothesized phenomena under investigation alone, method bias alone, or a combination of the phenomena and method bias. Implementation of multiple methods makes it possible to ascertain the relative contributions.
The construct validity model shown in Fig. 6 is known as the construct-error-method model or the trait-error-method model. There are a number of construct validity models, each with its advantages and disadvantages (see Bagozzi 2011a, for a description of many of these models). It turns out that the construct-error-method model can

Table 5 Findings for formative indicator model under a MIMIC model (see Table 4 and Fig. 5)

Parameter estimates
γ1 .36(.09)   λ1 .48(.04)   θε1 .48(.05)
γ2 .56(.09)   λ2 .63(.04)   θε2 .10(.03)
γ3 .42(.08)   λ3 .56(.04)   θε3 .29(.03)
R²diet = .52; R²exercise = .90; R²medication = .71

Goodness-of-fit
χ2(6)=13.78, p=.032
RMSEA=.066
NNFI=.98
CFI=.99
SRMR=.028
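The indicator and composite reliability formulas presented in the Reliability section can be sketched in a few lines; the worked values below reproduce the Table 3 examples from the text (standardized factors, so var(factor) = 1):

```python
def indicator_reliability(loading, error_var, factor_var=1.0):
    """rho_i = lambda_i^2 var(factor) / (lambda_i^2 var(factor) + theta_ii)."""
    num = loading ** 2 * factor_var
    return num / (num + error_var)

def composite_reliability(loadings, error_vars, factor_var=1.0):
    """rho_c = (sum lambda)^2 var(factor) / ((sum lambda)^2 var(factor) + sum theta)."""
    num = sum(loadings) ** 2 * factor_var
    return num / (num + sum(error_vars))

# Third measure of the single factor, aroused sample (Table 3):
print(round(indicator_reliability(0.87, 0.24), 2))                   # 0.76

# Two indicators of the first factor, unaroused sample (Table 3):
print(round(composite_reliability([0.95, 0.90], [0.10, 0.20]), 2))   # 0.92

# Loadings of .70, .70, .50 with standardized items (theta = 1 - lambda^2),
# the .67 example from the text:
lams = [0.70, 0.70, 0.50]
thetas = [1 - l ** 2 for l in lams]
print(round(composite_reliability(lams, thetas), 2))                 # 0.67
```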

Fig. 6 Construct validity example for the case of three constructs measured by three methods

be unstable, and often programs will not be able to converge and find satisfactory solutions. A model that usually works in such situations and that is frequently applied is the correlated uniqueness model. The correlated uniqueness model is similar to the construct-error-method model but differs in two ways. First, no methods factors are included. Second, the disturbances corresponding to common methods are allowed to be correlated. For example, taking method 1 in Fig. 6 as an illustration, instead of modeling a unique method factor,

Table 6 Correlation matrix for example examining construct validity (see Fig. 6)

Semantic differential
1. Attitude 1.00
2. Desire .35 1.00
3. Intention .30 .58 1.00
Likert
4. Attitude .78 .38 .34 1.00
5. Desire .33 .83 .54 .39 1.00
6. Intention .28 .48 .85 .31 .59 1.00
Peer evaluation
7. Attitude .65 .37 .21 .60 .29 .16 1.00
8. Desire .27 .74 .38 .18 .72 .35 .35 1.00
9. Intention .21 .32 .70 .14 .35 .71 .19 .49 1.00

N=207
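Before any CFA-based model is fit, a Campbell-Fiske-style inspection of Table 6 is informative: the monotrait-heteromethod ("validity diagonal") correlations should dominate the heterotrait-monomethod ones. A quick sketch using the correlations transcribed from the table (an informal complement to, not a substitute for, the model-based analysis in the text):

```python
# Monotrait-heteromethod correlations from Table 6:
# same construct, different methods.
convergent = [
    0.78, 0.83, 0.85,   # semantic differential vs. Likert
    0.65, 0.74, 0.70,   # semantic differential vs. peer evaluation
    0.60, 0.72, 0.71,   # Likert vs. peer evaluation
]

# Heterotrait-monomethod correlations: different constructs, same method.
discriminant = [
    0.35, 0.30, 0.58,   # within semantic differential
    0.39, 0.31, 0.59,   # within Likert
    0.35, 0.19, 0.49,   # within peer evaluation
]

mean_conv = sum(convergent) / len(convergent)
mean_disc = sum(discriminant) / len(discriminant)
print(round(mean_conv, 2), round(mean_disc, 2))  # 0.73 0.39
```

The validity correlations average roughly twice the heterotrait-monomethod ones, consistent with the convergent and discriminant validity conclusions drawn from the CFA.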

Table 7 Summary of findings for construct validity illustration (see Table 6 and Fig. 6)

Construct-error model    Construct-error-method model    Correlated uniqueness model
λ11 .90(.06) .88(.06) .94(.06)


λ21 .86(.06) .83(.06) .87(.06)
λ31 .71(.06) .74(.06) .75(.06)
λ42 .92(.05) .82(.06) .90(.05)
λ52 .91(.06) .90(.06) .90(.05)
λ62 .79(.06) .96(.06) .88(.06)
λ73 .93(.05) .83(.06) .92(.05)
λ83 .92(.05) .88(.06) .90(.05)
λ93 .76(.06) .89(.06) .85(.06)
λ14 − .08(.06) −
λ44 − .43(.11) −
λ74 − .59(.11) −
λ25 − .27(.07) −
λ55 − .37(.10) −
λ85 − .35(.09) −
λ36 − .11(.07) −
λ66 − .32(.18) −
λ96 − .17(.10) −
θε1 .18(.04) .20(.05) .16(.05) θε1, ε4 =−.06(.02)
θε2 .26(.05) .20(.04) .26(.04) θε1, ε7 =−.05(.02)
θε3 .49(.06) .44(.05) .49(.06) θε4, ε7 =.08(.02)
θε4 .15(.03) .17(.04) .16(.03) θε2, ε5 =.09(.03)
θε5 .18(.03) .05(.03) .20(.03) θε2, ε8 =.04(.02)
θε6 .37(.04) .01(.06) .34(.04) θε5, ε8 =.10(.02)
θε7 .14(.03) -.01(.09) .15(.03) θε3, ε6 =.12(.03)
θε8 .16(.03) .10(.03) .16(.03) θε3, ε9 =.10(.04)
θε9 .43(.05) .18(.04) .40(.05) θε6, ε9 =.25(.04)
ψ31 .35(.07) .28(.07) .35(.07)
ψ21 .44(.06) .37(.07) .43(.06)
ψ32 .62(.05) .55(.06) .56(.05)
ψ64 − -.16(.44) −
ψ54 − .66(.13) −
ψ65 − −1.11(1.15) −
χ2(df) 241.64(24), p=.00 19.55(12), p=.08 40.80(15), p=.00
RMSEA .19 .054 .089
NNFI .81 .99 .96
CFI .87 1.00 .98
SRMR .05 .04 .067

Standard errors in parentheses
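Because the construct-error model is nested in the construct-error-method model (fix the nine method loadings and three method-factor correlations to zero), the first two columns of Table 7 support a chi-square difference test. A minimal sketch; the critical value is the standard χ2(12), α = .05 table entry:

```python
# Chi-square difference test between the nested models of Table 7.
chisq_restricted, df_restricted = 241.64, 24   # construct-error model
chisq_full, df_full = 19.55, 12                # construct-error-method model

delta_chisq = chisq_restricted - chisq_full    # 222.09
delta_df = df_restricted - df_full             # 12
critical_05 = 21.03                            # chi-square(12) at alpha = .05

print(round(delta_chisq, 2), delta_df, delta_chisq > critical_05)
# Far beyond the critical value: adding the method factors
# improves fit significantly, i.e., method bias is present.
```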

we model three correlated errors: θε4ε1, θε7ε1, θε7ε4. With all three methods taken into account, this means that a total of nine residuals need to be correlated. Column 3 in Table 7 presents the findings for the correlated uniqueness model applied to the data shown in Table 6. This model does not fit the data well and gives what might be termed a mediocre fit in that only the NNFI, CFI, and SRMR are satisfactory, and the RMSEA is somewhat too high in value. Another problem with this particular application of the correlated uniqueness model is that some of the correlated errors are negative and significant, some positive and significant, even for the same factor. This is difficult to reconcile and suggests that the model is not appropriate in this case. But again to reiterate, the correlated uniqueness model is

often appropriate in practice, and we present it here for illustrative purposes.

"Causal" models

Up to this point, we have focused on measurement issues, with particular emphasis upon the CFA model and its manifestation as a means for testing scales, computing reliability, choosing reflective or formative indicators, and ascertaining construct validity. Because factors are embedded in SEMs, many of the principles discussed above apply to the topics to which we now turn.
We placed quotation marks around the word causal in this section's title to draw attention to the point that SEMs have applicability to testing causal hypotheses yet are relevant as well to testing functional relationships, generalizations, cross-validations, and predictions. Also we wish to acknowledge that researchers have sometimes over-claimed the power of SEMs to show causation, and readers have sometimes been misled in that regard. To make clear our position, we stress that, like any statistical procedure, SEMs by themselves do not prove causality. Statistical procedures need to be deployed along with sound methodological procedures (e.g., experimentation, quasi-experiments, longitudinal designs) to support causal claims. Even here, the issues are complex, and philosophical, theoretical, and operational concerns need to be scrutinized to interpret any research investigation. As a consequence, we begin with a discussion of causality before turning to specific models, issues, and examples of "causality."

Causality

A claim of proof of cause and effect must carry with it an exploration of the mechanism by which the effect is produced.
William G. Cochran

Researchers using SEMs often interpret relationships between exogenous and endogenous variables as causal relationships, and at the same time, they assert that relationships between latent variables and manifest variables are also causal (e.g., Bagozzi 1980; Edwards and Bagozzi 2000). However, such claims are particularly acute, contentious, and in need of further discussion.
Consider first the point of view that relationships among exogenous and endogenous latent variables are, or at least can be, causal. Researchers, particularly psychologists, tend to reserve claims of causality for cases where independent variables can be manipulated under controlled conditions so as to observe changes in a dependent variable. Indeed, experimentation is given pride of place for establishing causality by psychologists: "Experiments explore the effect of things that can be manipulated…. Nonmanipulable events … cannot be causes in experiments because we cannot deliberately vary them to see what then happens" (Shadish et al. 2002, p. 7, emphasis in original). This perspective on causality fits well the manipulability theory in philosophy (Von Wright 1974; for discussion of various models of causality, see Bagozzi 1980, ch. 1). However, as a sole or even primary basis for justifying claims of causality, one philosopher notes that the manipulability theory "may well have things backward: the concept of action seems to be a richer and more complex concept that presupposes the concept of cause, and an analysis of cause in terms of action could be accused of circularity" (Audi 1995, p. 111).
Experimentation is certainly a fruitful area where causal models are applicable (e.g., Bagozzi 1977; Bagozzi and Yi 1989; Bagozzi et al. 1991b). But even in the experimental case, a number of issues deserve consideration (Bagozzi 2010). First, in any experiment, there are likely to be one or more threats to validity, some unknown or unaddressed, and as a consequence, a failure to reject the null hypothesis perhaps is at best fallible and tentative evidence for causality. Second, among the many competing interpretations of causation in the philosophy of science, the assumptions and implications of each constrain the meaning of causation and at the same time harbor advantages and disadvantages vis-a-vis other perspectives. So it seems that even under experimental conditions, it is best not to be overly sanguine that one is truly observing causality. Depending on which of the many competing models of causality one embraces, non-experimental approaches (e.g., so-called quasi-experiments, cross-sectional and longitudinal surveys, even qualitative research methods) might satisfy some requirements for causality better than experimental methods in certain instances. All of this is simply to say that, in particular studies, it will not in general be indisputably clear that an experimental approach accords with all criteria for causation better than a non-experimental approach does. It all depends on what model of causality is followed and how well the competing methods satisfy the criteria under the models. Note that SEMs can be and have been applied to experimental data, so we are not necessarily comparing ANOVA analyses of traditional experimental data to SEM analyses of augmented experimental data.
There is a further issue worthy of comment concerning inferences of causality between latent variables. Consider the following definition of causality, which is perhaps general enough so as not to run afoul of most of the competing models of causality. Causality is

the relation between two events that holds when, given that one occurs, it produces, or brings forth, or

determines, or necessitates the second; equally we say that once the first has happened the second must happen or that the second follows on from the first…. [Furthermore, causation] suggest[s] that states of affairs or objects or facts may also be causally related. (Blackburn 1994, p. 59)

Most philosophers of science seem to regard causality as something we infer between physical or material entities or changes therein, hence reference above to "events," "states of affairs," "objects," and "facts." But latent variables are abstractions or unobservables and are nonmaterial, though we hope they represent or capture variance observed in manifest variables. The "causal" parameters derived from estimation of SEMs (γ's and β's) are inferred statistics from relationships amongst (material) manifest variables (e.g., observations or measures of events, states of affairs, etc.). The parameter estimates of causal relationships between latent variables might be best construed as imperfect, fallible signs of whatever causal process one is studying that occurs between the measures of causes and effects. SEMs may be specified to correct for random and systematic errors, and thus γ's and β's may be purged of such errors, but it is important to recognize that causality is at best estimated (inferred) from the data and predicated on the nature of the data and methodological conditions applied in any study. Researchers using SEMs, even under the control of experimental conditions, should not go too far in suspending conceptual, empirical, and methodological beliefs and assumptions, and in overclaiming causality.
Now to the claim that the relationship between a latent variable and its manifest or measured variables is causal. This is a longstanding point of view exposited early on by Blalock, Bollen, Heise, and many others from the 1970s and continuing to the present by Edwards and Bagozzi (2000), Jarvis et al. (2003), and nearly every treatment of the relationship in the literature. An argument can be made that use of "causal language" to characterize the relationship between latent and manifest variables is misguided. It appears that the relationship in question is not causal, per se, but rather one of hypothetical measurement. That is, the relationship is between an abstract, unobserved concept and a concrete, observed measurement; the relationship is partly logical, partly empirical, and partly theoretical (conceptual), with the inferred factor loading representing only part of the empirical meaning of the relationship (Bagozzi 1984, 2011a).
In sum, SEMs represent different relationships that require a healthy application of interpretation. Relationships amongst latent exogenous and endogenous variables are best construed as imperfect representations of causal relationships, with those founded on experimental data coming closest to achieving the designation or accolade, causal, and those arising from survey research given less credence as causal. Further, cross-sectional survey data in this regard might be better interpreted as yielding evidence for functional relationships or alternatively relationships believed to be consistent with causal relationships as far as they go but not sufficiently strong to suggest causality to the degree that experiments do. Longitudinal survey data potentially support stronger interpretations than do cross-sectional data but weaker interpretations than experimental data in the typical case. To avoid either/or, categorical thinking leading to overly strong claims of causality for experimental research or premature dismissal of survey research as giving no support whatsoever for causal claims, it is best to think about a causality continuum, marked by relatively strong (experiments) and relatively weak (surveys) labels as endpoints, and longitudinal surveys somewhere in between strong and weak. Another method with intermediate claims of causal credence might be field or quasi-experiments, where somewhat less control than pure experiments is afforded (e.g., testing hypotheses across multiple groups in naturalistic settings or between groups formed fortuitously akin to controlled experiments or where events treated as independent variables happen fortuitously). SEMs apply in all these cases and suggest differing bases and different degrees of evidence for concluding causality. Some sub-criteria for causality might be better met in naturalistic field experiments and longitudinal designs than in pure experiments in certain albeit relatively rare cases.
What about the designation predictive? If a study tests a theory and exogenous and endogenous variables are linked significantly according to the theory, we might term the relationship explanatory (e.g., ξ explains η) and then decide whether or not, or to what degree, causality can be claimed. When the exogenous and endogenous variables are separated in time, the relationship might be called an explanatory prediction. Nevertheless, we prefer to use the term prediction when an existing theory leads to the forecast or discovery of a new phenomenon or outcome. This latter usage is consistent with some philosophy of science characterizations of what constitutes a (strong) theory. That is, a theory that explains what it is supposed to explain is given less acclaim than one that also leads to new discoveries or predictions. To keep this important distinction concerning theories, it seems best to speak of explanatory prediction and prediction as pointing to still another continuum for interpretive purposes.

Higher-order causal models

As a bridge between CFA models and causal models, consider the second-order model displayed in Fig. 7. Heretofore, our presentation of latent variables always showed them connected directly to indicators and linked
Fig. 7 Second-order confirmatory factor analysis and causal model (see Table 8)

thereby with factor loadings. But in Fig. 7, we illustrate a second-order factor, ξ, which is only indirectly connected to manifest variables (through η1–η3).

To understand what a second-order factor is and what value it has, imagine that we wish to represent the social identity of consumers with a company and test the effects of this identity on support for the company’s brand. Social identity is said to be comprised of three components or dimensions: cognitive identity (sometimes termed identification, or alternatively self-awareness of group membership), emotional identity (often called affective commitment), and evaluative identity (alternatively characterized as collective or group-based esteem) (Bagozzi et al. 2011). In Fig. 7, we might construe ξ as capturing social identity as a highly abstract representation of overall social identity arising from and displayed toward a company, whereas η1–η3 constitute less abstract, specific components of social identity in the form of cognitive, affective, and evaluative responses. The second-order model in Fig. 7 then hypothesizes that social identity influences brand responses (η4) such as positive word of mouth about the brand (y7) and resilience to negative information about the brand (y8).

In addition to representing the structure of a higher-level construct and its effects on a dependent variable(s), the second-order model provides a way to address certain forms of multicollinearity. It is useful to consider two manifestations of multicollinearity and how higher-order SEMs might overcome problems with the efficiency of parameter estimates and avoid false inferences due to multicollinearity. Multicollinearity may become a problem when the correlations of independent variables with a dependent variable are lower than the correlations among the independent variables. A common outcome in such cases is that a true positive (negative) effect turns out to be nonsignificant or even becomes negative (positive), demonstrating reversals in sign.

One form of multicollinearity happens as redundancy due to multiple measures of the same constructs. The cognitive, emotional, and evaluative identities shown in Fig. 7 each have two indicators. If one were to regress measured brand responses directly on these indicators by use of multiple regression, then multicollinearity outcomes might occur. The three first-order identity factors shown in Fig. 7 not only eliminate these problems but also take into account the unreliability present in y1–y6. Before the development of SEMs, researchers investigating hypotheses implied in Fig. 7 either treated all six indicators as separate independent variables, averaged the respective pairs of indicators and used the averages as independent variables, or ran two regressions: one with one indicator each corresponding to η1–η3, and a second with the other indicator each for η1–η3. None of these traditional analyses takes into account all the information available or corrects for measurement error.

The other form of multicollinearity arises due to high correlations among latent variables. Often independent
variables are naturally correlated at high levels due to their nature. For example, different mental events—cognitions, motives, attitudes, desires, intentions—are frequently highly correlated because they refer to a common event or target or because they influence each other. Yet they are distinct mental events, possibly with different as well as shared causes and effects. Cognitive, emotional, and evaluative aspects of social identity are instances of such highly correlated variables. When the correlations among a set of latent variables functioning as antecedents are high, multicollinearity problems might take place. This source of redundancy occurs as a result of high correlations between measures of different constructs.

The second-order model shown in Fig. 7 is most valid and meaningful when the first-order factors can be interpreted as dimensions or components of the more abstract second-order concept. Moreover, the first-order factors should be relatively highly correlated. When either or both of the above-mentioned criteria are not met, the second-order factor approach to multicollinearity or for representing higher-order constructs will not be justified.

To illustrate the second-order model shown in Fig. 7, we applied it to the data summarized in Table 8. Table 9 presents the results, where we can see that the model fits well: χ2(16)=17.00, p=.39, RMSEA=.013, NNFI=1.00, CFI=1.00, and SRMR=.013. The high factor loadings and low error variances demonstrate that the proposed indicators capture well the constructs that they were hypothesized to measure. One can obtain a sense of the relative contribution of each component of social identity to overall social identity by inspection of the γ’s shown in Table 9. Here the standardized values suggest that the components are relatively uniform in their measurement of social identity: γ1=.80, γ2=.85, and γ3=.80. The amount of explained variance in brand responses can be computed as R2 = 1 − ψ44(std) = 1 − .68 = .32.

To show what would have happened if brand responses (η4) were modeled as direct functions of the social identity components (η1–η3), without including a second-order social identity factor (ξ), we ran the model on the data accordingly. This model fit well overall: χ2(14)=16.99, p=.26, RMSEA=.028, NNFI=1.00, CFI=1.00, and SRMR=.013. However, consistent with our portrayal of multicollinearity above, two of three paths from η1–η3 to η4 were non-significant: β41=.21 (s.e.=.12), β42=.26 (s.e.=.11), and β43=.20 (s.e.=.12).

Illustration of a causal model

We are finally ready to provide an example of a causal model that is fairly typical of those found in the literature. Table 10 presents a summary of data for a sample of people considering the adoption of an innovation. The data were tested against the TAM in Fig. 1.

Table 11 sums up the findings. Notice first that the model fits fairly well: χ2(46)=100.35, p=.00, RMSEA=.07, NNFI=.97, CFI=.98, and SRMR=.04. Factor loadings are high and error variances low. Perceived ease of use (ξ) significantly influences perceived usefulness (η1, γ11=.43 (s.e.=.08)), attitude (η2, γ21=.25 (s.e.=.07)), and subjective norm (η3, γ31=.30 (s.e.=.09)). In turn, perceived usefulness (η1) has a strong impact on both attitude (η2, β21=.36 (s.e.=.07)) and subjective norm (η3, β31=.23 (s.e.=.08)). Next, attitude (η2) affects intention (η4, β42=.46 (s.e.=.08)), as does subjective norm (η3, β43=.26 (s.e.=.06)). Finally, intention (η4) determines usage (η5, β54=.93 (s.e.=.10)). The respective explained variances in usefulness, attitude, subjective norm, intention, and usage are .16, .32, .16, .38, and .34 (calculated as one minus the respective standardized error variances for the η’s).

A number of model fitting and specification issues can be pointed out in this example. Ideally, the theory that one develops or draws upon will be strong, hypotheses will be translated into a causal model well, and care will be taken in the operationalization of constructs. Under these conditions, the model could fit the data well. But when a posited model fails to fit the data, what is one to do? Sometimes one or more indicators might be especially poor. An inspection of the magnitude and pattern of loadings on a factor can point to this problem. For example,

Table 8 Covariance matrix for second-order confirmatory factor analysis example (see Fig. 7)

Brand response 1       3.69
Brand response 2       2.64  3.53
Cognitive identity 1   1.02   .94  2.31
Cognitive identity 2   1.03  1.04  1.86  2.22
Affective identity 1   1.17  1.18  1.50  1.38  2.72
Affective identity 2   1.33  1.18  1.46  1.38  2.40  2.86
Evaluative identity 1   .96   .97  1.24  1.29  1.42  1.43  2.46
Evaluative identity 2  1.09  1.04  1.22  1.18  1.51  1.50  1.96  2.50

N=254
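The sign-reversal outcome described above is easy to reproduce numerically. The following sketch is a hypothetical illustration of our own (it does not use the article's data): two standardized predictors correlate .90 with each other but only .50 and .40 with the criterion, and solving the normal equations, beta = Rxx⁻¹ rxy, flips the second regression weight negative even though its zero-order correlation with the criterion is positive.

```python
import numpy as np

# Correlation matrix of two standardized predictors (r = .90 between them),
# each correlating positively with the criterion y (.50 and .40).
Rxx = np.array([[1.0, 0.9],
                [0.9, 1.0]])
rxy = np.array([0.5, 0.4])

# Standardized regression weights: beta = Rxx^{-1} rxy
beta = np.linalg.solve(Rxx, rxy)
print(beta)  # first weight positive, second negative despite r = +.40 with y
```

Because the predictor intercorrelation (.90) exceeds both predictor-criterion correlations, the weights are .74 and −.26: exactly the reversal in sign described in the text.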
Table 9 Results for second-order confirmatory factor analysis example (see Table 8 and Fig. 7)

Parameter estimates

                       Factor loadings               Error variances
                       Unstandardized  Standardized  Unstandardized  Standardized

Cognitive identity 1   1.00a           .91           .38(.09)        .17
Cognitive identity 2    .96(.06)       .90           .43(.09)        .19
Affective identity 1   1.00a           .94           .33(.09)        .12
Affective identity 2   1.00(.05)       .92           .46(.09)        .16
Evaluative identity 1  1.00a           .88           .54(.10)        .22
Evaluative identity 2  1.02(.07)       .89           .50(.11)        .20
Brand response 1       1.00a           .86           .95(.27)        .26
Brand response 2        .96(.10)       .85          1.00(.25)        .28

Variances for errors in equations
Unstandardized: ψ11=.68(.12), ψ22=.68(.14), ψ33=.69(.13), ψ44=1.87(.31)
Standardized: ψ11=.35, ψ22=.28, ψ33=.36, ψ44=.68

Second-order factor loadings
Unstandardized: γ1=1.12(.09), γ2=1.31(.10), γ3=1.11(.09), γ4=.93(.12)
Standardized: γ1=.80, γ2=.85, γ3=.80, γ4=.56

Goodness-of-fit
χ2(16)=17.00, p=.39
RMSEA=.013
NNFI=1.00
CFI=1.00
SRMR=.013

Standard errors in parentheses
a Constrained parameter
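As a quick arithmetic check on the standardized solution in Table 9 (a sketch we add for illustration), each indicator's squared standardized loading plus its standardized error variance should come to approximately 1.00, and the explained variance in brand responses is one minus the standardized error variance in its equation:

```python
# Standardized (loading, error variance) pairs taken from Table 9.
entries = {
    "Cognitive identity 1": (0.91, 0.17),
    "Affective identity 1": (0.94, 0.12),
    "Evaluative identity 1": (0.88, 0.22),
    "Brand response 1": (0.86, 0.26),
}

for name, (loading, theta) in entries.items():
    # loading^2 + theta should be close to 1.00 in a standardized solution
    print(f"{name}: {loading**2 + theta:.2f}")

# Explained variance in brand responses: R^2 = 1 - psi_44(standardized)
print(f"R^2 = {1 - 0.68:.2f}")  # .32, as reported in the text
```

The small departures from 1.00 reflect rounding of the published estimates to two decimals.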

one loading may be low and two others high. The indicator corresponding to the low loading might be dropped and the model re-run. Of course, such a practice risks capitalizing on chance, so it would be a good idea to get at the bottom of why the poor indicator performed as it did. Further data collection or cross-validation might provide insight in this regard. A somewhat similar outcome that occurs now and then is that two indicators, say, load relatively low on a factor, whereas two others load high. This might mean that the two indicators loading at a low level are again suspect as measures of the factor under consideration, or it might suggest that the factor was specified incorrectly in the sense

Table 10 Covariance matrix of indicators for the technology acceptance model in Fig. 1

Usefulness 1       2.76
Usefulness 2       2.33  2.92
Usefulness 3       2.48  2.53  2.96
Attitude 1         1.11  1.20  1.05  2.31
Attitude 2         1.11  1.06  1.07  1.73  2.19
Subjective norm 1   .90   .75   .96   .96   .98  3.06
Subjective norm 2   .78   .83   .78  1.02   .92  2.59  2.82
Intention 1         .46   .58   .61  1.13  1.08  1.09  1.12  2.40
Intention 2         .50   .54   .49  1.10  1.14  1.18  1.16  2.17  2.46
Usage               .53   .67   .59  1.12  1.02  1.49  1.35  1.85  2.02  5.29
Ease of use 1       .83   .75   .95   .71   .79   .77   .87   .38   .33   .37  2.62
Ease of use 2       .76   .89   .97   .86   .81   .69   .79   .44   .39   .51  1.98  2.46

N=204
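Covariance matrices such as Table 10 (lower triangle shown) are the typical input to SEM programs. As a sketch (our own illustrative code, with variable order following the table), one can rebuild the full symmetric matrix and rescale it to correlations for inspection:

```python
import numpy as np

# Lower triangle of the Table 10 covariance matrix (12 indicators, N = 204):
# usefulness 1-3, attitude 1-2, subjective norm 1-2, intention 1-2,
# usage, ease of use 1-2.
rows = [
    [2.76],
    [2.33, 2.92],
    [2.48, 2.53, 2.96],
    [1.11, 1.20, 1.05, 2.31],
    [1.11, 1.06, 1.07, 1.73, 2.19],
    [0.90, 0.75, 0.96, 0.96, 0.98, 3.06],
    [0.78, 0.83, 0.78, 1.02, 0.92, 2.59, 2.82],
    [0.46, 0.58, 0.61, 1.13, 1.08, 1.09, 1.12, 2.40],
    [0.50, 0.54, 0.49, 1.10, 1.14, 1.18, 1.16, 2.17, 2.46],
    [0.53, 0.67, 0.59, 1.12, 1.02, 1.49, 1.35, 1.85, 2.02, 5.29],
    [0.83, 0.75, 0.95, 0.71, 0.79, 0.77, 0.87, 0.38, 0.33, 0.37, 2.62],
    [0.76, 0.89, 0.97, 0.86, 0.81, 0.69, 0.79, 0.44, 0.39, 0.51, 1.98, 2.46],
]
S = np.zeros((12, 12))
for i, r in enumerate(rows):
    S[i, :len(r)] = r
S = S + S.T - np.diag(np.diag(S))  # mirror the lower triangle

# Rescale covariances to correlations: R = D^{-1/2} S D^{-1/2}
d = 1 / np.sqrt(np.diag(S))
R = S * np.outer(d, d)

print(round(R[7, 8], 2))  # intention 1 with intention 2 -> 0.89
```

Inspecting the correlation metric this way makes it easy to spot, for instance, that the two intention indicators correlate near .89, foreshadowing the strong loadings reported in Table 11.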
Table 11 Findings for the technology acceptance model (see Table 10 and Fig. 1)

Measurement parameter estimates

        Factor loadings                 Error variances
        Unstandardized  Standardized          Unstandardized  Standardized
λy11    1.00a           .91             θε1   .46(.07)        .17
λy21    1.02(.05)       .90             θε2   .54(.07)        .19
λy31    1.08(.05)       .95             θε3   .28(.06)        .09
λy42    1.00a           .87             θε4   .55(.12)        .24
λy52     .99(.08)       .89             θε5   .47(.12)        .21
λy63    1.00a           .91             θε6   .50(.15)        .16
λy73    1.01(.07)       .96             θε7   .21(.15)        .07
λy84    1.00a           .93             θε8   .35(.08)        .14
λy94    1.05(.05)       .96             θε9   .19(.08)        .08
λy10,5  1.00a          1.00a            θε10  .00a            .00a
λx11    1.00a           .86             θδ1   .69(.20)        .26
λx21    1.02(.11)       .91             θδ2   .43(.20)        .18

Structural parameter estimates

        Gamma (γ’s)                     Beta (β’s)
γ11     .43(.08)        .40             β21   .36(.07)        .41
γ21     .25(.07)        .26             β31   .23(.08)        .22
γ31     .30(.09)        .26             β42   .46(.08)        .43
                                        β43   .26(.06)        .29
                                        β54   .93(.10)        .58

Errors in equations (ζ) and variances and covariances (ψii, ψij)
Unstandardized: ψ11=1.94(.24), ψ22=1.19(.18), ψ33=2.15(.28), ψ44=1.28(.16), ψ55=3.52(.36), ψ32=.50(.14)
Standardized: ψ11=.84, ψ22=.68, ψ33=.84, ψ44=.62, ψ55=.66, ψ32=.23
Explained variances: R2usf=.16, R2att=.32, R2SN=.16, R2I=.38, R2u=.34

Goodness-of-fit
χ2(46)=100.35, p=.00
RMSEA=.07
NNFI=.97
CFI=.98
SRMR=.04

a Fixed parameter; standard errors in parentheses
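The overall fit statistics in Table 11 can be checked from the χ2 value, its degrees of freedom, and N, and the reported path estimates can be combined into indirect effects. The sketch below is our own illustration using the standard point-estimate formula RMSEA = sqrt(max(χ2 − df, 0)/(df·(N − 1))); programs differ slightly in rounding and in using N versus N − 1, so the result lands near the reported .07.

```python
from math import sqrt

from scipy.stats import chi2

chisq, df, N = 100.35, 46, 204  # overall fit reported in Table 11

# Exact upper-tail p-value behind the reported p = .00
p = chi2.sf(chisq, df)
print(f"p = {p:.2e}")

# RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (N - 1)))
rmsea = sqrt(max(chisq - df, 0.0) / (df * (N - 1)))
print(f"RMSEA = {rmsea:.3f}")  # about .076

# Indirect effect of attitude on usage transmitted through intention,
# from the unstandardized paths beta42 and beta54 in Table 11
beta42, beta54 = 0.46, 0.93
print(f"attitude -> intention -> usage: {beta42 * beta54:.2f}")
```

The product of the two unstandardized paths (.46 × .93 ≈ .43) is the mediated effect whose completeness the full-mediation test discussed below is meant to establish.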

that it was not actually unidimensional but multidimensional. If multidimensionality is the reason for the poor model fit and makes sense conceptually, then the model might be re-run with two factors replacing the original single one. Examples of this have occurred in the literature a number of times when researchers proposing a single dimension for attitudes used semantic differential and Likert items such that some were largely evaluative in content and others affective (cf. Bagozzi et al. 2001). Still another possibility is that all four indicators are sound from a specification or theoretical point of view, but the two offending indicators share a bias not common to the two satisfactory indicators. If a basis exists for hypothesizing such a bias, then the residuals for the two offending indicators can be allowed to be correlated, and the model re-run.

Another relatively common occurrence is that some unspecified paths between latent variables keep the model from fitting well. For latent variables that were specified as unconnected in a test of the model, there may be reason to believe that they are truly related. Alternatively, a variable has been omitted from the model that affects two or more unlinked latent endogenous variables. When the above two cases happen, model fit may be improved significantly by either adding a path(s) or allowing corresponding error terms of the latent variables in question to be correlated.

How is one to detect the above problems leading to poor model fit and/or specification errors? Beyond relying on intuition, logic, and trial and error, one can examine what are termed in some programs Lagrange Multipliers or modification indexes. These indexes suggest whether adding a path to a model or allowing error terms to be correlated would improve a model. Programs indicate for each fixed or constrained parameter in a model how far the χ2 value will drop if the constraint is relaxed and the model re-run. The largest index in the output provides this information and may be feasible to rely upon when the χ2 value decreases by 3.84 or more. Again, one should always be wary of capitalizing on chance; a theoretical or methodological rationale should be offered (an example of the latter might be common method bias for certain indicators pointing to specification of correlated errors), and cross-validation and/or new data collection should ideally be performed.

To illustrate some of these ideas, we examined the modification indexes for the output summarized in Table 11. The largest modification index was 20.64 for θ62 (i.e., the correlation between the error term for the second usefulness indicator and the first indicator of subjective norm). Freeing up this parameter would thus improve model fit considerably in terms of the χ2 statistic. In fact, the model fit is now χ2(45)=78.00, p=.00, RMSEA=.056, NNFI=.98, CFI=.99, and SRMR=.037, with θ62=−.22 (s.e.=.05). However, we can think of no plausible rationale for doing this, and the negative covariance does not make sense, so we conclude that this is a spurious capitalization on chance, and therefore we do not modify the model accordingly. Moreover, because the overall model fit was satisfactory without estimating these correlated residuals, relaxing the original restriction would constitute an instance of overfitting. Generally, when one achieves a good-fitting model and yet chooses to relax a restriction, such a practice might be suspect and capitalize on chance.

One last point: for causal models of the sort shown in Fig. 1, which is intended to represent a researcher’s hypotheses under TAM, it is incumbent on the researcher to demonstrate and test for full mediation (e.g., whether intention channels all the effects of attitude on usage or whether a direct path also exists from attitude to usage). We did this for the model and data at hand and confirmed that direct paths not hypothesized were indeed nonsignificant. For an example of how this is done and how findings are presented in this regard, see Bergami and Bagozzi (2000, pp. 570–572) and Bagozzi and Dholakia (2006, pp. 55–56).

Miscellaneous topics

Goodness-of-fit

Many indexes of goodness-of-fit exist to appraise an entire model, but the χ2 statistic is in some ways the most fundamental. The χ2 statistic can be used to test the null hypothesis that the estimated or implied variance-covariance matrix of indicators reproduces the observed or sample variance-covariance matrix. For SEMs, a good fit is obtained when the χ2 statistic is nonsignificant, which by convention is taken to happen for p-values≥.05. As Barrett (2007) notes, the χ2 test is “the most direct and obvious test of model fit” (p. 823) and indeed “is the ONLY statistical test for SEM models at the moment” (p. 818).

Because the χ2 is sensitive to sample size, it becomes difficult to achieve satisfactory model fits as the sample size increases. As a result, researchers have proposed a number of indexes of practical fit. Before we describe these, it might be mentioned that some researchers are rather dismissive of the use of such indexes. Barrett (2007), for example, is quite critical of the use of practical indexes and feels that their uses in “social science, business, marketing, and management journals are littered with … a plethora of forgettable, non-replicable, and largely ‘of no practical measureable consequence’ models” (p. 823). It is beyond the scope of this article to explore Barrett’s thoughtful comments. For a nice treatment of many of the issues of concern, see Marsh et al. (2004). Our presentation here will be limited to “best practices” as currently followed in the literature for evaluating model fit.

At this time, the generally recognized and recommended practical fit indexes are the RMSEA (root mean square error of approximation), which gives the average amount of misfit for a model per degree of freedom; the NNFI (nonnormed fit index; also termed the TLI, Tucker-Lewis index), which rewards model parsimony/penalizes model complexity; the CFI (comparative fit index), which is an indicator of relative noncentrality between a hypothesized model and the null model of modified independence (i.e., a model where only error variances are estimated); and the SRMR (standardized root mean square residual), which is the square root of the average squared residuals. Hu and Bentler (1998, 1999) recommend the following standards for assessing models: RMSEA≤0.06, NNFI≥0.95, CFI≥0.95, and SRMR≤.08. The rationale for relying on these four indexes seems to be the absence of a single standard for evaluating model fit (except perhaps for the χ2 statistic), where one desires (1) an index confined to a precise range such as 0.00 to 1.00 inclusive (the NNFI can exceed 1.00 for very good model fits, which happens rarely and then tends to go above 1.00 by small amounts; the CFI is arbitrarily restricted to a maximum of 1.00, but without this restriction also could exceed 1.00 by a small amount), (2) distinct cut-off values (but some disagreement exists on what these should be; see below), (3) provision for penalizing model complexity/rewarding model parsimony (the CFI tends to fit more complex models better than parsimonious ones; the NNFI and RMSEA tend to reward for parsimony/penalize for complexity, but can disagree between themselves at times),
and (4) indexes independent of sample size (the CFI and RMSEA are relatively independent of sample size; the SRMR and NNFI are not). Although no single index meets all the above criteria, the set of four presented above collectively provides satisfactory criteria for overall model evaluation. Hu and Bentler (1999) seem to suggest that it may be sufficient to rely on two of the four indexes (RMSEA and CFI). But of course, a more stringent standard would be to satisfy all four indexes (e.g., Fan and Sivo 2005).

A common outcome in everyday research is that the χ2 test is significant, and one or more of the four indexes is unacceptable under the Hu and Bentler recommendations presented above. What should one do in such circumstances? As mentioned above, the χ2 test can be very sensitive to sample size (see discussion below on sample size). If one suspects that a “large” sample size is the cause of a significant χ2 test, it may be satisfactory to scrutinize the four practical fit indexes. But some ambiguity exists as to what acceptable cut-offs should be. The Hu and Bentler (1999) recommendations are a good starting point, but we offer the following caveats. Because one or a few “high” residuals might be obscured under the average provided by the SRMR, we recommend that a more conservative cut-off of ≤.07 be applied. On the other hand, the criteria suggested for the RMSEA, NNFI, and CFI may be too stringent. It is difficult to be definitive in this regard, but, given SRMR≤.07, a model might be satisfactory with RMSEA≤.07, NNFI≥.92, and CFI≥.93 (cf. Marsh et al. 2004). But more research is needed to establish this. Note also that one might discount the NNFI and RMSEA when models are by design or necessity complex, such as happens with construct validity investigations with method factors. Last, when data are not normally distributed, and the χ2 statistic and standard errors of parameter estimates are detrimentally affected, it may be fruitful to apply the Satorra-Bentler scaled χ2 (Satorra and Bentler 1994), which is now provided routinely in many SEM programs.

Two other goodness-of-fit indexes we wish to mention are analogous to explained variance in regression models. The GFI (goodness-of-fit index) and AGFI (adjusted goodness-of-fit index) range from 0.00 to 1.00 inclusive. No commonly accepted cut-off criteria have been proposed for the GFI and AGFI, both indexes are dependent on sample size, and simulations show that both do not perform as well as the RMSEA, NNFI, CFI, and SRMR. Hence, the received view at this time is to rely on the χ2 statistic and the RMSEA, NNFI, CFI, and SRMR.

Sample size issues

Some disagreement exists with respect to recommended sample sizes for SEMs. One researcher goes so far as to assert that “SEM analyses based upon samples less than 200 should simply be rejected outright for publication unless the population from which a sample is hypothesized to be drawn is itself small or restricted in size” (Barrett 2007, p. 820). Another researcher believes that we should “shoot for a sample size of at least 50” and that rules of thumb that sample sizes should be greater than 200 are “surely simplistic” (Iacobucci 2010, p. 92). Pressed, we would have to say that rarely (e.g., in a factor analysis of a small number of items with “well-behaved data”) would a sample size below 100 or so be meaningful, and that one should endeavor to achieve a sample size above 100, preferably above 200. However, exclusive focus on sample size may miss the point. Other issues are often more important under certain circumstances. The critical issue is the distributional properties of measures in SEMs, because the frequently used maximum likelihood (ML) estimation procedure requires multivariate normality. Kurtosis is especially a problem in this regard, but the ML estimation technique has been shown to be robust to departures from normality. So even with a relatively small sample size, the ML procedure may be satisfactory if the distributional properties of measures are satisfactory or are not too far out of range.

A frequently promoted rule of thumb concerns the minimum recommended ratio of sample size to number of parameters to be estimated in an SEM. Bentler and Chou (1987, p. 91) state that the ratio “may be able to go as low as 5:1 under normal and elliptical theory, especially when there are many indicators of latent variables and the associated factor loadings are large,” but they also believe that “a ratio of at least 10:1 may be more appropriate for arbitrary distributions.” This conservative advice is well taken, but we have found in practice that satisfactory models have been obtained with ratios near 3:1, even close to 2:1 on occasion. Again, the distributional properties of measures are important, not sample size or ratios of sample size to free parameters, per se.

A final implication of sample size is the following. Because the χ2-test is proportional to sample size, there is a danger that too small a sample size may lead to a decision to accept an invalid model; similarly, there is a risk that a large sample size may prompt one to reject a model that, in fact, is a reasonable one. But there is nothing mysterious about these outcomes. Larger sample sizes also make it more likely that smaller parameter estimates will become significant and might lead one to accept a path that is trivially relevant.

Multiple group analyses

A useful and powerful aspect of SEMs is the test of hypotheses across samples. For example, one might desire to test whether men and women differ in their attitudes, or
whether job satisfaction in one organization is higher than job satisfaction in another organization. Another reason to compare samples is to test whether the effect of one latent variable on another depends on group membership or some measurable differences between people, institutions, countries, etc. This is an example of an interaction effect, which will be discussed in the next subsection of the article.

A common objective might be to ascertain the following for sets of indicators of multiple factors:

1. Does the same factor pattern or structure hold for two or more groups?
2. Given similar structures, are the factor loadings the same or invariant across the groups?
3. Are the measures of a factor(s) equally reliable across groups?
4. Do the factors covary to the same extent across groups?
5. Are the means of factors different across groups?

One could also test whether error variances are the same across groups, but this often does not hold or is not of much interest.

To test whether factor patterns are identical, one specifies the same model in each group under scrutiny and, with input command files stacked after each other, determines whether the model fits well for all groups.

To test whether all factor loadings are the same across groups, one constrains factor loadings to be equal across groups. The resulting χ2 test is compared to the χ2 for invariant factor patterns. A nonsignificant χ2 difference (χ2d) with degrees of freedom equal to the difference between degrees of freedom from the two models suggests that all factor loadings are invariant. Equality of factor loadings for multiple groups supports the hypothesis that the correspondence between factors and their indicators is the same for the groups. Sometimes the test of invariance of factor loadings is rejected, and one or more factor loadings differ across groups. When this happens, it is necessary to test factor loading invariance for one loading at a time, until the loading(s) leading to the rejection of the equality of all loadings is (are) found.

Equal reliability for a set of indicators of a factor can also be ascertained. This is demonstrated when both the factor loadings and the variance of the factor are shown to be invariant across groups.

To test whether a covariance between two factors is the same across groups, one must first show that the factor pattern and at least one factor loading per factor (ideally, all factor loadings) are invariant. Given invariance in this sense, one then constrains the covariance in question to be equal across groups and compares the resulting χ2. A significant χ2d test with one degree of freedom provides evidence that the covariance is statistically different in magnitude across the groups.

To test whether a path from one latent variable to another differs across groups, it is necessary first to show that at least one factor loading per factor (preferably all) is invariant for the factors under study. Why? If the factor loadings were found to differ, then any difference discovered for the path coefficient could not be interpreted unambiguously as a true difference. In such a case, the difference could be due to differential reliability of measures of the factors, a true difference in the path coefficient, or both differential reliability and a true difference. So we need to establish factor loading invariance before testing for and interpreting differences in path coefficients.

Finally, it is possible to test for the means of factors across groups (sometimes termed structured means). To test for the equality of factor means, given that the equalities of factor loadings and intercepts are first established, one can examine whether the factor means of one group differ from another by fixing the means of factors in one group to zero as a baseline and estimating the factor means in the other groups. Estimates of mean differences and their standard errors are used for this purpose.

In sum, multiple group analyses open up a number of interesting hypotheses for study. They can be used to test for equality of factor patterns, loadings, factor variances, factor covariances, and error variances across groups. They are useful to verify whether reliabilities of measures of a factor are similar for different groups. They can be used to test for the equality of path coefficients across groups. They are helpful in testing whether the means of factors differ across groups.

Moderating variables

Most of the models discussed in the SEM literature deal with linear models. But forms of interactions can be accommodated in SEMs, albeit with some difficulties and strong assumptions. One model allowing for nonlinearities in the form of construct (trait) by method interactions, for tests of construct validity, is termed the direct product model (e.g., Bechger and Maris 2004; Wothke and Browne 1990). For examples, see Bagozzi and Yi (1990, 1992) and Bagozzi et al. (1999).

Multiple group analyses can be used to investigate interactions between a latent variable and another variable, as noted earlier. A comparison of regression parameters across men and women might support an interaction between gender and the independent variable, if the path coefficient is significantly different across gender. The multiple group approach to tests of moderation has the disadvantage of not taking into account all information available in the data. If no significant interaction is found, then we cannot rule out the possibility that this was due to a
If no significant interaction is found, then we cannot rule out the possibility that this was due to a failure to take into account enough information. On the other hand, when significant differences are found for regression parameters across groups, one might conclude that there is some merit for claiming an interaction effect.

It is possible to more formally test for interaction effects with SEMs. For technical discussions, see Jöreskog and Yang (1996), Marsh et al. (2004), and Mooyaart and Bentler (2010). The only substantive example that we are aware of in the literature demonstrating an interaction effect between two latent variables is in Bagozzi et al. (2004). Jöreskog and Yang (1996) caution that a strong theory is needed for specifying an interaction, very large samples are needed to test hypotheses by use of the recommended weighted least squares procedures (e.g., based on the augmented moment matrix), and quite complex constraints must be posed when specifying models. If one does not have a sufficiently large sample, then the use of maximum likelihood methods will not give correct χ2-tests and standard errors.

Finally, SEMs can be used in experimental designs to test for interactions, analogous to ANOVA and MANOVA. Somewhat larger cell sizes are needed to do this than with traditional designs, but when feasible such a practice offers the advantages of taking into account measurement error, incorporating manipulation checks formally into the analyses if desired, and testing for mediation or step-down effects. Some discussions of the use of SEMs in experimental designs can be found in Bagozzi (1977), Kühnel (1988), Bagozzi and Yi (1989), Bagozzi et al. (1991b), Russell et al. (1998), and Curran and Hussong (2002).

Non-recursive and longitudinal models

A recursive model is one where no feedback or reciprocal causation occurs between latent variables. All the models we have considered up to this point are recursive models.

Reciprocal causation (simultaneity) is modeled occasionally in the literature. Figure 8a presents the simplest identified model for reciprocal causation; one can add more exogenous or endogenous variables to this model and still identify the paths between η1 and η2. A problem with reciprocal causation models is that it is difficult to justify mutual causality between η1 and η2 when these are measured at the same point in time. At least to the extent that temporal priority of cause to effect is a defining quality of causation, such a model may not be defensible, although it can be estimated and tested with SEM procedures. Causality is regarded fundamentally to be a recursive phenomenon under most conceptualizations in philosophy. Some researchers nevertheless defend reciprocal causality under the assumption that the processes implied are in some sense in equilibrium. However, the assumption of equilibrium is difficult to specify in practice.

It may be that two phenomena cause each other but do so over time. Figure 8b shows a cross-lagged panel model where one event or state of affairs, ξ1 and η1, measured at two points in time, is thought to cause or be caused by another event or state of affairs, ξ2 and η2, also measured at two points in time. That is, the cross-lagged panel model can be used to ascertain whether ξ1 causes η2, ξ2 causes η1, both ξ1 causes η2 and ξ2 causes η1, or no causality is supported. The paths from ξ1 to η1 and from ξ2 to η2 are included to capture the stability of the two phenomena over times 1 and 2. The cross-lagged panel model can be unstable and produce counter-intuitive parameter estimates, analogous to those found when multicollinearity is a problem, and especially so when little change in phenomena happens over the period of time under study. Be that as it may, the cross-lagged panel model usually provides more valid bases for supporting causality than those afforded by cross-sectional studies.

More generally, the strategic use of time in longitudinal studies can reduce some threats to validity inherent in cross-sectional studies. The more points in time one specifies separating variables proposed to be causally linked, the greater the confidence that regression parameters will support causation. Also, use of multiple methods and modeling of method bias, in conjunction with longitudinal designs, can increase the evidence for causality. Figure 8c presents a simple longitudinal model for testing the theory of planned behavior, where measurements are taken at three points in time. This model could be improved further by obtaining objective measures of behavior, as well as by employing multiple methods to obtain data.

Two topics related to longitudinal models are latent growth curve models and multilevel models. These topics are beyond the scope of this article. The reader is referred to Duncan et al. (2006) and Heck and Thomas (2009).

[Fig. 8 Nonrecursive and longitudinal models. Panel A: reciprocal causation (simultaneity) between η1 and η2. Panel B: cross-lagged panel model, with ξ1 and ξ2 measured at Time 1 and η1 and η2 at Time 2. Panel C: longitudinal model of the theory of planned behavior, with attitude, subjective norm, and perceived behavioral control at Time 1, intention at Time 2, and behavior at Time 3.]

Data screening issues

The last topic we consider deals with practices that actually should be performed before estimating parameters on data and testing hypotheses. A basic step is to make sure no coding or recording errors have been made. Another check is inspection for outliers. Bollen (1989, pp. 24–32) provides procedures for detecting outliers, but of course such a procedure should be used with caution, and exclusion of any outliers should be guided by sound scientific standards. See also Yuan and Bentler (1998, 2007).

An important issue, as indicated earlier, is the distribution of measures and whether the assumption of normality is warranted. Many SEM programs provide convenient ways for examining univariate and multivariate normality. Kurtosis may especially be a concern; kurtosis for normality is three. Skewness for normality should be zero. A number of transformations can be used to improve the distribution of measures: measures based on proportions should be transformed by the arcsine transformation; the log or square root transformation can be used to correct positively skewed data; and an inverse transformation can be employed for negatively skewed data.

Missing data also pose problems for model estimation and hypothesis testing. Savalei and Bentler (2006, p. 355) suggest that missing data below about 5% might be ignored by use of listwise deletion when computing correlation or covariance matrices. For missing data to a greater extent, many remedies have been suggested. But such old standbys as mean substitution, regression estimation, pairwise computations, and case-wise deletion can lead to biased χ2 statistics, parameter estimates, and standard errors (Bentler 2010, p. 218). Two improved procedures for handling missing data are case-wise maximum likelihood (Jamshidian and Bentler 1999) and an approach based on missing at random (Savalei and Bentler 2009). However, any procedure that is unable to take into account unknown systematic reasons for missing data, whether missing partially due to relationships with other variables in the data or due to method or other bias, should be used with caution. Special care should be taken in questionnaire design, data collection, sampling, motivation of respondents, and other factors. Sometimes one has no other choice but to eliminate suspect missing data.

Conclusion

We have attempted to provide a thorough introduction to SEMs for the user of such techniques in substantive research and the general reader of substantive research employing SEMs. There are many topics that we could not deal with in depth and were forced to mention just briefly. In addition, there are many complex statistical issues with SEMs that are beyond the scope of this article. But we hope that the introduction herein will provide the reader with a sound foundation and motivation for further study, and in its course empower readers to make important discoveries in, and contributions to, whatever field they work in.
References

Arbuckle, J. L. (2009). AMOS 18 user's guide. Crawfordville: Amos Development Corporation.
Audi, R. (Ed.). (1995). The Cambridge dictionary of philosophy. Cambridge: Cambridge University Press.
Bagozzi, R. P. (1977). Structural equation models in experimental research. Journal of Marketing Research, 14, 209–226.
Bagozzi, R. P. (1980). Causal models in marketing. New York: Wiley.
Bagozzi, R. P. (1984). A prospectus for theory construction in marketing. Journal of Marketing, 48, 11–29.
Bagozzi, R. P. (1994). The effects of arousal on the organization of positive and negative affect and cognitions: application to attitude theory. Structural Equation Modeling, 1, 222–252.
Bagozzi, R. P. (1996). The role of arousal in the creation and control of the halo effect in attitude models. Psychology and Marketing, 13, 235–264.
Bagozzi, R. P. (2007). On the meaning of formative measurement and how it differs from reflective measurement: comment on Howell, Breivik, and Wilcox. Psychological Methods, 12, 229–237.
Bagozzi, R. P. (2010). Structural equation models are modeling tools with many ambiguities: comments acknowledging the need for caution and humility in their use. Journal of Consumer Psychology, 20, 208–214.
Bagozzi, R. P. (2011a). Measurement and meaning in information systems and organizational research: methodological and philosophical foundations. MIS Quarterly, 35, 261–292.
Bagozzi, R. P. (2011b). Alternative perspectives in philosophy of mind and their relationship to structural equation models in psychology. Psychological Inquiry, 22, 88–99.
Bagozzi, R. P., & Baumgartner, H. (1994). The evaluation of structural equation models and hypothesis testing. In R. P. Bagozzi (Ed.), Basic principles of marketing research (pp. 386–422). Oxford: Blackwell.
Bagozzi, R. P., & Dholakia, U. M. (2006). Antecedents and purchase consequences of customer participation in small group brand communities. International Journal of Research in Marketing, 23, 45–61.
Bagozzi, R. P., & Edwards, J. R. (1998). A general approach for representing constructs in organizational research. Organizational Research Methods, 1, 45–87.
Bagozzi, R. P., & Heatherton, T. F. (1994). A general approach to representing multifaceted personality constructs: application to state self-esteem. Structural Equation Modeling, 1, 35–67.
Bagozzi, R. P., & Phillips, L. (1982). Representing and testing organizational theories: a holistic construal. Administrative Science Quarterly, 27, 459–489.
Bagozzi, R. P., & Yi, Y. (1988). On the evaluation of structural equation models. Journal of the Academy of Marketing Science, 16, 74–94.
Bagozzi, R. P., & Yi, Y. (1989). On the use of structural equation models in experimental designs. Journal of Marketing Research, 26, 271–284.
Bagozzi, R. P., & Yi, Y. (1990). Assessing method variance in multitrait-multimethod matrices: the case of self-reported affect and perceptions at work. Journal of Applied Psychology, 75, 547–560.
Bagozzi, R. P., & Yi, Y. (1992). Testing hypotheses about methods, traits, and communalities in the direct product model. Applied Psychological Measurement, 16, 373–380.
Bagozzi, R. P., Fornell, C., & Larcker, D. (1981). Canonical correlation analysis as a special case of a structural relations model. Multivariate Behavioral Research, 16, 437–454.
Bagozzi, R. P., Yi, Y., & Phillips, L. (1991a). Assessing construct validity in organizational research. Administrative Science Quarterly, 36, 421–458.
Bagozzi, R. P., Yi, Y., & Singh, S. (1991b). On the use of structural equation models in experimental designs: two extensions. International Journal of Research in Marketing, 8, 125–140.
Bagozzi, R. P., Yi, Y., & Nassen, K. (1999). Representation of measurement error in marketing variables: review of approaches and extension to three facet designs. Journal of Econometrics, 89, 393–421.
Bagozzi, R. P., Lee, K., & Van Loo, M. F. (2001). Decisions to donate bone marrow: the role of attitudes and subjective norms across cultures. Psychology and Health, 16, 29–56.
Bagozzi, R. P., Moore, D. J., & Leone, L. (2004). Self-control and the self-regulation of dieting decisions: the role of prefactual attitudes, subjective norms, and resistance to temptation. Basic and Applied Social Psychology, 26, 199–213.
Bagozzi, R. P., Bergami, M., Marzocchi, G. L., & Morandin, G. (2011). Customer-organization relationships: development and test of a theory of extended identities. Journal of Applied Psychology, in press.
Barrett, P. (2007). Structural equation modeling: adjusting model fit. Personality and Individual Differences, 42, 815–824.
Bechger, T. M., & Maris, G. (2004). Structural equation modeling of multiple facet data: extending models for multitrait-multimethod data. Psicologica, 25, 253–274.
Bentler, P. M. (2008). EQS 6: Structural equations program manual. Encino: Multivariate Software.
Bentler, P. M. (2010). SEM with simplicity and accuracy. Journal of Consumer Psychology, 20, 215–220.
Bentler, P. M., & Chou, C. P. (1987). Practical issues in structural equation modeling. Sociological Methods & Research, 16, 78–117.
Bentler, P. M., & Weeks, D. G. (1980). Linear structural equations with latent variables. Psychometrika, 45, 289–308.
Bergami, M., & Bagozzi, R. P. (2000). Self-categorization and commitment as distinct aspects of social identity in the organization: conceptualization, measurement, and relation to antecedents and consequences. British Journal of Social Psychology, 39, 555–557.
Blackburn, S. (1994). The Oxford dictionary of philosophy. Oxford: Oxford University Press.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Curran, P. J., & Hussong, A. M. (2002). Structural equation modeling of repeated measures data. In D. Moskowitz & S. Hershberger (Eds.), Modeling intraindividual variability with repeated measures data: Methods and applications (pp. 59–86). New York: Lawrence Erlbaum.
Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13, 319–340.
Davis, F. D., Bagozzi, R. P., & Warshaw, P. R. (1989). User acceptance of computer technology: a comparison of two theoretical models. Management Science, 35, 982–1003.
Diamantopoulos, A., & Siguaw, J. A. (2006). Formative versus reflective indicators in organizational measure development: a comparison and empirical illustration. British Journal of Management, 17, 263–282.
Diamantopoulos, A., & Winklhofer, H. M. (2001). Index construction with formative indicators: an alternative to scale development. Journal of Marketing Research, 38, 269–277.
Duncan, T. E., Duncan, S., & Strycker, L. A. (2006). An introduction to latent variable growth curve modeling. Mahwah: Lawrence Erlbaum.
Edwards, J. R., & Bagozzi, R. P. (2000). On the nature and direction of relationships between constructs and measures. Psychological Methods, 5, 155–174.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indices to misspecified structural or measurement model components: rationale of two-index strategy revisited. Structural Equation Modeling, 12, 343–367.
Gefen, D., Straub, D. W., & Boudreau, M. (2000). Structural equation modeling and regression: guidelines for research practice. Communications of the Association for Information Systems, 4, 1–78.
Heck, R. H., & Thomas, S. L. (2009). An introduction to multilevel modeling techniques. New York: Routledge.
Howell, R. D., Breivik, E., & Wilcox, J. B. (2007a). Reconsidering formative measurement. Psychological Methods, 12, 205–218.
Howell, R. D., Breivik, E., & Wilcox, J. B. (2007b). Is formative measurement really measurement: reply to Bollen (2007) and Bagozzi (2007). Psychological Methods, 12, 238–245.
Hu, L., & Bentler, P. M. (1998). Fit indexes in covariance structure modeling: sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
Iacobucci, D. (2009). Everything you always wanted to know about SEM (structural equation modeling) but were afraid to ask. Journal of Consumer Psychology, 19, 673–680.
Iacobucci, D. (2010). Structural equation modeling: fit indices, sample size, and advanced topics. Journal of Consumer Psychology, 20, 90–98.
Jamshidian, M., & Bentler, P. M. (1999). ML estimation of mean and covariance structures with missing data using complete data routines. Journal of Educational and Behavioral Statistics, 24, 21–41.
Jarvis, C. B., MacKenzie, S. B., & Podsakoff, P. M. (2003). A critical review of construct indicators and measurement model misspecification in marketing and consumer research. Journal of Consumer Research, 30, 199–219.
Jöreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70, 631–639.
Jöreskog, K. G., & Sörbom, D. (1996a). LISREL 8: User's reference guide. Chicago: Scientific Software International.
Jöreskog, K. G., & Sörbom, D. (1996b). LISREL 8: Structural equation modeling with the SIMPLIS command language. Chicago: Scientific Software International.
Jöreskog, K. G., & Yang, F. (1996). Nonlinear structural equation models: The Kenny-Judd model with interaction effects. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 57–88). Mahwah: Lawrence Erlbaum.
Kühnel, S. M. (1988). Testing MANOVA designs with LISREL. Sociological Methods & Research, 16, 504–523.
Little, T. D., Cunningham, W. A., Shahar, G., & Widaman, K. F. (2002). To parcel or not to parcel: exploring the question, weighing the merits. Structural Equation Modeling, 9, 151–173.
Lord, F., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading: Addison-Wesley.
Markland, D. (2007). The golden rule is that there are no golden rules: a commentary on Paul Barrett's recommendations for reporting model fit in structural equation modeling. Personality and Individual Differences, 42, 851–858.
Marsh, H. W., Hau, K., & Wen, J. (2004). In search of golden rules: comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11, 320–341.
Marsh, H. W., Wen, J., & Hau, K. (2004). Structural equation models of latent interactions: evaluation of alternative estimation strategies and indicator construction. Psychological Methods, 9, 275–300.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah: Lawrence Erlbaum.
McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7, 64–82.
Mooyaart, A., & Bentler, P. M. (2010). An alternative approach for nonlinear latent variable models. Structural Equation Modeling, 17, 357–373.
Muthén, L. K., & Muthén, B. O. (2010). Mplus: Statistical analysis with latent variables user's guide. Los Angeles: Muthén and Muthén.
Petter, S., Straub, D., & Rai, A. (2007). Specifying formative constructs in information systems research. MIS Quarterly, 31, 623–656.
Podsakoff, P., MacKenzie, S. B., Podsakoff, N. P., & Lee, J. (2003). The mismeasure of man(agement) and its implications for leadership research. The Leadership Quarterly, 14, 615–656.
Russell, D. W., Kahn, J. H., Spoth, R., & Altmaier, E. M. (1998). Analyzing data from experimental studies: illustration of a latent variable structural equation modeling approach. Journal of Counseling Psychology, 45, 18–29.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variable analysis: Applications to development research (pp. 399–419). Newbury Park: Sage.
Savalei, V., & Bentler, P. M. (2006). Structural equation modeling. In R. Grover & M. Vriens (Eds.), The handbook of marketing research: Uses, misuses, and future advances (pp. 330–364). Newbury Park: Sage.
Savalei, V., & Bentler, P. M. (2009). A two-stage approach to missing data: theory and application to auxiliary variables. Structural Equation Modeling, 16, 477–497.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Von Wright, G. H. (1974). Causality and determinism. New York: Columbia University Press.
Wothke, W., & Browne, M. W. (1990). The direct product model for the MTMM matrix parameterized as a second order factor analysis model. Psychometrika, 55, 255–262.
Yuan, K., & Bentler, P. M. (1998). Robust means and covariance structure analysis. British Journal of Mathematical and Statistical Psychology, 51, 63–88.
Yuan, K., & Bentler, P. M. (2007). Robust procedures in structural equation models. In S. Lee (Ed.), Handbook of latent variable and related models (pp. 367–397). Amsterdam: North-Holland.
