

CHAPTER 27

The Importance of Being Valid


Reliability and the Process of Construct Validation

Oliver P. John
Christopher J. Soto

The truth is rarely pure and never simple. Modern life would be very
tedious if it were either.
-OSCAR WILDE, The Importance of Being Earnest

The empirical sciences all design measurement procedures to obtain accurate information about objects, individuals, or groups. When astronomers measure the behavior of planets and stars, when molecular biologists study the expression of hormones in cells, when medical doctors assess how much amniotic fluid is left late in a pregnancy, they all worry whether their measurement procedures provide the kind of information they are looking for—that is, is the information generally applicable and does it capture the phenomenon they are interested in? Across all these disciplines, reliability and validity are fundamental concepts when researchers evaluate how well their measures work.

Consider the amniotic fluid test doctors use to determine whether a fetus may be at risk late in a pregnancy. Exact measurement, as in some of the physical sciences, is not possible here; the doctors can hardly pump out all the amniotic fluid that is left in the mother's amniotic sack to measure its volume exactly (which would indeed put the fetus at risk). Instead, medical researchers have developed an ingenious multistep procedure, using two-dimensional ultrasound images to estimate the three-dimensional volume of fluid present in the uterus. The test is administered by a technician, who produces a single number, which doctors then rely on to make critical treatment decisions. However, as doctors well know (but most patients do not), these numbers are hardly perfect. Two different technicians assessing the same woman do not always come up with the same number; the same technician assessing the same group of women on 2 subsequent days may come up with different results, and in some cases a low amniotic fluid score is obtained even though other measures confirm that the fetus is not at risk. In measurement terms, doctors worry about the degree to which their tests show interjudge agreement, retest reliability, and predictive validity, respectively,

and expectant parents (and patients more generally) should be equally worried about these issues. As this example shows, having a keen understanding of issues related to reliability and validity may be important to your own health or that of your loved ones, and this chapter is intended to provide both historical and contemporary ideas about these key measurement issues. In addition, this chapter also provides answers to questions commonly asked by students of personality psychology, such as whether an alpha reliability of .50 is high enough, why we should care whether a scale is unidimensional, and what one should do to show that a measure is valid.

Some Basic Considerations in Evaluating Measurement Procedures

These questions all illustrate the fundamental concern of empirical science with generalizability, that is, the degree to which we can make inferences from our measurements or observations in regard to other samples, items, measures, methods, outcomes, and so on (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; see also Brennan, 2001). If we cannot make such generalizations, our measurements are obviously much less useful than if we can provide explicit evidence for generalizability. Good measurement implies not only that we can reproduce or replicate a certain score, but that we can trust that the measurement has a particular meaning—we want to be able to make inferences about other variables that interest us. In the amniotic fluid test example, the volumetric measurements would be useless if they failed to help doctors predict which babies are at risk and should be delivered soon. Another basic idea is that all measures—self-reports, observer ratings, even physiological measures—are prone to errors and that we cannot simply assume that a single measurement will generalize. Any one measurement may be distorted by numerous sources of error (e.g., the medical technician may have made a human error, or the position of the fetus and the umbilical cord may have been unusual, etc.), and the resulting observation (or score) is therefore related only imperfectly to what we want to measure, namely, the risk to the baby. To counteract this limitation of single measurements, psychologists aim to obtain multiple measurements (e.g., across different stimuli, experimenters, or observers) and then aggregate them into a more generalizable composite score.

Personality Measurement: Formulating and Evaluating Models in the Psychometric Tradition

What is measurement and how may it be defined? Early on, Stevens (1951) suggested that measurement is the assignment of numbers to objects or events according to rules. More recently, Dawes and Smith (1985) and others have argued that measurement is best understood as the process of building models that represent phenomena of interest, typically in quantitative form. The raw data in personality research initially exist only in the form of minute events that constitute the ongoing behavior and experience of individuals. Judd and McClelland (1998, pp. 3-4) suggest that

measurement is the process by which these infinitely varied observations are reduced to compact descriptions or models that are presumed to represent meaningful regularities in the entities that are observed ... Accordingly, measurement consists of rules that assign scale or variable values to entities to represent the constructs that are thought to be theoretically meaningful. (emphasis added)

Like most models, measurement models (e.g., tests or scales) have to be reductions or simplifications to be useful. Although they should represent the best possible approximation of the phenomena of interest, we must expect them, like all "working models," to be eventually proven wrong and to be superseded by better models. For this reason, measurement models must be specified explicitly so that they can be evaluated, disconfirmed, and improved. Moreover, we should not ask whether a particular model is true or correct; instead, we should build several plausible alternative models and ask, Given everything we know, which models can we rule out and which model is currently the best at representing our data? Or, even more clearly, which model is the least wrong? This kind of comparative model testing (e.g., Judd, McClelland, & Culhane, 1995) is the best strategy for evaluating and improving our measurement procedures.

Organization of This Chapter

This chapter is organized into three major parts. We begin with historically early conceptions of reliability, then move on to increasingly complex views that emphasize the construct validation process, and finally consider model testing as an integrative approach. Specifically, we first consider issues traditionally discussed under the heading of reliability, review still-persistent "types" of reliability coefficients, then suggest generalizability theory as a broader perspective, and finally discuss in some detail the problems and misuses of coefficient alpha, the most commonly used psychometric index in personality psychology. Second, we discuss five kinds of evidence that are commonly sought in the process of construct validation, which we view as the most crucial issue in psychological measurement. In the third part, we consider model testing in the context of construct validation; following a brief introduction to measurement models in structural equation modeling (SEM), we discuss an empirical example that presents the issue of dimensionality as an aspect of structural validity.

From a Focus on "Reliability Coefficients" to Generalizability Theory

As our introductory examples illustrate, most measurement procedures in psychology and other empirical disciplines are subject to "error." In personality psychology, the observations, ratings, or judgments that constitute the measurement procedure are typically made by humans who are subject to a wide range of frailties. Research participants may become careless or inattentive, bored or fatigued, and may not always be motivated to do their best. The particular conditions and point in time when ratings are made or recorded may also contribute error. Further errors may be introduced by the rating or recording forms given to the raters; the instructions, definitions, and questions on these forms may be difficult to understand or require complex discriminations, again entering error into the measurement. These various characteristics of the participant, the testing situation, the test or instrument, and the experimenter can all introduce error and thus affect what has traditionally been called reliability. Reliability refers to the consistency of a measurement procedure, and indices of reliability all describe the extent to which the scores produced by the measurement procedure are reproducible.

Reliability in Classical Test Theory

Issues of reliability have traditionally been treated within the framework of classical test theory (Gulliksen, 1950; Lord & Novick, 1968). If a given measurement X is subject to error e, then the measurement without the error, X - e, would represent the accurate or "true" measurement T. This seemingly simple formulation, that each observed measurement X can be partitioned into a true score T and measurement error e, is the fundamental assumption of classical test theory. Conceptually, each true score represents the mean of a very large number of measurements of a specific individual, whereas measurement error represents all of the momentary variations in the circumstances of measurement that are unrelated to the measurement procedure itself. Such errors are assumed to be random (a rather strong assumption, to which we return later), and it is this assumption that permits the definition of error in statistical terms.

Conceptions of reliability all involve the notion of repeated measurements, such as over time or across multiple items, observers, or raters. Classical test theory has relied heavily on the notion of parallel tests—that is, two tests that have the same mean, variance, and distributional characteristics and correlate equally with external variables (Lord & Novick, 1968). Under these assumptions, true score and measurement error can be treated as independent. It follows that the variance of the observed scores equals the sum of the variance of the true scores and the variance of the measurement error:

    \sigma^2_X = \sigma^2_T + \sigma^2_e

Reliability can then be defined as the ratio of the true-score variance to the observed-score variance, which is equivalent to 1 minus the ratio of error variance to observed-score variance:

    r_{XX} = \sigma^2_T / \sigma^2_X = 1 - \sigma^2_e / \sigma^2_X

If there is no error, then the ratio of true-score variance to total variance (and hence reliability) would be 1; if there is only error and no true-score variance, then this ratio (and hence reliability) would be 0.
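This decomposition is easy to verify numerically. The following minimal simulation is our illustration, not part of the original chapter (here and in the sketches below, Python with NumPy is assumed); it draws independent true scores and errors and recovers the reliability ratio empirically:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000                       # simulated test takers
    true = rng.normal(0.0, 1.0, n)    # true scores T (variance 1.0)
    error = rng.normal(0.0, 0.5, n)   # random error e (variance .25), independent of T
    observed = true + error           # X = T + e

    var_t, var_e, var_x = true.var(), error.var(), observed.var()
    print(var_x, var_t + var_e)       # observed variance ~ true + error variance
    print(var_t / var_x)              # reliability ~ 1.0 / 1.25 = .80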
Costs of Low Reliability, and Correcting Observed Correlations for Attenuation Due to Low Reliability

Classical test theory (Lord & Novick, 1968) suggests that researchers ought to work hard to attain high reliabilities because the reliability of a measure constrains how strongly that measure may correlate with another variable (e.g., an external criterion). If error is truly random, as classical test theory assumes, the upper limit of the correlation for a measure is not 1.0 but the square root of its reliability (i.e., the correlation of the measure with itself). Thus, the true correlation between the measure and another variable may be underestimated (i.e., attenuated) when reliability is inadequate. In other words, low reliability comes at a cost.

Students sometimes ask questions like "My scale has a reliability of .70—isn't that good enough?" and are frustrated when the answer is, "That depends." Although it would be quite convenient to have a simple cookbook for measurement decisions, there is no minimum or optimum reliability that is necessary, adequate, or even desirable in all contexts. Over the years a convention seems to have evolved, often credited to Nunnally (1978), that regards "reliabilities of .7 or higher" (p. 245) as sufficient. However, a reliability of .70 is not a benchmark every measure must pass. In the words of Pedhazur and Schmelkin (1991),

    Does a .5 reliability coefficient stink? To answer this question, no authoritative source will do. Rather, it is for the user to determine what amount of error variance he or she is willing to tolerate, given the specific circumstances of the study. (p. 110, emphasis in original)

We wondered, then, is there something useful in the widely shared view of .70 as the "sweet spot" of reliability? To find out, we examined the relative costliness of various levels of reliability, as presented in Table 27.1. We derived the numbers in Table 27.1 by rewriting the formula traditionally used to correct observed correlations for attenuation due to unreliability (Cohen, Cohen, West, & Aiken, 2003; Lord & Novick, 1968) and solving it for the observed correlations. This equation estimates the expected observed correlation (r_XY), given the true correlation between the constructs measured by X and Y (rho_XY) and the geometric average of the two measures' reliabilities (r_XX, r_YY):

    r_{XY} = \rho_{XY} \sqrt{r_{XX} r_{YY}}

TABLE 27.1. The Cost of Low, Medium, and High Reliability: Observed Correlations as a Function of True Correlations and Three Levels of Reliability

True           Mean reliability of X and Y
correlation     .50     .70     .90
.70             .35     .49     .63
.60             .30     .42     .54
.50             .25     .35     .45
.40             .20     .28     .36
.30             .15     .21     .27
.20             .10     .14     .18
.10             .05     .07     .09

Note. Table entries are observed correlations estimated via the equation for the correction of attenuation due to unreliability (see text).

As shown in Table 27.1, if X and Y both have a high reliability of .90 (or an average reliability of .90), then the losses are modest. For example, a true correlation of .70 (an unusually large effect size) would result in an observed correlation of .63, and a true correlation of .30 (a common effect size) would still result in an observed correlation of .27. In short, the loss due to unreliability would be quite modest, with observed correlations being 90% of the true correlations—that is, only 10% lower. At an average reliability of .70 (a common situation in personality research), the losses would be more pronounced: A true correlation of .70 would be reduced to .49, and a true correlation of .30 to .21, with observed correlations being only 70% of the true correlations—a loss of 30%. At an average reliability of .50, the losses would be drastic: A true correlation of .70 would become a mere .35, and a true correlation of .30 would be reduced to .15—a 50% loss.
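Table 27.1 can be reproduced directly from this equation. A small sketch (ours, for illustration; the function name is our own):

    def observed_r(rho, r_xx, r_yy):
        """Expected observed correlation, attenuated by the two reliabilities."""
        return rho * (r_xx * r_yy) ** 0.5

    for rho in (.70, .60, .50, .40, .30, .20, .10):
        row = ["%.2f" % observed_r(rho, rel, rel) for rel in (.50, .70, .90)]
        print(rho, row)   # reproduces the rows of Table 27.1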

It is easy to see that with small observed and true effect sizes typically in the .30 range, discovering real effects with unreliable measurements becomes increasingly difficult, making the costs of reliabilities in the .50 range prohibitive. For example, with a sample of 100 participants and true correlations in the .30 range, an average reliability of .70 is barely large enough to observe statistically significant correlations. If we assume that this scenario is quite common in the field, then the benchmark reliability of ".70 or above" makes some sense; certainly one would not want to accept reliabilities lower than .70 if that means being unable to detect expected correlations in the .30 range.

However, the costs of reliabilities lower than .70 can be at least partially offset. As Table 27.1 shows, if true-correlation sizes are large (or sample sizes are large), lower reliabilities are more easily tolerated because expected effects would still be detected at conventional significance levels. Nonetheless, it must be emphasized that even under these favorable conditions, the true effect sizes will be severely underestimated—a grave disadvantage, given that obtaining replicable estimates of the size of a correlation is now deemed much more important than its statistical significance in any one sample (see Fraley & Marks, Chapter 9, this volume).

Researchers sometimes use reliability indices to correct observed correlations between two measures for attenuation due to unreliability. The correction formula (Cohen et al., 2003; Lord & Novick, 1968) involves dividing the observed correlation by the square root of the product of the two reliabilities:

    \rho_{XY} = r_{XY} / \sqrt{r_{XX} r_{YY}}
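In code, the correction is a one-line inversion of the attenuation sketch above (again our illustration):

    def disattenuate(r_xy, r_xx, r_yy):
        """Estimated true correlation between the constructs underlying X and Y,
        given the observed correlation and the two reliabilities."""
        return r_xy / (r_xx * r_yy) ** 0.5

    # An observed r of .35 between two measures with reliabilities of .50
    # is consistent with a true correlation of .70 (cf. Table 27.1):
    print(disattenuate(.35, .50, .50))   # -> 0.7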
This correction expresses the size of the association relative to the maximum correlation attainable, given the imperfect reliabilities of the two measures. This kind of correction is sometimes used to estimate the true correlation between the latent constructs underlying the measures (see also the section on SEM below), thus indicating what the observed correlation would be if both constructs were assessed with perfect reliability. Correction for attenuation can also be useful when researchers want to compare effect sizes across variables or studies that use measures of varying reliabilities, as in meta-analyses (see Roberts, Kuncel, Viechtbauer, & Bogg, Chapter 36, this volume). Another application is in contexts where researchers want to distinguish the long-term stability of personality and attitudes from the reliability of measurement.

However, the ease with which this correction is made should not be seen as a license for sloppy measurement. In many situations, low reliability will create problems for estimating effect sizes, testing hypotheses, and estimating the parameters in structural models—problems that cannot be overcome by simply correcting for attenuation due to unreliability. This is especially true in multivariate applications, such as multitrait-multimethod matrices (discussed below), where unequal reliabilities might bias conclusions about convergent and discriminant validity (West & Finch, 1997). In general, then, researchers are well-advised to invest the time and effort needed to construct reliable measures and consult Table 27.1 to gauge the amount of measurement error that they are willing to tolerate, given the goals of their research.

Evidence for Reliability: Traditional Types of "Reliability Coefficients"

The three most common procedures to assess reliability are shown in Table 27.2: internal consistency (or split-half), retest (or stability), and interrater agreement designs. The American Psychological Association (APA) committee on psychological tests articulated these types of designs to clarify that "reliability is a generic term referring to many types of evidence" (American Psychological Association, 1954, p. 28). Clearly, the different study designs in Table 27.2 assess rather different sources of error. Internal consistency procedures offer an estimate of error associated with the particular selection of items; error is high (and internal consistency is low) when items are heterogeneous in content and lack content saturation and when respondents change how they respond to items designed to measure the same characteristic (e.g., owing to fatigue). Retest (or stability) designs estimate how much responses vary within individuals across time and situation, thus reflecting error due to differences in the situation and conditions of test administration or observation.¹ Interrater or interjudge agreement designs estimate how much scores vary across judges or observers (see von Eye & Mun, 2005), thus reflecting error due to disagreements between raters and to individual differences among raters in response styles, such as the way they scale their responses. It is important to note that, as Table 27.2 shows, there are several different reliability indices, and not all of them are based on correlations; therefore, different criteria for evaluating the reliability of a measure will be needed. Values approaching 1.0 are not expected for all reliability indices—for example, Cohen's kappa, which measures agreement among categorical judgments (see, e.g., von Eye & Mun, 2005).

TABLE 27.2. Traditional Reliability Coefficients: Study Design, Statistics, Sources of Error, and Facets of Generalizability

Internal consistency
  Study design: Measure participants at a single time across multiple items
  Reliability statistic: Cronbach's coefficient alpha (split-half correlations rarely used today)
  Major sources of error: Heterogeneous item content; participant fatigue
  Facet of generalizability: Items

Retest
  Study design: Measure the same participants across two or more occasions or times using the same set of items
  Reliability statistic: Correlation between participants' scores at the two times
  Major sources of error: Change in participants' responses; change in measurement situation
  Facet of generalizability: Occasions

Interrater (interjudge)
  Study design: Obtain ratings of a set of stimuli (individuals, video recordings, transcribed interviews) from multiple raters (e.g., observers, coders) at one time
  Reliability statistic: 1. Mean pairwise interrater agreement correlation (for reliability of a typical single rater); 2. Cronbach's coefficient alpha (for reliability of the mean rating); 3. Cohen's kappa (for agreement of categorical ratings)
  Major sources of error: Disagreement among raters; variation in raters' response styles
  Facet of generalizability: Raters

Moving beyond the Classical Conception of Reliability: Generalizability Theory

The distinctions among "types of reliability coefficients" had a number of unfortunate consequences. First, what had been intended as heuristic distinctions became reified as the stability coefficient or the alpha coefficient even though the notion of reliability was intended as a general concept. Second, the classification itself was too simple, equating particular kinds of reliability evidence with only one source of error and resulting in a restrictive terminology that cannot fully capture the broad range and combination of multiple error sources that are of interest in most research and measurement applications (e.g., Shavelson, Webb, & Rowley, 1989). For example, as we show in Table 27.2, retest reliability involves potential changes over time in both the research participants and the testing conditions (e.g., prior to vs. just after September 11, 2001).

Third, the types-of-reliability approach masked a major shortcoming of classical test theory: If all these measures were indeed parallel and all errors truly random, then all these approaches to reliability should yield the same answer. Unfortunately, they do not, because reliability depends on the particular facet of generalization being examined (Cronbach, Rajaratnam, & Gleser, 1963). For example, to address the need for superbrief scales of the Big Five trait domains for use in surveys and experimental contexts, researchers have recently constructed scales consisting of only two or four items each (Gosling, Rentfrow, & Swann, 2003; Rammstedt & John, 2006, 2007). These items were not chosen to be redundant in meaning but instead to represent the broad Big Five domains, as well as to balance scoring by including both true-scored and false-scored items. Not surprisingly, the resulting scales had very low internal consistency (alpha) reliabilities. Does this mean that these scales are generally unreliable? Not at all. They do show impressive reliability when other facets of generalizability are considered. For example, with less than one-quarter of the items of the full-length 44-item Big Five Inventory (BFI; see John & Srivastava, 1999), the BFI-10 scales can still represent the content of the full scales with an average part-whole correlation of .83; 6-week retest correlations average .75. In short, different types of reliability have conceptually distinct meanings that do not necessarily cohere.

Therefore, the American Psychological Association (e.g., 1985) recommended in subsequent editions of the Standards for Educational and Psychological Testing that these distinctions and terminology be abolished and replaced by the broader view advocated by generalizability theory (Cronbach et al., 1963). Regrettably, however, practice has not changed much over the years, and generalizability theory has not fully replaced these more simplistic notions. Note that the last column in Table 27.2 spells out the facet of generalizability that is being varied and studied in each of these generalizability designs.

Generalizability theory holds that we are interested in the "reliability" of an observation or measurement because we wish to generalize from this observation to some other class of observations. For example, Table 27.2 shows that a concern with interjudge reliability may actually be a concern with the question of how accurately we can generalize from one judge to another (pairwise agreement) or from a given set of judges to another set (generalizability of aggregated or total scores). Or we may want to know how well scores on an attitude scale constructed according to one set of procedures generalize to another scale constructed according to different procedures. Or we may want to test how generalizable is a scale originally developed in English to a Chinese language and cultural context.

All these facets of generalizability represent legitimate research concerns (see the later section on construct validation), and they can be studied systematically in generalizability designs, both individually and together. These designs allow the researcher to deliberately vary the facets that potentially influence observed scores and estimate the variance attributable to each facet (Cronbach et al., 1972). In other words, whereas classical test theory tries to estimate the portion of variance that is attributable to "error," generalizability theory aims to estimate the extent to which specific sources of variance contribute to test scores under carefully defined conditions. Thus, instead of the traditional reliability coefficients listed in Table 27.2, we should use more general estimates, such as intraclass correlation coefficients (Shrout & Fleiss, 1979), to probe particular aspects of the dependability of measures. For example, intraclass coefficients can be used to index the generalizability of one set of judges to a universe of similar judges (von Eye & Mun, 2005).
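As one concrete example of this variance-decomposition logic, the sketch below (ours, not the chapter's) computes the two intraclass correlations most relevant to interrater designs, ICC(2,1) for a single typical judge and ICC(2,k) for the mean of k judges, from the mean squares of a fully crossed targets x judges layout (Shrout & Fleiss, 1979):

    import numpy as np

    def icc_two_way(x):
        """ICC(2,1) and ICC(2,k) from an n-targets x k-judges ratings matrix,
        treating targets and judges as random (Shrout & Fleiss, 1979)."""
        n, k = x.shape
        grand = x.mean()
        ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # targets
        ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # judges
        ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
        msr = ss_rows / (n - 1)
        msc = ss_cols / (k - 1)
        mse = ss_err / ((n - 1) * (k - 1))
        single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
        mean_of_k = (msr - mse) / (msr + (msc - mse) / n)
        return single, mean_of_k   # one typical judge vs. the judge composite

    # Example: 50 targets rated by 4 judges who share a common signal
    rng = np.random.default_rng(1)
    signal = rng.normal(0, 1, (50, 1))
    ratings = signal + rng.normal(0, 1, (50, 4))   # judge-specific noise
    print(icc_two_way(ratings))   # single judge ~ .5, 4-judge mean ~ .8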
Generalizability theory should hold considerable appeal for psychologists because the extent to which we can generalize across items, instruments, contexts, groups, languages, and cultures is crucial to the claims we can make about our findings. Despite excellent and readable introductions (e.g., Brennan, 2001; Shavelson et al., 1989), generalizability theory is still not used as widely as it should be. A recent exception is the flourishing research on the determinants of consensus among personality raters (e.g., Kenny, 1994; see also John & Robins, 1993; Kashy & Kenny, 2000; Kwan, John, Kenny, Bond, & Robins, 2004).

Generalizability theory is especially useful when data are collected in nested designs and multiple facets may influence reliability, as illustrated by King and Figueredo's (1997) research on chimpanzee personality differences. The investigators collected ratings of chimpanzees, differing in age and sex (subject variables), on 40 traits (stimulus variables), at several different zoos (setting variables), from animal keepers familiar with the animals to varying degrees (observer variables). They then used a generalizability design to show how these facets affected agreement among the judges. It is unfortunate that generalizability theory, as well as Kenny's (1994) social relations model, have been perceived as "technical." With clear and accessible introductions available, it is high time that these important approaches to variance decomposition achieve greater popularity with a broader group of researchers.

Coefficient Alpha: Personality Psychology's Misunderstood Giant

Cronbach's (1951) coefficient alpha is an index of internal consistency that has become the default reliability index in personality research. Any recent issue of a personality journal will show that alpha is the index of choice when researchers want to claim that their measure is reliable. Often it is the only reliability evidence considered, contrary to the recommendations in the Standards (American Psychological Association, 1985).
We suspect that alpha has become so ubiquitous because it is easy to obtain and compute. Alpha does not require collecting data at two different times from the same subjects, as retest reliability does, or the construction of two alternate forms of a measure, as parallel-form reliability (now rarely used) would require. Alpha is a "least effort" reliability index—it can be used as long as the same subjects responded to multiple items thought to indicate the same construct. And, computationally, SPSS and other statistical packages now allow the user to view the alpha of many alternative scales formed from any motley collection of items with just a few mouse clicks. Unfortunately, whereas alpha has many important uses, it also has important limitations—long known to methodologists, these limitations are less well appreciated by researchers and thus worth reviewing in some detail.

Alpha Is Determined by Both Item Homogeneity (Content Saturation) and Scale Length

The alpha coefficient originated as a generalization of split-half reliability, representing the corrected mean of the reliabilities computed from all possible split-halves of a test. As such, alpha is a function of two parameters: (1) the homogeneity or interrelatedness of the items in a test or scale (as indexed by the mean intercorrelation of all the items on the test, r_ij) and (2) the length of the test (as indexed by the number of items on the test, k). The formula is

    \alpha = k \bar{r}_{ij} / (k \bar{r}_{ij} + (1 - \bar{r}_{ij}))

Conceptually, note that the term on the top of the fraction allows alpha to increase as the number of items on the scale goes up and as the mean intercorrelation between the items on the scale increases. However, to constrain alpha to a range from 0 to 1, the same term repeats at the bottom of the fraction plus a norming term (1 - mean r_ij) that increases the divisor as the mean interitem correlation decreases. If that mean interitem correlation were indeed 1, the norming term would reduce to 0 and alpha would be at its maximum of 1.0 regardless how many items were on the scale. Conversely, if the interitem correlation were 0, the numerator would become 0 and so would alpha, again regardless of the number of items on the scale. Finally, alpha is defined meaningfully only if k ≥ 2 because at least 2 items are needed to compute the mean interitem correlation required for the formula. Consider a questionnaire scale with 9 items and mean r_ij = .42: alpha would be (9 x .42)/(9 x .42 + [1 - .42]) = 3.78/(3.78 + .58) = .87.
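The formula is easy to verify in code (our sketch, with our own function name):

    def alpha_from_mean_r(k, mean_r):
        """Coefficient alpha from scale length k and the mean interitem correlation."""
        return k * mean_r / (k * mean_r + (1 - mean_r))

    print(round(alpha_from_mean_r(9, .42), 2))   # 9 items, mean r = .42 -> 0.87
    print(round(alpha_from_mean_r(6, .52), 2))   # 6 items, mean r = .52 -> also 0.87

Note that a shorter, more homogeneous scale (6 items at mean r = .52) reaches exactly the same alpha as a longer, less saturated one (9 items at mean r = .42); this is the comparison developed in Table 27.3 below.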
What exactly does alpha mean, then? An alpha of .87 means, in plain English, that the total score derived from aggregating these 9 items would correlate .87 with the total score derived from aggregating another (imaginary) set of 9 equivalent items³; that is, alpha captures the generalizability of the total score from one item set to another item set. The term internal consistency is therefore a misleading label for alpha: The homogeneity or interrelatedness of the items on the scale and the length of the scale have been aggregated and thus integrated into the total score, and the generalizability of this total score (i.e., alpha) can therefore no longer tell us anything concrete about the internal structure or consistency of the scale. The hypothetical data presented in Table 27.3 were designed to make these points as concrete and vivid as possible.

Table 27.3 shows the interitem correlation matrices for three hypothetical questionnaire scales. Following Schmitt (1996), we constructed our examples in correlational (rather than covariance) terms for ease of interpretation. Scale A is the one we just considered, with 9 items, a mean interitem correlation of .42, and an alpha of .87. Scale B has 6 items, and it has the same alpha of .87 as Scale A. But note that Scale B attained that alpha in a rather different way. Scale B has 3 fewer items, but this deficiency is offset by the greater homogeneity or content saturation of its items: Its 6 items are more highly intercorrelated (mean r_ij = .52) than are the 9 items of Scale A (mean r_ij = .42). This example illustrates the idea that scale length can compensate for lower levels of interitem correlation, an idea that is implicit in the Spearman-Brown prophecy formula, which specifies the relation between test length and reliability (see, e.g., Lord & Novick, 1968). For any mean interitem correlation, the formula computes how many items are needed to achieve a certain level of alpha. Figure 27.1 shows this relation for mean interitem correlations of .20, .40, .60, and .80. Three points are worth noting. First, the alpha reliability of the total scale always increases as the number of items increases (as long as adding items does not lower the mean interitem correlation). Second, the utility of adding ever more items diminishes quickly, so that adding the 15th item leads to a much smaller increase in alpha than adding the 5th item, just as consuming the 15th beer or chocolate bar adds less enjoyment than consuming the earlier ones. Third, less is to be gained from adding more items if those items are highly intercorrelated (e.g., mean r_ij = .60) than when they show little content saturation (e.g., mean r_ij = .20). The lesson here is that we need to be careful in interpreting alpha; we must recognize that a given magnitude of alpha may be achieved via many possible combinations of content saturation and scale length.

TABLE 27.3. Interitem Correlation Matrix for Three Hypothetical Scales with Equal Coefficient Alpha Reliability

Scale A: 9 items, mean interitem correlation = .42, alpha = .87

      1    2    3    4    5    6    7    8    9
 1
 2   .42
 3   .42  .42
 4   .42  .42  .42
 5   .42  .42  .42  .42
 6   .42  .42  .42  .42  .42
 7   .42  .42  .42  .42  .42  .42
 8   .42  .42  .42  .42  .42  .42  .42
 9   .42  .42  .42  .42  .42  .42  .42  .42

Scale B: 6 items, mean interitem correlation = .52, alpha = .87

      1    2    3    4    5    6
 1
 2   .52
 3   .52  .52
 4   .52  .52  .52
 5   .52  .52  .52  .52
 6   .52  .52  .52  .52  .52

Scale C: 6 items, mean interitem correlation = .52, alpha = .87

      1    2    3    4    5    6
 1
 2   .70
 3   .70  .70
 4   .40  .40  .40
 5   .40  .40  .40  .70
 6   .40  .40  .40  .70  .70

Note. CFA analyses showed that for Scale A, all items load .648 on a single factor; fit is perfect. For Scale B, all items load .721 on a single factor; fit is perfect. In contrast, Scale C is not unidimensional; for the one-factor model, all items load .721 and the standardized root mean residual (RMR) is .124. For two-factor models for Scale C, all items load .837 on their factor; the interfactor correlation is .571 and fit is perfect.

These considerations make clear that alpha is a statistic that applies only to the total score (mean or sum) derived by aggregating across multiple items, observations, or even observers. Often researchers are more concerned about the homogeneity of the items, and the only way to estimate that is a direct index of item content saturation, the simplest being the mean interitem correlation (r_ij).
[Figure 27.1 is a line graph: coefficient alpha (y-axis, 0.0 to 1.0) plotted against the number of items in the scale (x-axis, 1 to 19), with separate curves for mean interitem correlations of .20, .40, .60, and .80.]

FIGURE 27.1. Cronbach's coefficient alpha reliability as a function of the number of items on a scale (k) and the mean of the correlations among all the items (mean r_ij); see the text for the formula used to generate this graph.
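The curves in Figure 27.1 can also be read in reverse: solving the alpha formula for k gives the number of items needed to reach a target alpha at a given mean interitem correlation. A sketch (ours) of this Spearman-Brown-style calculation:

    import math

    def items_needed(target_alpha, mean_r):
        """Smallest k that reaches a target alpha at a given mean interitem r
        (the alpha formula solved for k)."""
        return math.ceil(target_alpha * (1 - mean_r) / (mean_r * (1 - target_alpha)))

    for r in (.20, .40, .60, .80):
        print(r, items_needed(.80, r))   # items needed for alpha = .80: 16, 6, 3, 1
        # (k = 1 is only the formula's limit; alpha itself requires k >= 2)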

We strongly recommend that researchers routinely compute both alpha and the mean interitem correlation (available in the SPSS Reliability program under Statistics). The two indexes provide different information: The mean interitem correlation tells us about the items—how closely are they related? how unique versus redundant is the variance they are capturing?—whereas alpha tells us about the total or aggregated scale score.

Alpha Does Not Index Unidimensionality

The third scale in Table 27.3 highlights a second issue with alpha. Contrary to popular belief, alpha does not indicate the degree to which the interitem intercorrelations are homogeneously distributed, nor does a high alpha indicate that a scale is unidimensional. In fact, although Scale C in Table 27.3 has the same alpha and mean interitem correlation as Scale B, it differs radically in the dispersion (or variance) of the correlations between its items. For Scale B, the correlations are completely homogeneous (all are .52, with a standard deviation, SD, of 0 in this hypothetical example), whereas for Scale C they vary considerably (from .40 to .70, with an SD of .15). Because computation of alpha does not consider this variability, Cortina (1993) derived an index that reflects the spread of the interitem correlations and argued that this index should be reported along with alpha. A large spread in interitem correlations is a bad sign because it suggests that either the test is multidimensional or the interitem correlations are distorted by substantial sampling error. In our example, the pattern of item intercorrelations for Scale C suggests that the problem is multidimensionality. Clearly, the responses to these 6 items are a function of not one, but two, factors: Items 1, 2, and 3 correlate much more substantially with each other (mean r = .70) than they correlate (mean r = .40) with items 4, 5, and 6, which in turn correlate more highly among themselves (mean r = .70). Alpha completely disguises this rather important difference between Scales C and B.

Because alpha cannot address it, unidimensionality needs to be established in other ways, as we describe in later sections of this chapter on structural validity and on model testing; there we also discuss factor analyses of the example data in Table 27.3. Here, it is important to emphasize that the issue of error (or unreliability) present in an item is separate from the issue of multidimensionality. In other words, unidimensionality does not imply lower levels of measurement error (i.e., unreliability), and multidimensionality does not imply higher levels of error. Once we know that a test is not unidimensional, can we go ahead and still use alpha as a reliability index? The answer is no. The reliability of a multidimensional scale can be estimated only with parallel forms, which must have the same factorial structure (Cronbach, 1947, 1951). In fact, if the test is not unidimensional, then alpha underestimates reliability (see Schmitt, 1996, for an example). Thus, if a test is found to be multidimensional, one should score two unidimensional subscales and then use alpha to index their reliabilities separately.²
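Both recommendations are straightforward to implement. The sketch below (ours, not the chapter's) rebuilds the correlation matrices of Scales B and C from Table 27.3, confirms that they yield an identical standardized alpha and mean interitem correlation, and shows how their eigenvalue patterns nonetheless reveal one versus two dimensions:

    import numpy as np

    def alpha_and_mean_r(corr):
        """Standardized alpha and mean interitem r from a k x k correlation matrix
        (correlational metric, as in the Table 27.3 examples)."""
        k = corr.shape[0]
        mean_r = corr[np.triu_indices(k, 1)].mean()
        alpha = k * mean_r / (k * mean_r + 1 - mean_r)
        return alpha, mean_r

    # Scale B: all interitem correlations .52
    scale_b = np.full((6, 6), .52)
    np.fill_diagonal(scale_b, 1.0)

    # Scale C: .70 within item clusters 1-3 and 4-6, .40 between clusters
    within, between = np.full((3, 3), .70), np.full((3, 3), .40)
    scale_c = np.block([[within, between], [between, within]])
    np.fill_diagonal(scale_c, 1.0)

    print(alpha_and_mean_r(scale_b))           # (~.87, .52)
    print(alpha_and_mean_r(scale_c))           # (~.87, .52) -- identical
    print(np.linalg.eigvalsh(scale_b)[::-1])   # one large eigenvalue (3.60)
    print(np.linalg.eigvalsh(scale_c)[::-1])   # two sizable eigenvalues (3.60, 1.20)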

Evaluating the Size of Alpha

According to classical test theory, increasing alpha can have only beneficial effects. As discussed previously, higher reliability means that a greater proportion of the individual differences in measurement scores reflect variance in the construct being assessed (as opposed to error variance), thus increasing the power to detect significant relations between variables. In reality, however, instead of assuming that a bigger alpha coefficient is always better, alpha must be interpreted in terms of its two main parameters—interitem correlation and scale length—and in the context of how these two parameters fit the definition of the particular construct to be measured. In any one context, a particular level of alpha may be too high, too low, or just right.

Alpha and Item Redundancy

Consider a researcher who wants to measure the broad construct of neuroticism—which includes anxiety, depression, and hostility as more specific facets (see Figure 27.2). The researcher has developed a scale with the following items: "I am afraid of spiders," "I get anxious around creepy-crawly things," "I am not bothered by insects" (reverse scored), and "Spiders tend to make me nervous." Note that these items are essentially paraphrases of each other and represent the same item content (arachnophobia or being afraid of insects) stated in slightly different ways. Cattell (1972) considered scales of this kind to be "bloated specifics"—they have high alphas simply because the item content is extremely redundant and the resulting interitem correlations are very high. Thus, alphas in the high .80's or even .90's, especially for short scales, may not indicate an impressively reliable scale but instead signal redundancy or narrowness in item content. Such measures are susceptible to the so-called attenuation paradox: Increasing the internal consistency of a test beyond a certain point will not enhance validity and may even come at the expense of validity when the added items emphasize one narrow part of the construct over other important parts.

An example of a measure with high item redundancy is the 10-item Rosenberg (1979) self-esteem scale, which has alphas approaching .90 and some interitem correlations approaching .70 (Gray-Little, Williams, & Hancock, 1997; Robins, Hendin, & Trzesniewski, 2001). Not surprisingly, some of these items turn out to be almost synonymous, such as "I certainly feel useless at times" and "At times I think I am no good at all." Although such redundant items increase alpha, they do not add unique (and thus incremental) information and can often be omitted in the interest of efficiency, suggesting that the scale can be abbreviated without much loss of information (see, e.g., Robins et al., 2001). More recently, considerable item redundancy was noted by the authors of the original and revised Experiences in Close Relationships questionnaires (ECR and ECR-R; Brennan, Clark, & Shaver, 1998; Fraley, Waller, & Brennan, 2000). One example of a redundant item pair, from the ECR anxiety scale, is "I worry about being abandoned" and "I worry a fair amount about losing my partner." A second example, from the ECR avoidance scale, is "Just when my partner starts to get close to me, I find myself pulling away" and "I want to get close to my partner, but I keep pulling back." We have observed interitem correlations in excess of .70 for each of these pairs (Soto, Gorchoff, & John, 2006).

The fear-of-insects items on the hypothetical neuroticism scale illustrate how easy it is to boost alpha by writing redundant items. However, unless one is specifically interested in insect phobias, this strategy is not very useful. The narrow content representation (i.e., high content homogeneity) would make this scale less useful as a measure of the broader construct of neuroticism. Although the scale may predict the intensity of emotional reactions to spiders with great precision (at high fidelity), it is less likely to relate to anything else of interest because of its very narrow bandwidth. Conversely, broadband measures (e.g., a neuroticism scale) can predict a wider range of outcomes or behaviors but generally do so with lower fidelity. This phenomenon is known as the bandwidth-fidelity tradeoff (Cronbach & Gleser, 1957) and has proven to be of considerable importance in many literatures, including personality traits (Epstein, 1980; John, Hampson, & Goldberg, 1991) and attitudes (Eagly & Chaiken, 1993; Fishbein & Ajzen, 1974). In general, predictive accuracy is maximized when the trait or attitude serving as the predictor is measured at a similar level of abstraction as the criterion to be predicted.
The close connection between the hierarchical level of the construct to be measured and the content homogeneity of the items is illustrated in Figure 27.2. Anxiety, depression, and hostility are three trait constructs that tend to be positively intercorrelated and together define the broader construct of neuroticism (e.g., Costa & McCrae, 1992). (Our initial example of the insect phobia scale might be represented as an even lower-level construct, one of many more specific components of anxiety.) Consider now the anxiety scale on the left side of Figure 27.2. Because its six items represent a narrow range of content (e.g., being fearful, nervous, and worrying), item content should be relatively homogeneous, leading to a reasonably high mean interitem correlation and, with 6 items, a reasonable alpha reliability. Similar expectations should hold for the six-item depression and hostility scales.

Researchers rarely publish or even discuss the interitem correlations, and so far we have focused on hypothetical data to illustrate general issues. How high are typical interitem correlations on personality questionnaire scales? Are they closer to .70, indicating a high degree of redundancy, or to .30, suggesting more modest overlap? Table 27.4 provides real data from a sample of University of California-Berkeley undergraduates (N = 649); for this illustration we used their responses to a subset of 12 neuroticism items selected from Costa and McCrae's (1992) NEO Personality Inventory—Revised (NEO-PI-R) anxiety (A) and depression (D) facet scales. In the NEO-PI-R, each of the Big Five personality domains is defined by six "facet" scales that each have 8 items; the resulting 48-item Big Five scales are very long and thus all have alphas exceeding .90.

We examine the reliability of the anxiety and depression facet scales first. Consider the relevant within-facet interitem correlations that are set in bold in Table 27.4. All these correlations were positive and significant; their mean was .38 for the 6 anxiety items and .39 for the 6 depression items, as shown in Figure 27.2. That is, even for these lower-level facet scales, the items on the scale correlated only moderately with each other. With these moderate interitem correlations, the 6-item facet scales attained alphas of .78 for anxiety and .79 for depression (Figure 27.2).

[Figure 27.2 shows two hierarchical diagrams: on the left, a general NEUROTICISM factor (mean r_ij = .26, alpha = .87) defined by the 6-item ANXIETY (mean r_ij = .38, alpha = .78), DEPRESSION (mean r_ij = .39, alpha = .79), and HOSTILITY (mean r_ij = .33, alpha = .75) facets (items A1-A6, D1-D6, H1-H6); on the right, a 6-item NEUROTICISM scale (mean r_ij = .28, alpha = .70) defined by items A1, A2, D1, D2, H1, and H2.]

FIGURE 27.2. Illustration of hierarchical relations among constructs and homogeneity of item content: A general neuroticism factor and more specific 6-item facets of anxiety (A), depression (D), and hostility (H); the results of internal consistency (alpha) analyses for each scale are shown in parentheses (N = 649).
TABLE 27.4. How High Are Interitem Correlations on Personality Questionnaire Scales?: Interrelations among 12 Neuroticism Items Selected from the NEO-PI-R Anxiety (A) and Depression (D) Facet Scales

      A1   A2   A3   A4   A5   A6   D1   D2   D3   D4   D5   D6
A1
A2   .39
A3   .44  .43
A4   .33  .31  .51
A5   .32  .22  .40  .27
A6   .43  .36  .47  .44  .34
D1   .26  .19  .37  .35  .33  .31
D2   .21  .27  .33  .42  .20  .34  .45
D3   .25  .20  .40  .39  .27  .32  .59  .48
D4   .21  .23  .24  .30  .14  .30  .29  .39  .25
D5   .28  .23  .29  .32  .14  .40  .24  .43  .29  .29
D6   .22  .34  .29  .37  .21  .30  .38  .61  .43  .30  .44

Note. N = 649. For all correlations, p < .01. Items A1-A6 are the first six items of the eight-item NEO-PI-R (Costa & McCrae, 1992) anxiety facet scale; D1-D6 are the first six items of the eight-item depression scale. False-keyed items have been reverse-scored here. Within-facet interitem correlations (mean r = .38) are set in bold, whereas between-facet interitem correlations (mean r = .28) are set in regular type. The interitem correlations for item A4 that are shown in italics illustrate discriminant-validity problems; specifically, three within-facet correlations (with anxiety items 1, 2, and 5) were lower than four of its cross-facet correlations (with depression items 1, 2, 3, and 6). The broader theoretical structure model for these data, and the mean interitem correlations and alpha reliabilities for the resulting scales, are shown in Figure 27.2, and the results of exploratory factor analysis (EFA) and CFA analyses in Figures 27.3 and 27.4, respectively.

Now consider the between-facet interitem correlations, set in regular type; their mean was .28, lower on average than the (bold) within-facet correlations. This is as expected: Although all 12 of these neuroticism items should intercorrelate, the cross-facet (or discriminant) correlations of anxiety items with depression items should be lower than the within-facet (or convergent) correlations. This convergent-discriminant pattern generally held for the individual items, but there was at least one problematic item, A4 ("I often feel tense and jittery"). Consider the italicized interitem correlations for A4. Three of its within-facet correlations (with A1, A2, and A5) were lower than four of its cross-facet correlations (with D1, D2, D3, and D6), indicating that this item did not clearly differentiate anxiety from depression.

The hierarchical-structure model in Figure 27.2 implies that the facet scales should cohere as components of the broader neuroticism domain but also differentiate anxious from depressed mood. How high, then, should facet intercorrelations be? In our student sample, the anxiety scale and depression scale correlated .58; even when corrected for attenuation due to unreliability, the estimated true correlation of .74 remained clearly below 1.0. More generally, using the full-length 8-item NEO-PI-R facet scales, the mean correlation between facets within the same superordinate Big Five domain was .40, indicating that same-domain facet scales share some common variance but also retain some uniqueness. This moderate correlation also reminds us that superordinate dimensions can be measured reliably with relatively heterogeneous "building blocks" as long as there are enough such blocks—six facets per Big Five domain in the case of the NEO-PI-R.

Finally, scales for broadband constructs like neuroticism must address issues of item heterogeneity. Consider a 6-item neuroticism scale (shown on the right side of Figure 27.2), consisting of two anxiety items, A1 and A2; two depression items, D1 and D2; and two hostility items, H1 and H2. As compared with the lower-level facet scales, the item content on this superordinate scale is much more heterogeneous, which should lead to a lower mean interitem correlation and thus a lower alpha, given that scale length is constant at 6 items. Indeed, the analyses in our student sample bear out this prediction; the mean interitem correlation was only .28 and alpha .70.
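These scale-level results follow from the item-level statistics. The sketch below (ours, not the chapter's) uses standard composite-score formulas to recover the observed and disattenuated anxiety-depression correlations from the rounded mean interitem correlations; the results, about .57 and .73, closely approximate the .58 and .74 obtained from the unrounded Table 27.4 values.

    def composite_r(n1, n2, r11, r22, r12):
        """Correlation between two unit-weighted composites of standardized items,
        from item counts, mean within-scale rs, and the mean cross-scale r."""
        var1 = n1 + n1 * (n1 - 1) * r11      # variance of the sum of n1 items
        var2 = n2 + n2 * (n2 - 1) * r22
        cov = n1 * n2 * r12                  # covariance of the two sums
        return cov / (var1 * var2) ** 0.5

    r_obs = composite_r(6, 6, .38, .39, .28)   # observed scale correlation
    r_true = r_obs / (.78 * .79) ** 0.5        # corrected for attenuation
    print(round(r_obs, 2), round(r_true, 2))   # ~0.57, ~0.73 with rounded inputs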
474 ANALYZING AND INTERPRETING PERSONALITY DATA

one might use all 18 items, from A1 to H6, to beyond a certain point, the length of a
measure the superordinate neuroticism con- will be inversely related to its usefulness
struct defined on the left side of Figure 27.2. As many researchers-researchers are
one would expect, the mean interitem correla- under constraints on the recruitment and
tion for the 1S-item scale was .26, just as low sessment of participants. We must also
as that for the 6-item neuroticism scale, but this our desire to maximize interitem com:lat:iOIQS
longer scale had an impressive alpha of .87. As by way of item redundancy. We can
we discuss nC:h.'t, however, the strategy of in- this by making sure that, throughout the
creasing alpha by increasing scale length can be development process, the breadth of the
taken too far. struct we intend to measure is reflected in
breadth of the scale items we intend to me:aSllre
it with.
Alpha and Scale Length
Whereas scales with unduly redundant item
Item Response Theory
content have conceptuallirnitations, scales that
bolster alpha by including a great many items Classical test theory has also been criticized
have practical disadvantages. Overlong scales advocates of item response theory (IRT;
or assessment batteries can produce respondent Embretson, 1996; Embretson & Reise,
fatigue and therefore less reliable responses Mellenbergh, 1996). In classical theory,
(e.g., Burisch, 1984). Lengthy scales also con- characteristics of the individual test taker
sume an inordinate amount of participants' those of the test cannot be
time, making it likely that researchers will use {Hambleton, Swaminathan, & Rogers,
them only if their interests lie solely with the That is, the person's standing on the nnnpclv;;no
construct being measured. Recognition of these construct is defined only in terms of responses
disadvantages has led to a growing number of on the particular test; thus, the same
very brief measures. may appear quite liberal on a test that inc:luljes
Indeed, for very brief scales, alpha may not be a sensible facet of generalizability at all. For example, in our discussion of generalizability theory, we noted that very brief Big Five scales did not have high interitem correlations, because the items were chosen to represent very broad constructs as comprehensively as possible (e.g., Rammstedt & John, 2006, in press). Not surprisingly, these scales had paltry alphas; more important, they showed substantial retest reliability and predicted quite well the longer scales that they were designed to represent, findings that also hold for other innovative short measures, such as the single-item self-esteem scale (Robins et al., 2001).

Resisting the Temptation of Too-High Alphas

What can be done to prevent the construction of scales whose internal consistencies are too high? Some rules of thumb can serve as a start. We suggest that scale developers review all pairs of items that intercorrelate .60 or higher to decide whether one item in the pair may be eliminated. Ultimately, however, there is no foolproof empirical solution; as scale developers, only good judgment can save us from the siren song of inappropriately maximized alphas. Specifically, we must keep in mind that, like many researchers, we are under constraints on the recruitment and assessment of participants. We must also control our desire to maximize interitem correlations by way of item redundancy. We can achieve this by making sure that, throughout the scale development process, the breadth of the construct we intend to measure is reflected in the breadth of the scale items we intend to measure it with.

Item Response Theory

Classical test theory has also been criticized by advocates of item response theory (IRT; e.g., Embretson, 1996; Embretson & Reise, 2000; Mellenbergh, 1996). In classical theory, the characteristics of the individual test taker and those of the test cannot be separated (Hambleton, Swaminathan, & Rogers, 1991). That is, the person's standing on the underlying construct is defined only in terms of responses on the particular test; thus, the same person may appear quite liberal on a test that includes many items measuring extremely conservative beliefs but quite conservative on a test that includes many items measuring radical liberal beliefs. The psychometric characteristics of the test also depend on the particular sample of respondents; for example, whether a belief item from a conservatism scale reliably discriminates high and low scorers depends on the level of conservatism of the sample, so that the same test may work well in a very liberal student sample but fail to make reliable distinctions among the relatively more conservative respondents in an older sample. In short, classical test theory does not apply if we want to compare individuals who have taken different tests measuring the same construct or if we want to compare items answered by different groups of individuals.

Another limitation of classical test theory is the assumption that the degree of measurement error is the same for all individuals in the sample, an implausible assumption given that tests and items differ in their ability to discriminate among respondents at different levels of the underlying construct (Lord, 1984). Moreover, classical theory is test-oriented rather than item-oriented and thus does not make predictions about how an individual or group will perform on a particular item.
These limitations can be addressed in IRT (see Morizot, Ainsworth, & Reise, Chapter 24, this volume). Briefly put, IRT provides quantitative procedures to describe the relation of a particular item to the latent construct being measured in terms of difficulty and discrimination parameters. This information can be useful for item analysis and scale construction, permitting researchers to select items that best measure a particular level of a construct and to detect items biased for particular respondent groups. IRT is increasingly being applied to personality measures, such as self-esteem (Gray-Little et al., 1997) and romantic attachment (Fraley et al., 2000).
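To give a concrete sense of what difficulty and discrimination parameters do, here is a minimal sketch of the two-parameter logistic (2PL) item response function; the parameter values and the conservatism framing are invented for illustration:

```python
import math

def p_endorse(theta, a, b):
    """2PL item response function: probability that a respondent at trait
    level theta endorses an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An "easy" belief item (b = -1) versus a "hard," highly discriminating
# item (a = 2, b = 1.5), evaluated across a range of trait levels:
for theta in (-2, -1, 0, 1, 2):
    print(theta,
          round(p_endorse(theta, a=1.0, b=-1.0), 2),
          round(p_endorse(theta, a=2.0, b=1.5), 2))
```

Because the item response function is defined on the latent trait itself, the same parameters describe the item regardless of whether the sample happens to be liberal students or older, more conservative respondents; this is the sample invariance that classical test theory lacks.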
To summarize, in this section we focused on classical test theory approaches to reliability, the costs associated with low reliability and the practice of correcting for attenuation, specific types of reliability indices, and issues with coefficient alpha (test length, unidimensionality, and construct definitions). In our discussion, we mentioned such concepts as latent (or underlying) constructs, construct definitions, dimensionality, criterion variables, and discriminant relations, but did not discuss them systematically. These concepts are complex and go beyond the classical view of reliability, emphasizing that the meaning and interpretation of measurements is crucial to evaluating the quality of our measurements. Traditionally, issues of score meaning and interpretation are discussed under the heading of validity, to which we now turn.

Construct Validation

Measurements of psychological constructs, such as neuroticism, rejection sensitivity, or smiling, are fundamentally different from basic physical measurements (e.g., mass), which can often be based on concrete standards, such as the 1-kilogram chunk of platinum and iridium that standardizes the measurement of mass. Unfortunately, the behaviorist movement sparked a preoccupation with "gold standards" (or platinum-and-iridium standards) for psychological measures (e.g., Cureton, 1951) that lasted into the 1970s (see Kane, 2004). Eventually researchers came to recognize that for most psychological concepts there exists no single, objective, definitional criterion standard against which all other such measures can be compared.

In the absence of such criterion standards, personality psychologists have long been concerned with ways to conceptualize the validity of their measurement procedures. Although the first American Psychological Association committee on psychological tests distinguished initially among several "types" of validity, Cronbach and Meehl (1955) had already recognized that all validation of psychological measures is fundamentally concerned with what they called construct validity: evidence that scores on a particular measure can be interpreted as reflecting variation in a particular construct (i.e., an inferred characteristic) that has particular implications for human behavior, emotion, and cognition.

Process and Evidence

The idea of construct validity has been elaborated upon over time by such investigators as Loevinger (1957), Cronbach (1988), and Messick (1995), and it is now generally recognized as the central concern in psychological measurement (see also Braun, Jackson, & Wiley, 2002; Kane, 2004). The 1999 edition of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) emphasizes that "validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use" (p. 9). Yet construct validity continues to strike many of us, from graduate students to senior professors, as a rather nebulous or "amorphous" concept (Briggs, 2004), perhaps because there is no such thing as "the construct validity coefficient," no single statistic that researchers can point to as proof that their measure is valid. Because of this, it may be easier to think of construct validity as a process (i.e., the steps that one would follow to test whether a particular interpretation of a particular measure is valid) than as a property (i.e., the specific thing that a measurement interpretation must have in order to be valid).

Keeping with this emphasis on construct validity as a process rather than a property, Smith (2005b) articulated four key steps in the validation process. First, a definition of the theoretical construct to be measured is proposed.
Second, a theory, of which the construct in question is a part, is translated into hypotheses about how a valid measure of the construct would be expected to act. Third, research designs appropriate for testing these hypotheses are formulated. Fourth, data are collected using these designs, and observations based on these data are compared to predictions. (For a similar, process-oriented approach based on definition and evidence, see Kane, 2004.)

A fifth step, revision of the theory, construct, or measure (and repetition of steps one through four), highlights the idea that the construct validation process is a basic form of theory testing: "To validate a measure of a construct is to validate a theory" (Smith, 2005a, p. 413). As with any other theory or model, the validity of the particular score interpretation can never be fully established but is always evolving to form an ever-growing "nomological network" of validity-supporting relations (Wiggins, 1973).

Given that multiple pieces of evidence are needed to cumulatively support the hypothesized construct, it is often difficult to quickly summarize the available validity evidence. For example, Snyder (1987) wrote an entire book to summarize the "validity evidence" (Cronbach, 1988) for his self-monitoring construct, drawing on everything that had been learned about this construct in more than 15 years of empirical research and construct development. More recently, meta-analytic techniques have proven useful to make such summaries more manageable and objective (Schmidt, Hunter, Pearlman, & Hirsch, 1985). Westen and Rosenthal (2003) proposed two heuristic indices to operationalize construct validity in terms of the relative fit of observations to hypotheses, thus addressing the fourth step in Smith's (2005b) process model. Nonetheless, attempts to quantify construct validity remain controversial (cf. Smith, 2005a, 2005b; Westen & Rosenthal, 2005).

Another way to elaborate the notion of construct validity in personality psychology is to consider the kinds of evidence that personality psychologists typically seek as part of the construct validation process (cf. Messick, 1989, 1995). Here we focus on five major forms of evidence; they are listed and defined briefly in Table 27.5.

TABLE 27.5. Five Commonly Sought Forms of Evidence for Construct Validity and Some Examples

1. Generalizability: Evidence that score properties and interpretations generalize across population groups, settings, and tasks.
   Example study designs: Reliability and replication. Test whether score properties are consistent across occasions (i.e., retest reliability), samples, and measurement methods (e.g., self-report and peer report).

2. Content validity: Evidence of content relevance, representativeness, and technical quality of items.
   Example study designs: Expert judgments and review. Test whether experts agree that items are relevant and represent the construct domain; use ratings to assess item characteristics, such as comprehensibility and clarity.

3. Structural validity: Evidence that the internal structure of the measure reflects the internal structure of the construct domain.
   Example study designs: Exploratory or confirmatory factor analysis. Test whether the factor structure of the measure matches the hypothesized structure of the construct.

4. External validity: Evidence that the measure relates to other measures and to nontest criteria in theoretically expected ways.
   Example study designs: Criterion correlation. Test whether measurement scores correlate with relevant criteria (e.g., membership in a criterion group). Multitrait-multimethod matrix. Test whether different measures of the same construct correlate more highly than do measures of different constructs that use the same and different methods (e.g., instruments, data sources).

5. Substantive validity: Evidence that measurement scores meaningfully relate to theoretically postulated domain processes.
   Example study designs: Mediation analysis. Test whether measurement scores mediate the relationship between an experimental manipulation and a behavioral outcome in an expected way.
We emphasize at the outset that this list is not meant to constrain the kinds of evidence that should be considered in the validation process, that particular kinds of evidence may be more or less important for supporting the validity of a particular measurement interpretation, and that these five kinds of evidence are not intended as mutually exclusive categories.

Evidence for Generalizability

Generalizability evidence is needed in a test validation program to demonstrate that score interpretations apply across tasks or contexts, times or occasions, and observers or raters (see Table 27.2). The inclusion of generalizability evidence here makes explicit that construct validation includes consideration of "error associated with the sampling of tasks, occasions, and scorers [that] underlie traditional reliability concerns" (Messick, 1995, p. 746). That is, the notion of generalizability encompasses traditional conceptions of both reliability and criterion validation; they may be considered on a continuum, differing only in how far generalizability claims can be extended (Thorndike, 1997). Traditional reliability studies provide relatively weak tests of generalizability, whereas studies of criterion validity provide stronger tests of generalizability.

For example, generalizing from a test score to another test developed with parallel procedures (e.g., a Form A and Form B of the same test) does increase our confidence in the test but does so only modestly (i.e., providing evidence of parallel-form equivalence). If we find we can also generalize to other times or occasions, our confidence is further strengthened, but not by quite as much as when we can show generalizability to other methods or even to nontest criteria related to the construct the test was intended to measure. Thus, generalizability can be thought of as similar to an onion: not because it smells bad, but because it involves layers. The inner layers represent relatively modest levels of generalization, and the outer layers represent farther-reaching generalizations to contexts that are more and more removed from the central core (i.e., dissimilar from the initial measurement operation).

The kind of validity evidence Messick (1989) considered under the generalizability rubric is crucial for establishing the limits or boundaries beyond which the interpretation of the measure cannot be extended. An issue of particular importance for personality researchers is the degree to which findings generalize from "convenience" samples, such as American college students, to groups that are less educated, older, or come from different ethnic or cultural backgrounds.

Evidence for Content Validity

A second form of validational evidence involves content validity; such evidence is provided most easily if the construct has been explicated theoretically in terms of specific aspects that exhaust the content domain to be covered by the construct. Common problems involve underrepresenting an important aspect of the construct definition in the item pool and overrepresenting another one. An obvious example is the multiple-choice exams we often construct to measure student performance in our classes; if the exam questions do not sample fairly from the relevant textbook and lecture material, we cannot claim that the exam validly represented what students were supposed to learn (i.e., the course content).

Arguments about content validity arise not only between professors and students, but also in research. The Self-Monitoring Scale (see Snyder, 1974) is a good example because it began with a set of 25 rationally derived items; when evidence later accumulated regarding the structure and external correlates of these items, Snyder (e.g., 1987) made revisions to both the construct and the scale, excluding a number of items measuring other-directed self-presentation. As a result, behavioral variability and attitude-behavior inconsistency were represented to a lesser extent in the revised scale. Because all items measuring public performing skills were retained, the construct definition in the new scale shifted toward a conceptually unrelated construct, extraversion (John, Cheek, & Klohnen, 1996). This example shows that discriminant aspects are also important in content validation: To the extent that the items measure aspects not included in the construct definition, the measure would be contaminated by construct-irrelevant variance. For example, when validating scales to measure coping or emotion regulation, the item content on such scales should not assess variance that must be attributed to distinct constructs, such as psychological adjustment or social outcomes that
are theoretically postulated to be direct consequences of the regulatory processes the scales are intended to assess (see, e.g., John & Gross, 2004, 2007).

To address questions about content validity, researchers may use a number of validation procedures (see also Smith & McCarthy, 1995). Researchers might ask expert judges to review the match between item representation and construct domain specification, and these conceptual-theoretical judgments can then be used to add or delete items. For example, Jay and John (2004) adopted the Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) symptom list for major depression as an explicit construct definition for their California Psychological Inventory (CPI)-based Depressive Symptom (DS) scale. Advanced graduate students in clinical psychology then provided expert judgments classifying all the proposed DS items, as well as the items of several other depression self-report scales, according to the DSM symptoms. Moreover, in an effort to address discriminant validity early in the scale construction process, the judges were also given the choice to classify items as more relevant to anxiety than to depression, reasoning that depression needs to be conceptually differentiated from anxiety and therefore anxiety items should not appear on a depressive symptom scale. In this way, Jay and John (2004) were able to (1) focus their scale on item content uniquely related to depression rather than anxiety symptoms, (2) examine how comprehensively their DS item set represented the intended construct (e.g., they found that only one DSM symptom cluster, suicidal ideation, was not represented), and (3) compare the construct representation of the DS items with those of other commonly used depression scales.
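The core of this expert-judgment step is a simple consensus rule, sketched below with invented items, votes, and a two-thirds agreement threshold; none of these are Jay and John's (2004) actual materials:

```python
# Toy expert-classification step: keep a candidate item for the Depressive
# Symptom scale only if most judges classify it as depression rather than
# anxiety. Items and votes are fabricated for illustration.

judgments = {
    "I feel sad most of the day":     ["depression", "depression", "depression"],
    "I worry about many things":      ["anxiety", "anxiety", "depression"],
    "I have trouble enjoying things": ["depression", "depression", "anxiety"],
}

def keep_item(votes, target="depression", threshold=2 / 3):
    return votes.count(target) / len(votes) >= threshold

for item, votes in judgments.items():
    print(keep_item(votes), item)  # True, False, True
```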
This theoretically based approach, when applied to questionnaire construction, has become known as the rational-intuitive approach; it has been widely used by personality and social psychologists focused on measuring theoretically postulated constructs (e.g., Burisch, 1986). Probably the most explicitly rational approach to construct definition in personality psychology is Buss and Craik's (1983) act frequency approach: Selection of act items was based not on an abstract theoretical definition of each trait construct but on folk wisdom, captured in terms of college students' aggregated judgments of the prototypicality (or relevance) of a large number of acts for a particular trait (Buss & Craik, 1983).

Content validity can also be considered in the context of the quality and adequacy of formal or technical aspects of items. In the domain of self-reports and questionnaires, it is important to recognize that the researcher is trying to communicate accurately and efficiently with the research participants, and thus formal item characteristics, such as the clarity of wording, easy comprehensibility, low ambiguity, and so on, are crucial linguistic and pragmatic concerns in the design of items (e.g., Angleitner, John, & Lohr, 1986).

Evidence for Structural Validity

Structural validity requires evidence that the correlational (or factor) structure of the items on the measure is consistent with the hypothesized internal structure of the construct domain. We noted the issue of multidimensionality in the section on reliability, pointing out that coefficient alpha does not allow inferences about the dimensionality of a measure. The structure underlying a measure or scale is not an aspect of reliability; rather, it is central to the interpretation of the resulting scores and thus needs to be addressed as part of the construct validation program. Researchers have used both exploratory and confirmatory factor analysis for this purpose; we return to this important issue below in the context of evaluating measurement with structural equations models.

Evidence for External Validity: Convergent and Discriminant Aspects

External validity has been at the core of what most personality psychologists think validity is all about: How well does a test predict conceptually relevant behaviors, outcomes, or criteria? Wiggins (1973, p. 406) argued that prediction "is the sine qua non of personality assessment," and Dawes and Smith (1985, p. 512) suggested that "the basis of all measurement is empirical prediction." Obviously, it makes sense that a test or scale should predict construct-relevant criteria. It is less apparent that we also need to show that the test does not predict conceptually unrelated criteria. In other words, a full demonstration of external aspects of construct validation requires a demonstration of both what the test measures and what it does not measure.
Predicting Criterion Group Membership and Nontest (External) Criterion Variables

A long-popular method for demonstrating the external validity of a measure was to test whether the measure can successfully distinguish between criterion groups: groups that are presumed to differ substantially in their levels of the construct to be measured. For example, the Minnesota Multiphasic Personality Inventory (MMPI; Hathaway & McKinley, 1943) and the CPI (Gough, 1957) were among the best-known personality inventories developed according to the criterion (or contrast) group approach. Items were selected for the MMPI depression scale if they could distinguish patients hospitalized with a diagnosis of clinical depression from nonpsychiatric control subjects. Gough (1957) selected items for his achievement via conformance scale if they predicted grade point average (GPA) in high school (assumed to reflect conventional achievement requiring rule following) and for his achievement via independence scale if they predicted GPA in college (assumed to reflect more autonomous pursuit of achievement goals and interests). More recently, Cacioppo and Petty (1982) developed the need for cognition scale to measure individual differences in the preference for and enjoyment of effortful thinking. As part of their construct validation program, they conducted a study comparing college professors (assumed to be high in need for cognition) and assembly line workers (assumed to be lower in need for cognition). Consistent with the interpretation of their measure as reflecting individual differences in need for cognition, the mean score of the professors was much higher than the mean score of the assembly line workers. Gosling, Kwan, and John (2003) validated owners' judgments of their dogs' Big Five personality traits by showing that these judgments predicted relevant behavior in a dog park, as rated by strangers who interacted with the dogs for an hour.
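At its statistical core, such criterion-group evidence is a standardized mean difference. The scores below are fabricated to mimic the need-for-cognition comparison; only the logic, not the data, comes from the studies just described:

```python
from statistics import mean, stdev

# Fabricated scale scores for two criterion groups.
professors = [4.2, 4.5, 3.9, 4.8, 4.1, 4.6]
assembly_workers = [2.9, 3.1, 2.5, 3.4, 2.8, 3.0]

# Cohen's d with a simple pooled standard deviation (equal group sizes).
pooled_sd = ((stdev(professors) ** 2 + stdev(assembly_workers) ** 2) / 2) ** 0.5
cohens_d = (mean(professors) - mean(assembly_workers)) / pooled_sd
print(round(cohens_d, 2))  # a large d means the groups are well separated
```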
A critical issue with the use of such external criteria is the "gold standard" problem mentioned earlier: the convergent and discriminant construct validity of the criterion itself is typically not well established. For example, patients with a diagnosis of major depression may be comorbid with other disorders (e.g., anxiety) or may have been hospitalized for construct-irrelevant reasons (e.g., depressed individuals lacking social or financial support are more likely to be hospitalized), just as college professors likely differ from assembly line workers in more ways than just their personal need for cognition. In recognition of this problem, Gosling, Kwan, and John (2003) tried to rule out potential confounds, such as that the observers' behavior ratings of the dogs in the park were not based simply on appearance and breed stereotypes that may be shared by both owners and strangers.

Multitrait-Multimethod Matrix

Campbell and Fiske (1959) introduced the terms convergent and discriminant to distinguish demonstrations of what a test measures from demonstrations of what it does not measure. The convergent validity of a self-report scale of need for cognition could be assessed by correlating the scale with independently obtained peer ratings of the subject's need for cognition and with frequency of effortful thinking measured by "beeping" the subject several times during the day. Discriminant validity could be assessed by correlating the self-report scale with peer ratings of extraversion and a beeper-based measure of social and sports activities. Campbell and Fiske were the first to formalize these ideas of convergent and discriminant validity into a single systematic design that crosses multiple traits or constructs (e.g., need for cognition and extraversion) with multiple methods (e.g., self-report, peer ratings, and beeper methodology). They called this design a multitrait-multimethod (MTMM) matrix, and the logic of the MTMM is both intuitive and compelling.

What would we expect for our need for cognition example? Certainly, we would expect sizable convergent validity correlations between the need for cognition measures across the three methods (self-report, peer report, beeper); because these correlations involve the same trait but different methods, Campbell and Fiske (1959) called them monotrait-heteromethod coefficients. Moreover, given that need for cognition is theoretically unrelated to extraversion, we would expect small discriminant correlations between the need for cognition measures and the extraversion measures. This condition should hold even if both traits are measured with the same method, leading to so-called heterotrait-monomethod correlations. Certainly, we want each of the convergent correlations to be substantially higher than the discriminant correlations involving the same trait. And finally, the same patterns of
intercorrelations between the constructs should emerge, regardless of the method used; in other words, the relations between the constructs should generalize across methods.
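These comparisons can be checked mechanically. The tiny matrix below crosses two traits (need for cognition, NFC; extraversion, EXT) with two methods (self- and peer report), using invented correlations rather than data from any actual study:

```python
# Invented MTMM correlations for trait_method variable pairs.
r = {
    ("NFC_self", "NFC_peer"): 0.55,  # monotrait-heteromethod (convergent)
    ("EXT_self", "EXT_peer"): 0.60,  # monotrait-heteromethod (convergent)
    ("NFC_self", "EXT_self"): 0.15,  # heterotrait-monomethod (discriminant)
    ("NFC_peer", "EXT_peer"): 0.20,  # heterotrait-monomethod (discriminant)
    ("NFC_self", "EXT_peer"): 0.10,  # heterotrait-heteromethod (discriminant)
    ("NFC_peer", "EXT_self"): 0.05,  # heterotrait-heteromethod (discriminant)
}

convergent = [v for k, v in r.items() if k[0][:3] == k[1][:3]]    # same trait
discriminant = [v for k, v in r.items() if k[0][:3] != k[1][:3]]  # different traits

# Campbell and Fiske's basic requirement: every convergent correlation
# should exceed every discriminant correlation.
print(all(c > d for c in convergent for d in discriminant))  # True
```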

Method Variance

An important recognition inherent in the MTMM is that we can never measure a trait or construct by itself; rather, we measure the trait intertwined with the method used: "Each measure is a trait-method unit in which the observed variance is a combined function of variance due to the construct being measured and the method used to measure that construct" (Rezmovic & Rezmovic, 1981, p. 61). The design of the MTMM is so useful because it allows us to estimate variance in our scores that is due to method effects, that is, errors systematically related to our measurement methods and thus conceptually quite different from the notion of random error in classical test theory. These errors are systematic because they reflect the influence of unintended constructs on scores, that is, unwanted variance: something we did not wish to measure but that is confounding our measurement (Ozer, 1989).

Method variance is indicated when two constructs measured with the same method (e.g., self-reported attitudes and self-reported behavior) correlate more highly than when the same constructs are measured with different methods (e.g., self-reported attitudes and behavior coded from videotape). Response styles, such as acquiescence, may contribute to method variance when the same respondent completes more than one measure (Soto, John, Gosling, & Potter, 2006). Another example involves positivity bias in self-perceptions, which some researchers view as psychologically healthy (Taylor & Brown, 1988). However, if positivity bias is measured with self-reports and the measure of psychological health is a self-report measure of self-esteem, then a positive intercorrelation between these measures may not represent a valid hypothesis about two constructs (positivity bias and psychological health), but shared self-report method variance associated with narcissism (John & Robins, 1994); that is, individuals who see themselves too positively may be narcissistic and also rate their self-esteem too highly. Discriminant validity evidence is needed to rule out this alternative hypothesis, and the construct validity of the positivity bias measure would be strengthened considerably if psychological health were measured with a method other than self-report, such as ratings by clinically trained observers (Jay & John, 2004).

LOTS: Multiple Sources of Data

Beginning with Cattell (1957, 1972), psychologists have tried to classify the many sources researchers can use to collect data into a few broad categories. Because each data source has unique strengths and limitations, the construct validation approach emphasizes that we should collect data from lots of different sources, and so the acronym LOTS has particular appeal (Block & Block, 1980; see also Craik, 1986). L data refer to life event data that can be obtained fairly objectively from an individual's life history or life record, such as graduating from college, getting married or divorced, moving, socioeconomic status, memberships in clubs and organizations, and so on. Examples of particularly ingenious measures derived from L data are counts of bottles and cans in garbage containers to measure alcohol consumption (Webb, Campbell, Schwartz, Sechrest, & Grove, 1981), police records of arrests and convictions to measure antisocial behavior (Caspi et al., 2005), and the use of occupational, marital, and family data to score the number of social roles occupied by an individual (Helson & Soto, 2005).

O data refer to observational data, ranging from observations of very specific aspects of behavior to more global ratings (see Bakeman, 2000; Kerr, Aronoff, & Messe, 2000). Examples are careful and systematic observations recorded by human judges, such as in a particular laboratory setting or carefully defined situation; behavior coded or rated from photos or videos; and, broader still, reports from knowledgeable informants, such as peers, roommates, spouses, teachers, and interviewers, that may aggregate information across a broad range of relevant situations in the individual's daily life. O data obtained through unobtrusive observations or coded later from videotape can be particularly useful to make inferences about the individual's attitudes, prejudices, preferences, emotions, and other attributes of interest to social scientists. Harker and Keltner (2001) used ratings of emotional expressions in women's college yearbook photos to predict marital and well-being outcomes 30 years later.
Gross and Levenson (1993) used frequency of blinking while watching a disgust-eliciting film as an index of distress. Fraley and Shaver (1998) observed and coded how different romantic couples behaved as they were saying good-bye to one another at an airport and found that separation behavior was related to adult attachment style. Another nice illustration is a study that recorded seating position relative to an outgroup member to measure ethnocentrism (Macrae, Bodenhausen, Milne, & Jetten, 1994).

T data refer to information from test situations that provide standardized measures of performance, motivation, or achievement, and from experimental procedures that have clear and objective rules for scoring performance. A timed intelligence test is the most obvious example; other examples include assessments of the length of time an individual persists on a puzzle or delays gratification in a standardized situation (Ayduk et al., 2000). Reaction times are frequently used in studies of social cognition, providing another kind of objective measure of an aspect of performance. Recently, the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998), which uses reaction-time comparisons to infer cognitive associations, has become a popular method of assessing implicit aspects of the self-concept; Greenwald and Farnham (2000) provided evidence for the external and discriminant validity of IAT measures of self-esteem and masculinity-femininity.

Finally, S data refer to self-reports. S data may take various forms. Global self-ratings of general characteristics and true-false responses to questionnaire items have been used most frequently. However, self-reports are also studied in detailed interviews (see Bartholomew, Henderson, & Marcia, 2000), in narratives and life stories (see Smith, 2000), and in survey research (Visser, Krosnick, & Lavrakas, 2000). Daily experience sampling procedures (see Reis & Gable, 2000) can provide very specific and detailed self-reports of moment-to-moment functioning in particular situations.

The logic underlying S data is that individuals are in a good position to report about their psychological processes and characteristics; unlike an outside observer, they have access to their private thoughts and experiences, and they can observe themselves over time and across situations. However, the validity of self-reports depends on the ability and willingness of individuals to provide valid reports, and self-reports may be influenced by various constructs other than the intended one. Systematic errors include, most obviously, individual differences in response or rating scale use, such as acquiescence (see McCrae, Herbst, & Costa, 2001; Soto, John, Gosling, & Potter, 2006; Visser et al., 2000) and response extremeness (Hamilton, 1968). Another potential source of error is reconstruction bias, in which individuals' global or retrospective ratings of emotions and behaviors differ substantially from their real-time or "online" ratings (Scollon, Diener, Oishi, & Biswas-Diener, 2004).

Moreover, some theorists have argued that self-reports are of limited usefulness because they may be biased by social desirability response tendencies. Two kinds of desirability biases have been studied extensively (for a review, see Paulhus, 2002; also Paulhus & Vazire, Chapter 13, this volume). Impression management refers to deliberate attempts to misrepresent one's characteristics (e.g., "faking good"), whereas self-deceptive enhancement reflects honestly held but unrealistic self-views. Impression management appears to have little effect in research contexts where individuals participate anonymously and are not motivated to present themselves in a positive light; self-deception is not simply a response style but is related to substantive personality characteristics, such as narcissism (Paulhus & John, 1998).

Fortunately, although personality psychologists still use self-report questionnaires and inventories most frequently, other methods are available and used (Craik, 1986; Craik, Chapter 12, this volume; Robins, Tracy, & Sherman, Chapter 37, this volume). Thus, measures based on L, O, and T data can help evaluate and provide evidence for the validity of more easily and commonly obtained self-report measures tapping the same construct. Unfortunately, research using multiple methods to measure the same construct has not been very frequent. Overall, multimethod designs have been underused in construct validation efforts. Researchers seem more likely to talk about the MTMM approach than to go to the trouble of actually using it.

There is an extensive and useful methodological literature on the MTMM, which began in the mid-1970s when SEM became available and provided powerful analytical tools to estimate separate trait and method factors (e.g., Kenny, 1976; Schwarzer, 1986; Wegener & Fabrigar, 2000).
A number of excellent reviews and overviews are also available. For example, Judd and McClelland (1998) describe a series of examples that illustrate Campbell and Fiske's (1959) original principles of convergent and discriminant validation as well as the application of SEM techniques to estimate separate trait and method effects. For specific issues in fitting SEM models, see Kenny and Kashy (1992) and Marsh and Grayson (1995). Hypothetical data may be found in West and Finch (1997), who illustrate three scenarios: (1) convergent and discriminant validity with minimal method effects, (2) strong method effects, and (3) effects of unreliability and lack of discriminant validity. John and Srivastava (1999) modeled trait and instrument effects with data for three commonly used Big Five instruments.

Evidence for Substantive Validity

The final form of validational evidence in Table 27.5 involves substantive validity. Substantive validation studies make use of substantive theories and process models to further support the interpretation of the test scores. The strongest evidence for substantive validity comes from studies that use experimental manipulations that directly vary the processes in question. For example, Petty and Cacioppo (1986) showed that the process of attitude change was mediated by need for cognition. Individuals scoring high on the scale were influenced by careful examination of the arguments presented in a message, whereas those scoring low were more influenced by extraneous aspects of the context or message (e.g., the attractiveness of the source of the message). Another example is Paulhus, Bruce, and Trapnell's (1995) use of experimental data to examine an aspect of the substantive validity of two social desirability scales. When subjects were asked to intentionally present themselves in a favorable way (e.g., as they might during a job interview), the self-presentation scale showed the predicted increase over the standard "honest self-description" instruction, but the self-deception scale did not, just as one would expect for a scale designed to measure unrealistically positive self-views that the individual believes are true of him or her.
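Table 27.5 lists mediation analysis as the prototypical design for substantive validity, and its arithmetic can be sketched in a few lines. The standardized path values below are invented; a real analysis would estimate them from data and add significance tests (e.g., bootstrapped confidence intervals for the indirect effect):

```python
# Bare-bones indirect-effect decomposition for a mediation design:
# manipulation -> mediator (a), mediator -> outcome (b), and the total
# effect of the manipulation on the outcome (c). All values are hypothetical.

a = 0.45   # manipulation -> measured mediator (e.g., effortful processing)
b = 0.50   # mediator -> attitude change, controlling the manipulation
c = 0.40   # total effect of the manipulation on attitude change

indirect = a * b        # effect transmitted through the mediator
direct = c - indirect   # remainder, under a standardized linear model
print(round(indirect, 3), round(direct, 3))  # 0.225 0.175
```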
Substantive validity, then, is really about testing theoretically derived propositions about how the construct in question should function in particular kinds of contexts and how it should influence the individual's behavior, thoughts, or feelings. In that sense, studies of substantive validity are at the boundary between validational concerns and broader concerns with theory building and testing. The concept of substantive validity thus serves to illustrate the back and forth (dialectic) of theory and research. That is, when a study fails to show the effect predicted for a particular construct, it is unclear whether the problem involves a validity issue (i.e., the measure is not valid), or faulty theorizing (i.e., the theory is wrong), or both.

The consideration of substantive aspects of validity illustrates that ultimately measurement cannot be separated from theory, and a good theory is one that includes an account of the relevant measurement properties of its constructs. For example, a theory of emotion might distinguish among multiple emotion components, such as subjective experience, emotion-expressive behavior, and physiological response patterns (see, e.g., Gross, 1999), and specify how these components are most validly measured, such as emotion experience with particular kinds of self-report measures, emotion expressions with observer codings from video recordings of the individual, and physiological responding with objective tests. How exactly these three emotion components ought to be related to each other is foremost a theoretical issue but also involves substantive validity issues; for example, if a study were to show zero correlations between measures of sadness experience and measures of sadness expression, the theoretical notion of emotion as a unitary and coherent construct may have to be modified because method variance and systematic factors (e.g., display rules; individual differences in expressivity) might influence the coherence of emotion experience and expression for particular emotions and particular individuals (see, e.g., Gross, John, & Richards, 2000). Substantive validity, then, is the broadest of all five forms of validational evidence and, ultimately, indistinguishable from using theory testing to build the nomological network for a construct.

Construct Validation: Summary and Conclusions

To summarize, in this section we reviewed five forms of evidence central to a program of construct validation (see Table 27.5).
We considered one of them, external validation, in some detail, highlighting the need to consider both convergent and discriminant aspects and illustrating the multitrait-multimethod approach, the nature of method variance, and multiple sources of data. Clearly, these five forms of validational evidence are not perfectly delineated, and they overlap somewhat. Consider, for example, a study finding a high 2-year retest correlation. For a measure of a trait like extraversion, that finding could be considered evidence both for generalizability (because scores were consistent across time and testing situations) and for substantive validity (because extraversion is conceptualized as a trait construct predicted to show substantial levels of temporal stability). In contrast, finding the same high retest correlation for a measure of an emotional state would not be reassuring but would undermine its substantive validity, because specific emotional states are assumed to fluctuate across time and situations (e.g., Watson & McKee Walker, 1996).

Nonetheless, despite their imperfections, these five forms of validational evidence include most of the validity concerns that are important to personality research; thus, Table 27.5 provides a reasonably comprehensive and heuristically useful list for personality researchers to consider as they plan a program of validation research. More specifically, we want to reiterate two important points that are familiar to methodologists but have not yet been adopted in our empirical journals: (1) evidence concerning traditional issues of reliability ought to be part of the construct validation program, under the heading of generalizability; and (2) evidence about dimensionality must also be included in validation research, under the heading of structural validity. In the following section, we reconsider these issues, now from the perspective of the measurement model in SEM.

Model Testing in Construct Validation and Scale Construction

The measurement model in structural equation modeling (SEM; Joereskog & Sorbom, 1981; see also Bentler, 1980) is based on confirmatory factor analysis (CFA). Kline (2004), Loehlin (2004), McArdle (1996), and Bollen and Long (1993) have provided readable introductions. CFA is particularly promising because it provides a general analytic approach to assessing construct validity. As will become clear, convergent validity, discriminant validity, and random error can all be addressed within the same general framework. To illustrate these points, we return to our earlier numerical examples (Tables 27.3 and 27.4) and show how these data can be analyzed and understood using CFA-based measurement models.

Measurement Models in SEM: Convergent Validity, Discriminant Validity, and Random Error

Like all factor analytic procedures (Floyd & Widaman, 1995; Tinsley & Tinsley, 1987; see also Lee & Ashton, Chapter 25, this volume), CFA assumes that a large number of observations or items are a direct result (or expression) of a smaller number of latent sources (i.e., unobserved, hypothetical, or inferred constructs). However, CFA eliminates some of the arbitrary features often criticized in exploratory factor analysis (Gould, 1981; Sternberg, 1985). First, CFA techniques require the researcher to specify an explicit model (or several competing models) of how the observed (or measured) variables are related to the hypothesized latent factors. Second, CFA offers advanced statistical techniques that allow the researcher to test how well the a priori model fits the particular data; even more important, CFA permits comparative model testing to establish whether the a priori model fits the data better (or worse) than plausible alternative or competing models.

CFA models can also be displayed graphically, allowing us to effectively communicate the various assumptions of each model. Two examples are shown in Figure 27.3. Figure 27.3a shows a common-factor model in which a single underlying construct (neuroticism, shown as an ellipse at the top) is assumed to give rise to the correlations between all 12 items, or responses A1 to D6 (the observed variables, shown in squares). Following convention (Bentler, 1980), ellipses are used to represent latent variables, whereas squares represent measured (or manifest) variables; arrows with one head represent directed or regression parameters, whereas two-headed arrows (which are often omitted) represent covariances (or undirected parameters). Note that each measured variable has two arrows leading to it.
[Path diagrams for Figure 27.3, panels (a) and (b), appear here.]

FIGURE 27.3. Two alternative measurement models for 12 neuroticism items from the NEO-PI-R. N = 649. In each panel, the top row of values represents estimates of standardized regression coefficients (i.e., factor loadings in correlational metric). The bottom row of values provides estimates of error variances (i.e., uniquenesses expressed in variance terms). Panel (a) shows parameter estimates for the general neuroticism factor model (Model 1 in Table 27.6); panel (b) shows parameter estimates for the best-fitting model with both the anxiety and depression factors (Model 4 in Table 27.6).
The arrow from the latent construct is a factor loading (L_m) that represents the strength of the effect that the latent construct has on each observed variable. The other arrow involves another latent variable for each observed variable; these are unique factor scores (e_m) that represent the unique or residual variance remaining in each observed variable.

Conceptually, this model captures a rather strong structural hypothesis, namely, that the 12 observed variables covary only because they all measure the same underlying construct, neuroticism. In other words, we hypothesize that the only thing the items have in common is the one latent construct, and all remaining or residual item variance is idiosyncratic to each item and thus unshared. This structural model provides a new perspective on how to define two important terms we have used in this chapter: the convergent validity of the item and random error. In particular, the loading of an item on the construct of interest represents the convergent validity of the item, whereas its unique variance represents random error. However, in this simple measurement model, we cannot address discriminant validity.
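The hypothesis that items covary only through the shared factor can be made tangible with a few lines of simulation; the loading, the number of items, and the sample size below are arbitrary illustrations. Under the model, any two items should correlate at the product of their loadings:

```python
import random

random.seed(27)
n, loading = 5000, 0.7

# Generate items that share nothing except one latent factor.
data = []
for _ in range(n):
    factor = random.gauss(0, 1)                  # latent construct score
    unique_sd = (1 - loading ** 2) ** 0.5        # spread of the unique part
    items = [loading * factor + unique_sd * random.gauss(0, 1)
             for _ in range(3)]
    data.append(items)

def corr(i, j):
    xi = [row[i] for row in data]
    xj = [row[j] for row in data]
    mi, mj = sum(xi) / n, sum(xj) / n
    cov = sum((a - mi) * (b - mj) for a, b in zip(xi, xj)) / n
    vi = sum((a - mi) ** 2 for a in xi) / n
    vj = sum((b - mj) ** 2 for b in xj) / n
    return cov / (vi * vj) ** 0.5

print(round(corr(0, 1), 2))  # approximately loading**2 = 0.49
```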
Compare the measurement model in Figure 27.3a to the one in Figure 27.3b, which postulates two factors (anxiety and depression) influencing responses to the same 12 items. Here, by hypothesizing two distinct constructs, the model incorporates an additional condition, known as simple structure. The convergent validity loadings (represented by arrows from the latent constructs to the observed items) indicate that the first 5 items are influenced by the first construct but not the second construct, whereas the last 6 items are influenced only by the second construct and not the first. In other words, 11 of these items (all except item A4) can be uniquely assigned to only one construct, thus greatly simplifying the measurement model.

With two constructs in the measurement model, we can also address issues of discriminant validity. Whereas an item's loading on the construct of interest represents convergent validity and its unique variance random error, its loading on a construct other than the intended one speaks to its discriminant validity. For item A4, our earlier correlational analyses had suggested discriminant validity problems (see Table 27.4), and we therefore examined one model (Model 4, which is shown in Figure 27.3b) that allowed item A4 to have both a convergent loading (on its intended anxiety factor) and a discriminant loading (on the depression factor).

Note that this model includes an arrow between the two constructs, indicating a correlation (or covariance); the two constructs are not independent (orthogonal) but related (oblique). This correlation tells us about discriminant validity at the level of the constructs. If the correlation is very high (e.g., .90), we would worry that the two constructs are not sufficiently distinct and that we really have only one construct. If the correlation is quite low (e.g., .10), we would be reassured that the two concepts show good discriminant validity with respect to each other. There is another possibility here, namely, that the two constructs are substantially correlated (e.g., .70) because they are both facets, or components, of a broader, superordinate construct that includes them both, which, of course, is the hierarchical model of the NEO-PI-R from which this example was taken. Note that here we are addressing issues that involve questions about the dimensionality and internal structure of the constructs being measured. We discussed these issues earlier in the section on reliability, especially coefficient alpha, but, as we argued in the section on validity, dimensionality issues should be considered part of the construct validation program (see Table 27.5) because they concern the structural validity of the interpretation of our measures.

Structural Validity Examined with SEM

Structural validity issues resurface with great regularity in the personality literature. Some of the most popular constructs have endured protracted debates focused on their structural validity: self-monitoring, attributional style, hardiness, the Type A coronary-prone behavior pattern, and, most recently, need for closure (e.g., Hull, Lehn, & Tedlie, 1991; Neuberg, Judice, & West, 1997). Part of the problem is that many of these constructs, and the scales designed to measure them, were initially assumed to be unidimensional, but later evidence challenged those initial assumptions. It is therefore instructive to consider how SEM approaches can help address the underlying issues and to provide some numerical examples to illustrate the issues.
Testing the Unidimensionality of Scales: What Coefficient Alpha Reliability Cannot Do

As we noted earlier, coefficient alpha does not address whether a scale is unidimensional. For this purpose, factor analyses are needed; CFA provides the most rigorous approach because it can test how well the interitem correlation matrix for a particular scale fits a single-factor, rather than multifactor, model. In other words, how well can the loadings on a single factor reproduce the correlation matrix actually observed?

Consider again the three scales for which we presented interitem correlation data in Table 27.3; all had alphas of .87. What do CFAs of these correlation matrices tell us about their dimensionality? One-factor models perfectly fit the data pattern for both Scales A and B, just as expected for these unidimensional scales. CFA also estimates factor loadings for the items, providing an index of content saturation. For Scale A (which had nine items all intercorrelating .42), the items all had the same factor loading of .648 (i.e., the square root of .42, which was the size of the interitem correlations in this example). For Scale B (six items all intercorrelating .52), the factor loadings were all .721; this slightly higher value reflects that the interitem correlations (and thus content saturation) were slightly higher for Scale B than for Scale A.

In contrast, for Scale C (which had heterogeneous interitem correlations of .40 and .70, averaging .52), the fit of the one-factor model was unacceptable (for N = 200, CFI = .726, and root mean square error of approximation [RMSEA] = .263). Note, however, that the item loadings were all .721 (i.e., the square root of .52, which was the mean of the interitem correlations), the same as for the truly unidimensional Scale B. As expected, the two-factor model fit Scale C better than did the one-factor model, and perfect fit was obtained when we allowed the factors to correlate. Reflecting their .70 correlations with each other, Scale C items 1, 2, and 3 loaded .837 on factor 1 and 0 on factor 2, whereas items 4, 5, and 6 loaded 0 on factor 1 and .837 on factor 2. The interitem correlation of .40 across the two subsets of items was reflected in an estimated correlation of .571 between the two latent factors.
of .571 between the two latent factors. pression factor. Overall, there was some evi-
These results highlight that the error (or un- dence of simple structure, in that the items
reliability) present in an item is separate from tended to cluster close to their anticipated fac-
[Loading plot for Figure 27.4 appears here; the axes range from -1.00 to 1.00, with the x-axis poles labeled Low/High Anxiety, the y-axis poles labeled Low/High Depression, and the dashed diagonal labeled Low/High Neuroticism.]

FIGURE 27.4. Loading plot for exploratory factor (principal components) analysis of 12 NEO-PI-R neuroticism items, with 6 items each from the anxiety (A) and depression (D) facet scales (see Table 27.4 and Figure 27.2). N = 649 college students. An "(r)" following the item number indicates items to be reverse-keyed (i.e., items measuring low anxiety or low depression). The dashed line indicates the first unrotated component representing the general neuroticism factor. Note how the loadings for item A4 ("I often feel tense and jittery") place it about halfway between the anxiety and depression factor axes and close to the general neuroticism factor, suggesting that this item is a better indicator of general distress (Watson et al., 1995) shared by both anxiety and depression than a unique (or primary) indicator of anxiety.
But the simple structure is not perfect; as expected, the items formed a positive manifold (all items fall either in the high-high or low-low quadrant), with the two a priori facet scales forming reasonably separable anxiety and depression clusters, especially in the lower-left quadrant that includes the low-anxiety and low-depression (reverse-keyed) items.

The dashed line represents the location of the first unrotated component, that is, the general neuroticism factor that accounted for almost 40% of the item variance; all 7 true-keyed items had substantial positive loadings on it, and all 5 false-keyed items had
This factor clearly captures the positive correlation (r = .58) between the two item sets. Note how the loadings for item A4 ("I often feel tense and jittery") place it about halfway between the orthogonal anxiety and depression factor axes and rather close to the general neuroticism factor; this item may be a better indicator of general distress (Watson et al., 1995) shared by both anxiety and depression than a unique indicator of anxiety.
These exploratory factor analyses leave us with some alternative hypotheses that we can test formally using CFA. The CFA results are summarized briefly in Table 27.6. We begin with the one-factor model because it is the simplest or "compact" model (Judd et al., 1995). Because the models are all nested, we can statistically compare them with each other, testing the relative merits of more complex (i.e., full or augmented) models later. Without going into detail, the model-comparison results show that we can clearly reject Model 1 (one general neuroticism factor only) and Model 2 (two uncorrelated anxiety and depression factors), as both had substantially and significantly higher chi-square values than Model 3, which defines anxiety and depression as two distinct but correlated factors. Model 4, shown in Figure 27.3b, also takes into account the discriminant validity problems of item A4 by allowing it to load on the depression factor; this model fits significantly better than the simpler Model 3, though the change in chi-square was not large in size. These conclusions were also consistent with a wide variety of absolute and relative fit indices, like the Goodness-of-Fit Index (GFI) and the Comparative Fit Index (CFI). Table 27.6 also presents the estimated correlation between the two latent factors, which was .71 in Model 3. This value was, as expected, higher than the simple observed correlation of .58 between the unit-weighted scales and just a tad below the estimate of .74 for the correlation corrected for attenuation due to unreliability.
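For reference, the correction for attenuation divides the observed correlation between two scales by the square root of the product of their reliabilities. The facet reliabilities are not restated in this passage; reliabilities of about .78 per scale, assumed here purely for illustration, reproduce the reported value:

    corrected r = r_xy / sqrt(r_xx * r_yy) = .58 / sqrt(.78 * .78) = .74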
More Complex Models Including External Validity

In a fully developed construct validation program, of course, we would not stop here. Next, one might begin studies of external validity, modeling the relations of these two correlated CFA-based constructs with other measures of anxiety and depression, preferably drawn from other data sources, such as interview-based judgments by clinical psychologists (Jay & John, 2004). Using an MTMM design to address external validity, we would gather evidence about both convergent validity (e.g., self-reported anxiety with measures of anxiety drawn from another data source) and discriminant validity (e.g., self-reported anxiety with measures of depression drawn from another data source). Again, we would use SEM procedures for these additional validation steps, because one can model the measurement structure we have discussed so far, along with a predictive (or convergent) validity relation. Note that this model addresses the criterion problem that seemed so intractable in the early treatments of validity. The criterion itself is not treated as a "gold standard" but is modeled as a construct that must also be measured with fallible observed indicator variables. We should note that the models used to represent trait and method effects in MTMM matrices are considerably more complex than the simple models considered here. For example, McArdle (1996, Fig. 2) provides an elegant model for a more complete representation of the construct validation program.
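To make this concrete, the fragment below sketches how such a two-construct, two-method measurement model might be specified in lavaan-style syntax with the third-party Python package semopy. The variable names, the data file, and the package choice are illustrative assumptions, not the tools used in the chapter:

    import pandas as pd
    import semopy  # third-party SEM package, used here only for illustration

    # Hypothetical columns: self-report and interview-based scores per construct.
    model_desc = """anxiety =~ anx_self + anx_interview
    depression =~ dep_self + dep_interview
    anxiety ~~ depression"""

    df = pd.read_csv("mtmm_scores.csv")   # hypothetical data file
    model = semopy.Model(model_desc)
    model.fit(df)
    print(semopy.calc_stats(model))       # chi-square, CFI, and related fit indices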
TABLE 27.6. Using CFA to Examine Structural Validity: Comparative Model Fit for a General Neuroticism Factor as Well as Anxiety and Depression Facets

Model                                                     Chi-square   df   Delta chi-square   GFI   CFI   Interfactor correlation
1. One general factor: neuroticism                        504.5        54   212.1*             .87   .81   N/A
2. Two uncorrelated factors: anxiety and depression       547.5        54   255.1*             .89   .79   .00
3. Two correlated factors: anxiety and depression         292.4        53   N/A                .92   .90   .71
4. Two correlated factors, plus one cross-loading item    271.9        52   20.5*              .93   .91   .67

Note. Delta chi-square: compared with Model 3; GFI, Goodness-of-Fit Index (Jöreskog & Sörbom, 1981); CFI, Comparative Fit Index (Bentler, 1980); mean loading, mean of the standardized loading of each item on its factor(s).
*p < .05.
Implications of Construct Validation for the Construction of New Measures

So far, we have discussed construct validation as if the measure to be validated already exists. However, construct validation issues arise not only during the evaluation of existing measures but also during each stage of their construction (see Simms & Watson, this volume). Most modern scale construction efforts have adopted, implicitly or explicitly, many of the features of the construct validation program discussed in this chapter. Indeed, much of our presentation here has spelled out the kinds of issues that researchers constructing a new measure must consider. There is no simple formula, but the integrated conception of construct validity and the various validation procedures summarized in this chapter provide a blueprint for the kinds of evidence to be gathered and procedures to be followed.

Scale construction, like measurement more generally, involves theory building and thus requires an iterative process. It begins with (1) generating hypotheses, (2) building a model and plausible alternatives, (3) generating items using construct definitions, generalizability facets, and content validation procedures as guides (for information about item and response formats, see Visser et al., 2000), (4) gathering and analyzing data, (5) confirming and disconfirming the initial models, (6) generating alternative hypotheses leading to (7) revised models, (8) additional and more content-valid items, (9) more data gathering, and so on. The cycle continues until a working model has been established that is "good enough" that the investigator can live with it for now, given the constraints and limits of current research. In other words, scale construction and construct validation go hand in hand; one cannot be separated from the other, and both fundamentally involve theory-building and theory-testing efforts.

Some Final Thoughts

In this chapter we have tried to strike a balance between description and prescription, between "what is" and "what should be" the practice of measurement and construct validation in personality research. We reviewed the traditional reliability coefficients but urged the reader to think instead about facets of generalizability such as time, items, and observers. We railed against some of our pet peeves, such as overreliance on coefficient alpha, articulating its limitations and arguing for a more nuanced understanding of this ubiquitous index. We advocated for a more process-oriented conception of construct validity, suggesting that the validation process deserves the same thoughtful consideration as any other form of theory testing. We illustrated, briefly, the power of no-longer-new SEM techniques to help model measurement error, as well as convergent and discriminant validity.

This chapter has noted some shortcomings of current measurement conventions, practices that ought to be changed. Nonetheless, we are upbeat about the future. Specifically, over the past years we have become persuaded by the logic of comparative model testing; we now see it as the best strategy for evaluating and improving our measurement procedures. We are confident that as a new generation of personality researchers "grows up" using model comparison strategies, and as more of the old dogs among us learn this new trick, comparative model testing will continue to spread and help improve the validity of our measures. And because valid measures are a necessary precondition for good research, everything that we do as scientists comes back, in the end, to the importance of being valid.

Acknowledgments

The preparation of this chapter was supported, in part, by a grant from the Retirement Research Foundation and by a Faculty Research Award from the University of California, Berkeley. We are indebted to Monique Thompson, Josh Eng, and Richard W. Robins for their helpful comments on earlier versions.

Notes

1. Table 27.2 shows that both Pearson and intraclass correlations can be used to index retest stability. Pearson correlations reflect changes only in the relative standing of participants from one time to the other, which is typically the prime concern in research on individual differences. When changes in mean levels or variances are of interest too, then the intraclass correlation is the appropriate index (a brief numerical illustration follows these notes).
2. In one context, this internal consistency conception does not apply. In most psychological measurement, the indicators of a construct are seen as effects caused by the construct; for example, individuals endorse items about liking noisy parties because of underlying individual differences in extraversion. However, as Bollen (1984) noted, constructs such as socioeconomic status (SES) are different. SES indicators, such as education and income, cause changes in SES, rather than SES causing changes in education or income. In these cases of "cause indicators," the indicators are not necessarily correlated and the internal consistency conception does not apply.

3. As an additional, distinct form of validity evidence, Messick (1989) included what he called consequential validity, which addresses the personal and societal consequences of interpreting and using a particular measure in a particular way (e.g., using an ability test to decide on school admissions). It requires the test user to confront issues of test bias and fairness and is of central importance when psychological measures are used to make important decisions about individuals. This type of validity evidence is generally more relevant in applied research and in educational and employment settings than in basic personality research, in which scale scores have little, if any, consequence for the research participant.

4. The mean loadings in Table 27.6 indicate that loadings were substantial in all models, with the general-factor model falling just below the two-factor models. Figure 27.3b shows the final parameter estimates for Model 4; consistent with the discriminant-validity problems apparent in Table 27.4 and Figure 27.4, item A4 had significant loadings on both the anxiety and depression latent factors.
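As a brief numerical illustration of note 1, the sketch below simulates a retest in which every participant gains a constant five points: the Pearson correlation remains near 1.0, while an absolute-agreement intraclass correlation, ICC(A,1), computed here from the usual two-way mean squares, drops noticeably. The simulated numbers are arbitrary and purely illustrative:

    import numpy as np

    def icc_a1(x, y):
        """ICC(A,1): two-way, absolute-agreement, single-measure intraclass correlation."""
        data = np.column_stack([x, y]).astype(float)
        n, k = data.shape
        grand = data.mean()
        row_means = data.mean(axis=1)   # one mean per participant
        col_means = data.mean(axis=0)   # one mean per occasion
        ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
        ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
        resid = data - row_means[:, None] - col_means[None, :] + grand
        ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
        return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

    rng = np.random.default_rng(0)
    time1 = rng.normal(50, 10, size=200)             # simulated first assessment
    time2 = time1 + 5 + rng.normal(0, 2, size=200)   # retest: same ordering, higher mean

    print(np.corrcoef(time1, time2)[0, 1])  # near 1.0: relative standing preserved
    print(icc_a1(time1, time2))             # noticeably lower: mean shift is penalized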
factor models. Figure 27.3b shows the final pa- Springer-Verlag.
rameter estimates for Model 4; consistent with Ayduk, 0., Mendoza~Denton,R., Mschel, W., u,'w,oey, Ca

::l:~O,:O;O)~.;~~1~s~:::~.
the discriminant-validity problems apparent in
Table 27.4 and Figure 27.4, item A4 had signifi- G.,
the Peake, P. K., & self:
interpersonal Rodriguez, M.
Strategic
cant loadings on both the anxiety and depression coping with rejection ser"i,ivity.Jo,·" Ca
latent facror. ity and Social Psycholog)~ 79, 776-792.
Bakeman, R. (2000). Behavioral observation and
ing.ln H. T. Reis & C. M. Judd (Eds.), Ha,ndbook
ReconunendedReadings research methods in social and personality Ca
ogy (pp. 138-159). New York, Cambridge Unliversity
Campbell, D. T., & Fiske, D. W. (1959). Convergent Press.
and discriminant validation by the multitrait- Bartholomew, K., Henderson, A. J. Z., & Marcia, J.
multimethod matrix. Psycbological Bulletin, 56, 81- (2000). Coded semistructured interviews in Ca
105. psychological research. In H. T. Reis & C. M.
Cronbach, 1.]. (1988). Five perspectives on the validiry (Eds.), Handbook of research methods in Ca
argument. In H. Wainer & H. I. Braun (Eds.), Test personality psychology (pp. 286-312). New
validity (pp. 3-17). Hillsdale, Nj, Erlbaum. Cambridge University Press. Co
Cronbach, L.]., & Meehl, P. E. (1955). Construct valid- Bentler, P. M. (1980). Multivariate analysis with
ity in psychological tests. Psychological Bulletin, 52, variables: Causal modeling. Ammal Review of
281-302. chology, 31, 419-456.
Gosling, S. D., Kwan, V. S. 1':., & John, O. P. (2003). A Bentler, P. M. (1990). Comparative fit indices in Co
dog's got personality: A cross~species comparative tural models. PsyclJological Bulletin, 107,238-246.
approach to personality judgments in dogs and hu- Block,]. H., & Block,]. (1980) The role of ego-control
mans. jOl/rnal of Personality and Social Psychology, and ego-resiliency in the organization of behavior. In Co
85,1161-1169. W. A. Collins (Ed.), Development of cognition,
Judd, C. M., & McClelland, G. H. (1998). Measme- (ed, and social relations: The Mimtesota Symposia
ment. In D. T. Gilbert, S. T. Fiske, & G. Lindzey Olt Child Psychology (Vol. 13, pp. 40-101). Cr;
{Eds.}, Halldbook of social psychology (Vol. 2, Hillsdale, NJ: Erlbaum.
pp. 180-232). Boston: McGraw~Hill. Bollen, K. A. (1984). Multiple indicators: Internal con~
Bollen, K. A., & Long, J. S. (Eds.). (1993). Testing structural equation models. Newbury Park, CA: Sage.
Braun, H. I., Jackson, D. N., & Wiley, D. E. (2002). The role of constructs in psychological and educational measurement. Mahwah, NJ: Erlbaum.
Brennan, K. A., Clark, C. L., & Shaver, P. R. (1998). Self-report measurement of adult attachment: An integrative overview. In J. A. Simpson & W. S. Rholes (Eds.), Attachment theory and close relationships (pp. 46-76). New York: Guilford Press.
Brennan, R. L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38, 295-317.
Briggs, D. C. (2004). Comment: Making an argument for design validity before interpretive validity. Measurement: Interdisciplinary Research and Perspectives, 2, 171-191.
Burisch, M. (1984). You don't always get what you pay for: Measuring depression with short and simple versus long and sophisticated scales. Journal of Research in Personality, 18, 81-98.
Burisch, M. (1986). Methods of personality inventory development: A comparative analysis. In A. Angleitner & J. S. Wiggins (Eds.), Personality assessment via questionnaires (pp. 109-120). Berlin: Springer-Verlag.
Buss, D. M., & Craik, K. H. (1983). The act frequency approach to personality. Psychological Review, 90, 105-126.
Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116-131.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Caspi, A., McClay, J., Moffitt, T. E., Mill, J., Martin, J., Craig, I. W., et al. (2002). Role of genotype in the cycle of violence in maltreated children. Science, 297, 851-854.
Cattell, R. B. (1957). Personality and motivation structure and measurement. New York: World Book.
Cattell, R. B. (1972). Personality and mood by questionnaire. San Francisco: Jossey-Bass.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.
Costa, P. T., & McCrae, R. R. (1992). NEO-PI-R: The NEO Personality Inventory. Odessa, FL: Psychological Assessment Resources.
Craik, K. H. (1986). Personality research methods: An historical perspective. Journal of Personality, 54, 18-51.
Cronbach, L. J. (1947). Test "reliability": Its meaning and determination. Psychometrika, 12, 1-16.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J. (1988). Five perspectives on the validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.
Cronbach, L. J., & Gleser, G. C. (1957). Psychological tests and personnel decisions. Urbana: University of Illinois Press.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Alpha coefficients for stratified-parallel tests. Educational and Psychological Measurement, 25, 291-312.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621-694). Washington, DC: American Council on Education.
Dawes, R. M., & Smith, T. L. (1985). Attitude and opinion measurement. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. 1, pp. 509-566). New York: Random House.
Eagly, A. H., & Chaiken, S. (1993). The psychology of attitudes. Fort Worth, TX: Harcourt Brace Jovanovich.
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341-349.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Epstein, S. (1980). The stability of behavior: II. Implications for psychological research. American Psychologist, 35, 790-806.
Fishbein, M., & Ajzen, I. (1974). Attitudes toward objects as predictors of single and multiple behavioral criteria. Psychological Review, 81, 59-74.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286-299.
Fraley, R. C., & Shaver, P. R. (1998). Airport separations: A naturalistic study of adult attachment dynamics in separating couples. Journal of Personality and Social Psychology, 75, 1198-1212.
Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78, 350-365.
Gosling, S. D., Kwan, V. S. Y., & John, O. P. (2003). A dog's got personality: A cross-species comparative approach to personality judgments in dogs and humans. Journal of Personality and Social Psychology, 85, 1161-1169.
Gosling, S. D., Rentfrow, P. J., & Swann, W. B. (2003). A very brief measure of the Big Five personality domains. Journal of Research in Personality, 37, 504-528.
Gough, H. G. (1957). The California Psychological Inventory administrator's guide. Palo Alto, CA: Consulting Psychologists Press.
Gould, S. J. (1981). The mismeasure of man. New York: Norton.
Gray-Little, B., Williams, S. L., & Hancock, T. D. (1997). An item response theory analysis of the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 23, 443-451.
Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79, 1022-1038.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464-1480.
Gross, J. J. (1999). Emotion and emotion regulation. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (2nd ed., pp. 525-552). New York: Guilford Press.
Gross, J. J., John, O. P., & Richards, J. M. (2000). The dissociation of emotion expression from emotion experience: A personality perspective. Personality and Social Psychology Bulletin, 26, 712-726.
Gross, J. J., & Levenson, R. W. (1993). Emotional suppression: Physiology, self-report, and expressive behavior. Journal of Personality and Social Psychology, 64, 970-986.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hamilton, D. L. (1968). Personality attributes associated with extreme response style. Psychological Bulletin, 69, 192-203.
Harker, L., & Keltner, D. (2001). Expressions of positive emotions in women's college yearbook pictures and their relationship to personality and life outcomes across adulthood. Journal of Personality and Social Psychology, 80, 112-124.
Hathaway, S. R., & McKinley, J. C. (1943). The Minnesota Multiphasic Personality Inventory (rev. ed.). Minneapolis: University of Minnesota Press.
Helson, R., & Soto, C. J. (2005). Up and down in middle age: Monotonic and nonmonotonic changes in roles, status, and personality. Journal of Personality and Social Psychology, 89, 194-204.
Hull, J. G., Lehn, D. A., & Tedlie, J. C. (1991). A general approach to testing multifaceted personality constructs. Journal of Personality and Social Psychology, 61, 932-945.
Jay, M., & John, O. P. (2004). A depressive symptom scale for the California Psychological Inventory: Construct validation of the CPI-D. Psychological Assessment, 16, 299-309.
Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: User's guide. Chicago: National Educational Resources.
John, O. P., Cheek, J. M., & Klohnen, E. C. (1996). On the nature of self-monitoring: Construct explication via Q-sort ratings. Journal of Personality and Social Psychology, 71, 763-776.
John, O. P., & Gross, J. J. (2004). Healthy and unhealthy emotion regulation: Personality processes, individual differences, and life span development. Journal of Personality, 72, 1301-1333.
John, O. P., & Gross, J. J. (2007). Individual differences in emotion regulation. In J. J. Gross (Ed.), Handbook of emotion regulation (pp. 351-372). New York: Guilford Press.
John, O. P., Hampson, S. E., & Goldberg, L. R. (1991). The basic level in personality-trait hierarchies: Studies of trait use and accessibility in different contexts. Journal of Personality and Social Psychology, 60, 348-361.
John, O. P., & Robins, R. W. (1993). Determinants of interjudge agreement on personality traits: The Big Five domains, observability, evaluativeness, and the unique perspective of the self. Journal of Personality, 61, 521-551.
John, O. P., & Robins, R. W. (1994). Accuracy and bias in self-perception: Individual differences in self-enhancement and the role of narcissism. Journal of Personality and Social Psychology, 66, 206-219.
John, O. P., & Srivastava, S. (1999). The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (2nd ed., pp. 102-139). New York: Guilford Press.
Judd, C. M., & McClelland, G. H. (1998). Measurement. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), Handbook of social psychology (Vol. 2, pp. 180-232). Boston: McGraw-Hill.
Judd, C. M., McClelland, G. H., & Culhane, S. E. (1995). Data analysis: Continuing issues in the everyday analysis of psychological data. Annual Review of Psychology, 46, 433-465.
Kane, M. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2, 135-170.
Kashy, D. A., & Kenny, D. A. (2000). The analysis of data from dyads and groups. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 451-477). New York: Cambridge University Press.
Kenny, D. A. (1976). An empirical application of confirmatory factor analysis to the multitrait-multimethod matrix. Journal of Experimental Social Psychology, 12, 247-252.
Kenny, D. A. (1994). Interpersonal perception: A social relations analysis. New York: Guilford Press.
Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112, 165-172.
Kerr, N. L., Aronoff, J., & Messé, L. A. (2000). Methods of small group research. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 160-189). New York: Cambridge University Press.
King, J. E., & Figueredo, A. J. (1997). The five-factor model plus dominance in chimpanzee personality. Journal of Research in Personality, 31, 257-271.
Kline, R. B. (2004). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford Press.
Kwan, V. S. Y., John, O. P., Kenny, D. A., Bond, M. H., & Robins, R. W. (2004). Reconceptualizing individual differences in self-enhancement bias: An interpersonal approach. Psychological Review, 111, 94-110.
Loehlin, J. C. (2004). Latent variable models: An introduction to factor, path, and structural analysis (4th ed.). Mahwah, NJ: Erlbaum.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Lord, F. M. (1984). Standard errors of measurement at different ability levels. Journal of Educational Measurement, 21, 239-243.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Macrae, C. N., Bodenhausen, G. V., Milne, A. B., & Jetten, J. (1994). Out of mind but back in sight: Stereotypes on the rebound. Journal of Personality and Social Psychology, 67, 808-817.
Marsh, H. W., & Grayson, D. (1995). Latent variable models of multitrait-multimethod data. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 177-198). Thousand Oaks, CA: Sage.
McArdle, J. J. (1996). Current directions in structural factor analysis. Current Directions in Psychological Science, 5, 11-18.
McCrae, R. R., Herbst, J. H., & Costa, P. T., Jr. (2001). Effects of acquiescence on personality factor structures. In R. Riemann, F. M. Spinath, & F. Ostendorf (Eds.), Personality and temperament: Genetics, evolution, and structure (pp. 217-231). Berlin: Pabst Science Publishers.
Mellenbergh, G. J. (1996). Measurement precision in test score and item response models. Psychological Methods, 1, 293-299.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50, 741-749.
Neuberg, S. L., Judice, T. N., & West, S. G. (1997). What the need for closure scale measures and what it does not: Toward differentiating among related epistemic motives. Journal of Personality and Social Psychology, 72, 1396-1412.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Ozer, D. J. (1989). Construct validity in personality assessment. In D. M. Buss & N. Cantor (Eds.), Personality psychology: Recent trends and emerging directions (pp. 224-234). New York: Springer-Verlag.
Paulhus, D. L. (2002). Socially desirable responding: The evolution of a construct. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 49-69). Mahwah, NJ: Erlbaum.
Paulhus, D. L., Bruce, M. N., & Trapnell, P. D. (1995). Effects of self-presentation strategies on personality profiles and their structure. Personality and Social Psychology Bulletin, 21, 100-108.
Paulhus, D. L., & John, O. P. (1998). Egoistic and moralistic biases in self-perception: The interplay of self-deceptive styles with basic traits and motives. Journal of Personality, 66, 1025-1060.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.
Petty, R. E., & Cacioppo, J. T. (1986). Communication and persuasion: Central and peripheral routes to attitude change. New York: Springer-Verlag.
Rammstedt, B., & John, O. P. (2006). Short version of the Big Five Inventory: Development and validation of an economic inventory for assessment of the five factors of personality. Diagnostica, 51, 195-206.
Rammstedt, B., & John, O. P. (2007). Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41, 203-212.
Reis, H. T., & Gable, S. L. (2000). Event-sampling and other methods for studying everyday experience. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 190-222). New York: Cambridge University Press.
Rezmovic, E. L., & Rezmovic, V. (1981). A confirmatory factor analysis approach to construct validation. Educational and Psychological Measurement, 41, 61-72.
Robins, R. W., Hendin, H. M., & Trzesniewski, K. H. (2001). Measuring global self-esteem: Construct validation of a single-item measure and the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 27, 151-161.
Rosenberg, M. (1979). Conceiving the self. New York: Basic Books.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Hirsch, H. R. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697-798.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350-353.
Schwarzer, R. (1986). Evaluation of convergent and discriminant validity by use of structural equations. In A. Angleitner & J. S. Wiggins (Eds.), Personality assessment via questionnaires (pp. 192-213). Berlin: Springer-Verlag.
Scollon, C. N., Diener, E., Oishi, S., & Biswas-Diener, R. (2004). Emotions across cultures and methods. Journal of Cross-Cultural Psychology, 35, 304-326.
Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44, 922-932.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Smith, C. P. (2000). Content analysis and narrative analysis. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 313-335). New York: Cambridge University Press.
Smith, G. T. (2005a). On the complexity of quantifying construct validity: Reply. Psychological Assessment, 17, 413-414.
Smith, G. T. (2005b). On construct validity: Issues of method and measurement. Psychological Assessment, 17, 396-408.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7, 300-308.
Snyder, M. (1974). Self-monitoring of expressive behavior. Journal of Personality and Social Psychology, 30, 526-537.
Snyder, M. (1987). Public appearances, private realities: The psychology of self-monitoring. New York: Freeman.
Soto, C. J., Gorchoff, S., & John, O. P. (2006). Adult attachment styles and the Big Five: Different constructs or different contexts? Manuscript in preparation.
Soto, C. J., John, O. P., Gosling, S. D., & Potter, J. (2006). Developmental psychometrics: The structural validity of Big-Five personality self-reports in late childhood and adolescence. Unpublished manuscript.
Sternberg, R. J. (1985). Human intelligence: The model is the message. Science, 230, 1111-1118.
Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Taylor, S. E., & Brown, J. (1988). Illusion and well-being: A social psychological perspective on mental health. Psychological Bulletin, 103, 193-210.
Thorndike, R. M. (1997). Measurement and evaluation in psychology and education (6th ed.). Upper Saddle River, NJ: Prentice-Hall.
Tinsley, H. E., & Tinsley, D. J. (1987). Uses of factor analysis in counseling psychology research. Journal of Counseling Psychology, 34, 414-424.
Visser, P. S., Krosnick, J. A., & Lavrakas, P. J. (2000). Survey research. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 223-252). New York: Cambridge University Press.
von Eye, A., & Mun, E. Y. (2005). Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Erlbaum.
Watson, D., Clark, L. A., Weber, K., Assenheimer, J. S., Strauss, M. E., & McCormick, R. A. (1995). Testing a tripartite model: II. Exploring the symptom structure of anxiety and depression in student, adult, and patient samples. Journal of Abnormal Psychology, 104, 15-25.
Watson, D., & McKee Walker, L. (1996). The long-term stability and predictive validity of trait measures of affect. Journal of Personality and Social Psychology, 70, 567-577.
Webb, E. J., Campbell, D. T., Schwartz, R. D., Sechrest, L., & Grove, J. B. (1981). Nonreactive measures in the social sciences (2nd ed.). Boston: Houghton Mifflin.
Wegener, D. T., & Fabrigar, L. R. (2000). Analysis and design for nonexperimental data: Addressing causal and noncausal hypotheses. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 412-450). New York: Cambridge University Press.
West, S. G., & Finch, J. F. (1997). Measurement and analysis issues in the investigation of personality structure. In R. Hogan, J. Johnson, & S. Briggs (Eds.), Handbook of personality psychology (pp. 143-164). San Diego, CA: Academic Press.
Westen, D., & Rosenthal, R. (2003). Quantifying construct validity: Two simple measures. Journal of Personality and Social Psychology, 84, 608-618.
Westen, D., & Rosenthal, R. (2005). Improving construct validity: Cronbach, Meehl, and Neurath's ship. Psychological Assessment, 17, 409-412.
Wiggins, J. S. (1973). Personality and prediction: Principles of personality assessment. Menlo Park, CA: Addison-Wesley.
