RESEARCH REPORT

VALIDITY

Samuel Messick
Educational Testing Service
Preliminary Points 3
Scores in Context 4
Scores as Signs and Samples 5
Sources of Evidence in the Evaluation of Scores 6
References 169
PRELIMINARY POINTS
the terms "score" and "test" are used, they are to be taken in this
generalized sense.
SCORES IN CONTEXT
Thus, one might seek and obtain generality of score meaning across
different contexts, even though the attendant implications for action
may be context-dependent by virtue of interactions operative in the
particular group or setting. If the meaning of scores does change with
the group or setting, the scores may indeed lack validity for uniform
use across the varied circumstances but not necessarily lack validity
for differential use or even validity of meaning in specific contexts.
These issues of generalizability will be addressed in more detail later
once some conceptual machinery has been developed to deal with the
complexities.
evolved over the years. We will then present other ways of cutting
evidence, and of reconfiguring forms of evidence, to highlight major
facets of a unified validity conception.
At least since the early 1950s, test validity has been broken into
three or four distinct types, or, more specifically, into three
types, one of which comprises two subtypes. These are content
validity, predictive and concurrent criterion-related validity, and
construct validity. Paraphrasing slightly statements in the 1954
Technical Recommendations for Psychological Tests and Diagnostic
Techniques (American Psychological Association, 1954) and the 1966
Standards for Educational and Psychological Tests and Manuals (APA,
1966), these three traditional validity types are described as follows:
standards for test use, thus explicitly introducing concern for bias,
adverse impact, and other social consequences of the uses and misuses
of tests.
Table 1

Facets of Validity

                      Test Interpretation    Test Use
Evidential basis      Construct validity     Construct validity + Relevance/utility
Consequential basis   Value implications     Social consequences
These distinctions may seem fuzzy because they are not only
interlinked but overlapping. For example, social consequences
redounding to the testing are a form of evidence, so the evidential
basis referred to here includes all validity evidence except social
consequences, which are highlighted in their own right. Furthermore,
as we shall see, utility is both validity evidence and a value
consequence. Moreover, to interpret a test is to use it, and all other
test uses involve interpretation either explicitly or tacitly. So the
test use referred to here includes all test uses other than test
interpretation per se. The fuzziness (or rather, the messiness)
of these distinctions derives from the fact that we are trying to
cut through what indeed is a unitary concept. Nonetheless, these
distinctions do provide us with a way of speaking about functional
PHILOSOPHICAL CONCEPTIONS
For example, Hanson (1958) maintains that "a theory is not pieced
together from observed phenomena; it is rather what makes it possible
to observe phenomena as being of a certain sort, and as related to
other phenomena" (p. 90). Kuhn (1962) views everyday or "normal"
scientific research as occurring within a "paradigm" or worldview which
comprises a strong network of conceptual, theoretical, instrumental,
and metaphysical commitments. These paradigms determine the nature
of the problems addressed, how phenomena are viewed, what counts as
a solution, and the criteria of acceptability for theories. Paradigms
thus have a normative or valuative component (Suppe, 1977a).
consistencies will be examined. First is the view that both test and
related nontest consistencies are manifestations of real traits or else
are attributable to real stimulus contingencies in the environment.
The second view summarizes test and nontest consistencies in terms of
a set of theoretical constructs and the relationships among them, with
meaning grounded in the theoretical network rather than in the real
world. Finally, in the third viewpoint, test and nontest consistencies
are attributable to real trait or environmental entities but are only
understood in terms of constructs that summarize their empirical
properties in relation to other constructs in a theoretical system
(Messick, 1981a).
double arrows on these dashed lines indicate that constructs are not
only inferred from such consistencies but that in other circumstances
they also imply or predict such consistencies.
assumes that traits and other causal entities exist outside the
theorist's mind; it is constructive-realist because it assumes that
these entities cannot be comprehended directly but must be viewed
through constructions of that mind. By attributing reality
to causal entities but simultaneously requiring a theoretical
construction of observed relationships, this approach aspires to
attain the explanatory richness of the realist position while limiting
metaphysical excesses through rational analysis. At the same time,
constructive-realists hope to retain the predictive and summarizing
Both the Leibnizian and the Lockean approaches are best suited for
attacking well-structured problems, that is, problems for which there
is a sufficiently clear definition or consensus to be amenable to known
methods. For ill-structured problems, where the major issue is to
formulate the nature of the problem itself, or for ostensibly well-
defined problems where the consensual basis is open to question, the
three remaining inquiring systems are more apropos. In a Lockean
system, agreement about problem solution is a signal to terminate
inquiry. But in Kantian, Hegelian, and Singerian systems, agreement
is often a stimulus to probe more deeply.
In this regard, we hold with Cook and Campbell (1979) that the
relativists "have exaggerated the role of comprehensive theory in
scientific advance and have made experimental evidence seem almost
irrelevant. Instead, exploratory experimentation . . . [has]
repeatedly been the source of great scientific advances, providing
the stubborn, dependable, replicable puzzles that have justified
theoretical efforts at solution" (p. 24). Once again, this appears
reminiscent of Shapere's (1977) notion of the function of facts in a
scientific domain. The point is that, since observations and meanings
are differentially theory-laden and theories are differentially value-
laden, appeals to multiple perspectives on meaning and values are
needed to illuminate latent assumptions and action implications in
the measurement of constructs (Messick, 1982b).
Table 2
Taxonomy of Research Strategies from the Interaction
of Inquiring Systems
Processed As
and C. Suppose the first measure taps facets A and B, along with some
irrelevant variance X; the second measure taps B and C with extraneous
variance Y; and the third measure taps A, C, and Z. For example, the
complex construct of general self-concept might embrace multiple facets
such as academic self-concept, social self-concept, and physical self-
concept (Marsh & Shavelson, 1985; Shavelson, Hubner, & Stanton, 1976).
One measure (tapping academic and social self-concept) might be cast in
a self-rating format where the major contaminant is the tendency to
respond in a socially desirable manner; another measure (tapping social
and physical self-concept) cast in a true-false format where extraneous
variance stems from the response set of acquiescence; and a third
measure (tapping academic and physical self-concept) in a seven-point
Likert format of "strongly agree" to "strongly disagree" where
irrelevant variance derives from extremity response set.
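The triangulation logic of this three-measure design can be sketched in a small simulation. This is a minimal sketch, not part of the original text: the facet loadings, contaminant weights, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent facets of the construct (labels from the self-concept example)
A = rng.normal(size=n)  # academic self-concept
B = rng.normal(size=n)  # social self-concept
C = rng.normal(size=n)  # physical self-concept

# Method-specific contaminants, one per instrument
X = rng.normal(size=n)  # social desirability (self-rating format)
Y = rng.normal(size=n)  # acquiescence (true-false format)
Z = rng.normal(size=n)  # extremity response set (Likert format)

# Each measure taps two facets plus its own irrelevant variance
m1 = A + B + 0.7 * X  # academic + social
m2 = B + C + 0.7 * Y  # social + physical
m3 = A + C + 0.7 * Z  # academic + physical

# Each pair of measures shares exactly one facet and no method variance,
# so all three intercorrelations come out moderate and roughly equal
r12 = np.corrcoef(m1, m2)[0, 1]
r13 = np.corrcoef(m1, m3)[0, 1]
r23 = np.corrcoef(m2, m3)[0, 1]
```

Because the irrelevant variances X, Y, and Z are uncorrelated across instruments, whatever the measures share can be attributed to the construct facets rather than to method.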
The two types of threat to construct validity and the role of both
convergent and discriminant evidence in construct validation should be
kept in mind in what follows. These considerations serve as background
At the same time, the very notion of content sampling has been
attacked in some quarters. This is mainly on the grounds that although
there may be a definable domain of content, there is no existing
universe of items or testing conditions. Under such circumstances, how
can one talk about how well the test "samples" the class of situations
or subject matter (Loevinger, 1965)? In point of fact, items are
constructed, not sampled. However, this argument is countered by the
But knowing that the test is an item sample from a circumscribed item
universe merely tells us, at most, that the test measures whatever the
universe measures. And we have no evidence about what that might be,
other than a rule for generating items of a particular type.
But this may not be possible, and content coverage in terms of the
original specifications may appear to be eroded. Such erosion may be
justified on the grounds that the resulting test thereby becomes a
better exemplar of the construct as empirically grounded in domain
structure. This might occur, for example, because the eliminated items
formalized the call for rational scoring models by coining the term
structural fidelity, which refers to "the extent to which structural
relations between test items parallel the structural relations of other
manifestations of the trait being measured" (p. 661). The structural
component of construct validity includes both this fidelity of the
scoring model to the structural characteristics of the construct's
nontest manifestations and the degree of inter-item structure.
Table 3
Method 1 Method 2
Traits A1 B1 C1 A2 B2 C2
Self-report B1 Warmth
C1 Sociability (.90)
tests singly may or may not be the same variance they share with each
other (Wachtel, 1972). By using two or more tests to represent the
construct in nomological or relational studies, one can disentangle
shared variance from unshared variance and discern which aspects of
construct meaning, if any, derive from the shared and unshared parts.
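The logic of using two indicators to separate shared from unshared variance can be illustrated with a small simulation. This is a sketch under assumed loadings; the variable names and weights are hypothetical, not drawn from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Construct factor F plus test-specific (unshared) variance u1, u2
F = rng.normal(size=n)
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
t1 = F + u1
t2 = F + u2

# A nomological correlate that relates to the construct and, by design,
# also to the unshared part of test 1 (e.g., a method artifact)
y = F + 0.6 * u1 + rng.normal(size=n)

r1 = np.corrcoef(t1, y)[0, 1]
r2 = np.corrcoef(t2, y)[0, 1]
# r1 exceeds r2 because part of t1's relation with y rides on variance
# that t1 does not share with t2 -- a discrepancy that would be
# invisible if only a single indicator of the construct were used
```

Comparing the two correlations against an external variable is what flags the unshared component; with one test alone, the construct-relevant and test-specific contributions are confounded.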
is not so much that they are often less rigorous but that they vary
enormously in their degree of rigor.
[Figure: diagram linking a construct to content-relevant consistencies
and to performance regularities across test and nontest behavioral
classes; most labels are illegible in the source.]
sources of validity evidence are not unlimited but fall into a half
dozen or so main forms. The number of forms is arbitrary, to be sure,
1987b). This has consequences for the nature and sequence of processes
involved in item responses and, hence, for the constructs implicated in
test scores. For example, French (1965) demonstrated that the factor
loadings of cognitive tests varied widely as a function of the problem-
solving strategies or styles of the respondents. In a similar vein,
C. H. Frederiksen (1969) found differential performance patterns in verbal
learning as a function of strategy choice, which was in turn shown to
be partly a function of strengths and weaknesses in the respondent's
ability repertoire.
but it also includes for Shulman the generalizability of findings from
assessment settings to educational applications.
Cronbach (1987a), for one, lauds Cook and Campbell (1979) for
giving a proper breadth to the notion of constructs by extending the
construct validity purview from the meaning of measures to the meaning
of cause and effect relationships, for advancing "the larger view that
constructs enter whenever one pins labels on causes or outcomes" (p. 96).
In choosing a construct label, one should strive for consistency
between the trait implications and the evaluative implications of the
name, attempting to capture as closely as possible the essence of the
construct's theoretical import (especially its empirically grounded
import) in terms reflective of its salient value implications. This
may prove difficult, however, because some traits, such as self-
control and self-expression, are open to conflicting value
interpretations. These cases may call for systematic examination of
counterhypotheses about value outcomes, if not to reach convergence
on an interpretation, at least to clarify the basis of the conflict.
Some traits may also imply different value outcomes under different
circumstances. In such cases, differentiated trait labels might be
useful to embody the value distinctions, as in the instance of
"facilitating anxiety" and "debilitating anxiety" (Alpert & Haber,
1960). Rival theories of the construct might also highlight
different value implications, of course, and lead to conflict
between the theories not only in trait interpretation but also in value
interpretation, as in the case of intelligence viewed as fixed capacity
as opposed to experientially developed ability (J. McV. Hunt, 1961).
(p. 23)
Our own metaphors . . . are the ways of speaking which make sense
to us, not just as being meaningful but as being sensible,
significant, to the point. They are basic to semantic explanation
and thereby enter into scientific explanation. If there is bias
here it consists chiefly in the failure to recognize that other
ways of putting things might prove equally effective in carrying
inquiry forward. But that we must choose one way or another, and
thus give our values a role in the scientific enterprise, surely
does not in itself mean that we are methodologically damned.
(p. 384)
The judged degree of fit between the test and the ultimate
educational objectives for covering the subject-matter domain is an
example of what Cureton (1951) calls "logical relevance." This notion
may also be applied to the judged fit between a test and the ultimate
performance objectives in a job domain. An example of test
specifications for a high-school level reading test is given in Table
4. These specifications reflect committee judgments of relevant domain
coverage, recognizing that no specific curriculum may match these
specifications in detail. A remaining step is to specify the number
of items, or the amount of testing time, to be devoted to each row and
column or cell of this grid so that the test is judged to be
representative in its coverage of the domain.
Table "
Commi ttee Specifications for the Reading Test
(National Educational Longi tudina1 Study)
EXPOSITORY
1. Science passage (Geological Clock)
2. Literary passage (Harlem Renaissance)
3. Social Studies passage (Transportation)

PERSUASIVE
4. Editorial

POETRY
5. Poem

NARRATIVE
6. American fiction
7. Biography
8. Fable

[The column headings of the specification grid are illegible in the source.]
Moreover, in some studies, the issue was not only the relevance
of the test items to the immediate objectives of the teacher-training
curriculum but to the more distal objectives or requirements of the job
of beginning teacher. In such cases, job relevance panels composed of
practicing teachers and school administrators were asked to judge,
independently member by member, the degree to which the knowledge and
skills tapped by each test item were relevant to competent performance
as a beginning teacher in the state. To complete the circle, one
might have also asked, of course, for judgments of the relevance of
curriculum topics to the job of beginning teacher. But by at least
judging the relevance of the test to both the curriculum and the job,
one gains an important entry point for appraising the logical relevance
of the teacher-training program to beginning-teacher performance
requirements.
if sanctions are invoked, should assess what students have had a chance
to learn in school.
scoring of the test, and the average of the indexes for the retained
items serves as a kind of content relevance coefficient for the
resultant test scores (Lawshe, 1975). However, the elimination of
some items from the test scoring may erode content representativeness,
so the match of the resultant test to the content domain should be
reappraised. This approach may also be employed in educational
applications to quantify the extent to which each test item (and the
resultant "purified" test) is appropriate to a particular curriculum or
subject-matter domain. These are important first steps in the test
validation process but, as emphasized next, they are only first steps.
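Lawshe's (1975) index can be sketched as follows. The panel size, the rating counts, and the retention threshold here are illustrative assumptions; Lawshe's own critical values for retaining an item depend on panel size.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: +1 when every judge rates the item essential,
    0 when exactly half do, negative below that."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 judges; counts rating each item "essential"
essential_counts = [10, 9, 7, 5, 3]
cvrs = [content_validity_ratio(k, 10) for k in essential_counts]

# Drop items below an (illustrative) threshold, then average the CVRs of
# the retained items as a content-relevance coefficient for the test
retained = [c for c in cvrs if c >= 0.6]
content_validity_index = sum(retained) / len(retained)
```

As the passage notes, the averaging rewards retaining only highly rated items, which is exactly why the match of the purified test to the full content domain needs reappraisal afterward.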
Figure 5. Representation of spurious validity due to correlation
between test scores and criterion contaminants. [Diagram partitioning
systematic test variance; details are illegible in the source.]
Moreover, it would also seem that the study of multiple criteria
and the various dimensions along which they differ should illuminate,
better than the use of a single criterion, the underlying psychological
and behavioral processes involved in domain performance (Dunnette,
1963b; Schmidt & Kaplan, 1971; Wallace, 1965). As a consequence of
such arguments and evidence, Hakel (1986) has declaimed that "we have
reached the point where the 'ultimate criterion' should have been
consigned to the history books. Our preoccupation with it has limited
our thinking and hindered our research" (p. 352). But Hakel is
speaking of the ultimate criterion as a single global measure, not as
the "multiple and complex" conceptual criterion or construct theory
guiding test and criterion measurement and validation.
[Figure: criterion scores plotted against predictor scores,
illustrating the cutting-score trade-off discussed below; labels
largely illegible in the source.]
Given this trade-off between the two kinds of errors as a function
of the cutting score, Taylor and Russell (1939) determined the
increases in the percent of correct predictions of success as the
cutting score increases, given a particular base rate and correlation
coefficient. Rather than finding that predictive efficiency or utility
is a parabolic function of the test-criterion correlation as in the
coefficient of determination or the index of forecasting efficiency,
they noted that the correlation-utility relationship differs for
different selection ratios. They further noted that low correlation
coefficients afford substantial improvement in the prediction of
success for accepted applicants when the selection ratio is low. On
the other hand, if the cutting score is not properly determined, under
some circumstances even a good predictor can prove less useful in
maximizing the total correct decisions than simply predicting solely in
terms of the base rate (Meehl & Rosen, 1955).
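The Taylor-Russell point, that even a modest correlation improves on the base rate when the selection ratio is low, can be checked by simulation. This is a sketch; the correlation, base rate, and selection ratio are illustrative values, not figures from their tables.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
r = 0.30                # assumed test-criterion correlation
base_rate = 0.50        # proportion who would succeed without selection
selection_ratio = 0.10  # proportion of applicants accepted

# Bivariate-normal predictor and criterion with correlation r
pred = rng.normal(size=n)
crit = r * pred + np.sqrt(1 - r**2) * rng.normal(size=n)

success = crit > np.quantile(crit, 1 - base_rate)
selected = pred > np.quantile(pred, 1 - selection_ratio)

# Success rate among those selected exceeds the base rate; the gain
# grows as the selection ratio falls, even for a modest correlation
hit_rate = success[selected].mean()
```

Varying `selection_ratio` in this sketch reproduces the qualitative pattern described above: utility depends jointly on the correlation, the base rate, and the selection ratio, not on the correlation alone.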
VALIDITY GENERALIZATION
It is not that the ends should not justify the means, for what
else could possibly justify the means if not appraisal of the ends?
Rather, it is that the intended ends do not provide sufficient
justification, especially if adverse consequences of testing are linked
to sources of test invalidity. Even if adverse testing consequences
derive from valid test interpretation and use, the appraisal of the
functional worth of the testing in pursuit of the intended ends should
take into account all of the ends, both intended and unintended, that
are advanced by the testing application, including not only individual
and institutional effects but societal or systemic effects as well.
Thus, although appraisal of the intended ends of testing is a matter of
social policy, it is not only a matter of policy formulation but also
of policy evaluation that weighs all of the outcomes and side effects
of policy implementation by means of test scores. Such evaluation of
the consequences and side effects of testing is a key aspect of the
validation of test use.
Pitting the proposed test use against alternative uses. There are
few prescriptions for how to proceed in the systematic evaluation of
testing consequences. However, just as the use of counterhypotheses
was suggested as an efficient means of directing attention to
vulnerabilities in a construct theory, so the use of counterproposals
might serve to direct attention efficiently to vulnerabilities
in a proposed test use. The approach follows in the spirit of
counterplanning in systems analysis, in which the ramifications of one
proposal or plan are explored in depth in relation to a dramatically
different or even antagonistic proposal (Churchman, 1968). The purpose
of the counterproposal, as in Kantian, Hegelian, and Singerian modes
of inquiry, is to provide a context of debate exposing the key
technical and value assumptions of the original proposal to critical
evaluation.
11. as equals.
The test user is also in the best position to evaluate the meaning
of individual scores under the specific circumstances, that is, to
appraise the construct validity of individual score interpretation and
the extent to which the intended score meaning may have been eroded by
contaminating variables operating locally. For example, verbal
intelligence scores for Hispanic students should be interpreted in
light of their English-language proficiency, and test scores for
inattentive children and those with behavior problems should be
interpreted with care and checked against alternative indicators
(Heller et al., 1982).
But the test user cannot be the sole arbiter of the ethics of
assessment, because the value of measurements is as much a scientific
and professional validity issue as the meaning of measurements.
One implication of this stance is that the published research and
interpretive test literature should not merely provide the interpreter
or user with some facts and concepts, but with some exposition of the
critical value contexts in which the facts are embedded and with a
provisional accounting of the potential social consequences of
alternative test uses (Messick, 1981b). The test user should also be
explicitly alerted to the dangers of potential test misuse, because
test misuse is a significant source of invalidity. That is, just as
the use of test scores invalid for the specific purpose is misuse, so
the misuse of otherwise valid scores leads to invalidity in practice.
[Figure: flow diagram relating construct, evidence, score
interpretation, potential consequences, test use, and decision; most
labels are illegible in the source.]
Breland. H. M•• camp. R.• Jones. R. J., M:>rris. M. M.• & Rock. D. A.
(1987). Assessing writing skill (Research M:>nograph # 11). New
York: College Entrance Examination Board.
Cole, M., 6'c Bruner, J. S. (1971). CuI tural differences and inferences
about psychological processes. American PSychologist, 26, 867-876.
Cole, M., Gay, J., Glick, J., 6'c Sharp, D. (1971). The cultural
context of learning and thlnking. New York: Basic Books.
Cole, M., Hood, L., 6'c McDernott, R. (1978) . Ecological niche picking
as an axiom of experimental cognitive psyr.hclogy. san Diego, CA:
Laboratory for Comparative Human Cognition, Universi tv of
California, san Diego.
Cook, T. D., &: Shadish, W. R., Jr. ( 1986). Program evaluation: 11'.e
worldly science. Annual Review of Psyr;:bology, 37, 193-232.
Dehn, N., & Schank, R. (1982). Artificial and human intelligence. I:1
R. J. Sternberg (EeL), Handbook of hwnan intelligence (w. 352-391).
New York: cambridge University Press.
Drasgcw, F., & Kang, T. ( 1984) . Statistical power of d.i ff'er'ent La.,
validity and differential prediction analysis for detecting
measurement nonequivalence. Journal of Applied Psychology, 69,
498-508.
Haney, W., & Madaus, G. (1988). The evolution of ethical and technical
standards for testing. In R. K. Hambleton & J. zaal ( Eds. ) ,
Handbook of testing. Amsterdam: North Holla'1d Press.
Hunt. E., Lunneborg, C. E., & Lewis, J. (1975). What does it mean to
be high verbal? Cognitive PSychology, 7, 194-227.
Lansman. M., Donaldson, G., Hunt, E., &: Yantis, S. (1982). Ability
factors and cognitive processes. Intelligence, 6, 347-386.
Millman, J. (1973). Passing scores and test lengths for doma in-
referenced measures. Review of Educational Research, 43, 205-216.
Murnane, R. J., News t ead , S., & Olsen, R. J. (1985). CcmparL~g p~lic
and private schools: 111e puzzling role of selectivity b i as .
Jcurris I of Business and Economic Statistics, 3. 23-35.
ncrts, C ..e.,
~~ ~ .Joreskog , K
T
. .G, ~..
!oJ.lI111, R·
• !oJ.
0(
11973)
\ • ...l.oCn
T~~ .~~~ '-_.. _ca t"_011.
and estimation in path analysis with unmeasured variables.
American Journal of Sociology, 78, 1469-1484.