
Standards of Validity and the Validity of Standards in Performance Assessment

Samuel Messick
Educational Testing Service
What are six distinct aspects of construct validation? How do these aspects apply to performance assessment? Are the consequences of performance assessment on teaching and learning relevant to construct validation?

Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment (Messick, 1989). Validity is not a property of the test or assessment as such, but rather of the meaning of the test scores. These scores are a function not only of the items or stimulus conditions but also of the persons responding as well as the context of the assessment. In particular, what needs to be valid is the meaning or interpretation of the scores as well as any implications for action that this meaning entails (Cronbach, 1971). The extent to which score meaning and action implications hold across persons or population groups and across settings or contexts is a persistent and perennial empirical question. This is the main reason that validity is an evolving property and validation a continuing process.

The Value of Validity

The principles of validity apply to all assessments, whether based on tests, questionnaires, behavioral observations, work samples, or whatever. These include performance assessments which, because of their promise of positive consequences for teaching and learning, are becoming increasingly popular as purported instruments of standards-based education reform. Indeed, it is precisely because of these politically salient potential consequences that the validity of performance assessment needs to be systematically addressed, as do other basic measurement issues such as reliability, comparability, and fairness.

These issues are critical for performance assessment because validity, reliability, comparability, and fairness are not just measurement principles; they are social values that have meaning and force whenever evaluative judgments and decisions are made. As a salient social value, validity assumes both a scientific and a political role that can by no means be fulfilled by a simple correlation coefficient between test scores and a purported criterion (i.e., classical criterion validity) or by expert judgments that test content is relevant to the proposed test use (i.e., traditional content validity).
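For reference, the classical criterion validity coefficient that the passage deems insufficient is nothing more than the product-moment correlation between test scores X and criterion scores Y:

\[
r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
\]

A single coefficient of this kind can summarize predictive association with one criterion, but it carries no information about the meaning of the scores or the consequences of their use, which is precisely why it cannot bear the scientific and political weight just described.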
Indeed, broadly speaking, validity is nothing less than an evaluative summary of both the evidence for and the actual as well as potential consequences of score interpretation and use (i.e., construct validity conceived comprehensively). This comprehensive view of validity integrates considerations of content, criteria, and consequences into a construct framework for empirically testing rational hypotheses about score meaning and utility. Fundamentally, then, score validation is empirical evaluation of the meaning and consequences of measurement. As such, validation combines scientific inquiry with rational argument to justify (or nullify) score interpretation and use.

Aspects of Construct Validity

The validity issues of score meaning, relevance, utility, and social consequences are many-faceted and intertwined. They are difficult if not impossible to disentangle, which is why validity has come to be viewed as a unified concept (APA, AERA, & NCME, 1985; Messick, 1989). However, to speak of validity as a unified concept does not imply that validity cannot be usefully differentiated into distinct aspects to underscore issues and nuances that might otherwise be downplayed or overlooked, such as the social consequences of performance assessments or the role of score meaning in applied use. The intent of these distinctions is to provide a means of addressing functional aspects of validity that help disentangle some of the complexities inherent in appraising the appropriateness, meaningfulness, and usefulness of score inferences.

In particular, six distinguishable validity aspects are delineated, emphasizing content, substantive, structural, generalizability, external, and consequential aspects of construct validity (Messick, 1994, in press). In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement (Messick, 1989). Following a capsule description of these six aspects, I highlight some of the validity issues and sources of evidence bearing on each:

• The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality (Lennon, 1956; Messick, 1989).

• The substantive aspect refers to theoretical rationales for the observed consistencies in test responses, including process models of task performance (Embretson, 1983), along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks.

• The structural aspect appraises the extent to which the internal structure of the assessment reflected in the scores, including scoring rubrics as well as the underlying dimensional structure of the assessment tasks, is consistent with the structure of the construct domain at issue (Loevinger, 1957).

• The generalizability aspect examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks (Cook & Campbell, 1979; Shulman, 1970), including validity generalization of test-criterion relationships (Hunter, Schmidt, & Jackson, 1982).

• The external aspect includes convergent and discriminant evidence from multitrait-multimethod comparisons (Campbell & Fiske, 1959), as well as evidence of criterion relevance and applied utility (Cronbach & Gleser, 1965).

• The consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness, and distributive justice (Messick, 1980, 1989).

Content Relevance and Representativeness

A key issue for the content aspect of construct validity is the specification of the boundaries of the construct domain to be assessed, that is, determining the knowledge, skills, and other attributes to be revealed by the assessment tasks. The boundaries and structure of the construct domain can be addressed by means of job analysis, task analysis, curriculum analysis, and especially domain theory, that is, scientific inquiry into the nature of the domain processes and the ways in which they combine to produce effects or outcomes. A major goal of domain theory is to understand the construct-relevant sources of task difficulty, which then serves as a guide to the rational development and scoring of performance tasks. For an example of how domain theory can inform both test construction and validation, see Kirsch, Jungeblut, and Mosenthal (in press). At whatever stage of its development, then, domain theory is a primary basis for specifying the boundaries and structure of the construct to be assessed.

However, it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense. The intent is to ensure that all important parts of the construct domain are covered, which is usually described as selecting tasks that sample domain processes in terms of their functional importance. Both the content relevance and representativeness of assessment tasks are traditionally appraised by expert professional judgment, documentation of which serves to address the content aspect of construct validity.

In standards-based education reform, two types of assessment standards have been distinguished. One type is called content standards, which refers to the kinds of things a student should know and be able to do in a subject area. The other type is called performance standards, which refers to the level of competence a student should attain at key stages of developing expertise in the knowledge and skills specified by the content standards. Performance standards also circumscribe, either explicitly or tacitly, the form or forms of performance that are appropriate to be evaluated against the standards.

From the discussion thus far, it should be clear that not only the assessment tasks but also the content standards themselves should be relevant and representative of the construct domain. That is, the content standards should be consistent with domain theory and be reflective of the structure of the construct domain. This is the issue of the construct validity of content standards. There is also a related issue of the construct validity of performance standards. That is, increasing achievement levels or performance standards (as well as the tasks that benchmark these levels) should reflect increases in complexity of the construct under scrutiny and not increasing sources of construct-irrelevant difficulty. These and other issues related to standards-based assessment will be discussed in the subsequent article on standard setting.

Substantive Theories, Process Models, and Process Engagement

The substantive aspect of construct validity emphasizes two important points: One is the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content; the other is the need to move beyond traditional professional judgment of content to accrue empirical evidence that the ostensibly sampled processes are actually engaged by respondents in task performance. Thus, the substantive aspect adds to the content aspect of construct validity the need for empirical evidence of response consistencies or performance regularities reflective of domain processes (Embretson, 1983; Loevinger, 1957; Messick, 1989).

Scoring Models as Reflective of Task and Domain Structure

According to the structural aspect of construct validity, scoring models should be rationally consistent with what is known about the structural relations inherent in behavioral manifestations of the construct in question (Loevinger, 1957; Peak, 1953). That is, the theory of the construct domain should guide not only the selection or construction of relevant assessment tasks but also the rational development of construct-based scoring criteria and rubrics. Ideally, the manner in which behavioral instances are combined to produce a score should rest on knowledge of how the processes underlying those behaviors combine dynamically to produce effects. Thus, the internal structure of the assessment (i.e., interrelations among the scored aspects of task and subtask performance) should be consistent with what is known about the internal structure of the construct domain (Messick, 1989). This relation of the assessment structure to the domain structure has been called structural fidelity (Loevinger, 1957).
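As a hedged illustration (not drawn from the article itself), suppose domain theory holds that the component processes contribute additively to overall proficiency. Structural fidelity then suggests a scoring model of the same additive form, for example a weighted composite of the scored subtask performances x_j:

\[
X = \sum_{j=1}^{k} w_j \, x_j
\]

with weights w_j chosen to mirror the functional importance of the corresponding domain processes. By the same logic, a domain theory positing configural or multiplicative combination of processes would call for a correspondingly nonadditive scoring rule rather than a simple sum.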
To the extent that different assessments (i.e., those involving different tasks or different settings or both) are geared to the same construct domain, using the same scoring model as well as scoring criteria and rubrics, the resultant scores are likely to be comparable or can be rendered comparable using equating procedures. Otherwise, score comparability is jeopardized but can be variously approximated using such techniques as statistical or social moderation (Mislevy, 1992). Score comparability is clearly important for normative or accountability purposes whenever individuals or groups are being ranked. However, score comparability is also important even when individuals are not being directly compared, but are held to a common standard. Score comparability of some type is needed to sustain the claim that two individual performances in some sense meet the same local, regional, national, or international standard. These issues are addressed more fully in the subsequent article on comparability.
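As one concrete instance of such statistical linking (a standard technique, offered here only for illustration), linear equating aligns the first two moments of the score distributions of two assessments X and Y in a common population:

\[
y^{*} = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)
\]

so that a score x on one assessment is mapped to the score y* occupying the same standardized location on the other. Statistical and social moderation relax even these distributional requirements, which is why they only approximate, rather than guarantee, comparability.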
Generalizability and the Boundaries of Score Meaning

The concern that a performance assessment should provide representative coverage of the content and processes of the construct domain is meant to ensure that the score interpretation not be limited to the sample of assessed tasks but be generalizable to the construct domain more broadly. Evidence of such generalizability depends on the degree of correlation of the assessed tasks with other tasks representing the construct or aspects of the construct. This issue of generalizability of score inferences across tasks and contexts goes to the very heart of score meaning. Indeed, setting the boundaries of score meaning is precisely what generalizability evidence is meant to address.

However, because of the extensive time required for the typical performance task, there is a conflict in performance assessment between time-intensive depth of examination and the breadth of domain coverage needed for generalizability of construct interpretation. This conflict between depth and breadth of coverage is often viewed as entailing a trade-off between validity and reliability (or generalizability). It might better be depicted as a trade-off between the valid description of the specifics of a complex task performance and the power of construct interpretation. In any event, as Wiggins (1993) stipulates, such a conflict signals a design problem that needs to be carefully negotiated in performance assessment.

In addition to generalizability across tasks, the limits of score meaning are also affected by the degree of generalizability across time or occasions and across observers or raters of the task performance. Such sources of measurement error associated with the sampling of tasks, occasions, and scorers underlie traditional reliability concerns; they are examined in more detail in the subsequent article on generalizability.
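To make these error sources concrete, consider a hedged sketch in generalizability-theory notation (a standard framework consistent with, though not spelled out in, this article): if each of n_t tasks is judged by n_r raters in a fully crossed design, the generalizability coefficient for relative decisions about persons p is

\[
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pt}/n_t + \sigma^{2}_{pr}/n_r + \sigma^{2}_{ptr,e}/(n_t n_r)}
\]

where the extra terms in the denominator are the person-by-task, person-by-rater, and residual error variance components. In performance assessment the person-by-task component is typically the dominant one, so adding tasks usually buys more generalizability than adding raters, one quantitative face of the depth-versus-breadth trade-off just described.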
Convergent and Discriminant Correlations With External Variables

The external aspect of construct validity refers to the extent to which the assessment scores' relationships with other measures and nonassessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed. Thus, the meaning of the scores is substantiated externally by appraising the degree to which empirical relationships with other measures, or the lack thereof, is consistent with that meaning. That is, the constructs represented in the assessment should rationally account for the external pattern of correlations.

Of special importance among these external relationships are those between the assessment scores and criterion measures pertinent to selection, placement, licensure, program evaluation, or other accountability purposes in applied settings. Once again, the construct theory points to the relevance of potential relationships between the assessment scores and criterion measures, and empirical evidence of such links attests to the utility of the scores for the applied purpose.
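In the multitrait-multimethod logic of Campbell and Fiske (1959) cited earlier, these expectations take a simple ordinal form. Writing r(T_i M_a, T_j M_b) for the correlation between trait i measured by method a and trait j measured by method b, convergent and discriminant evidence requires, roughly, that for i ≠ j

\[
r(T_iM_a, T_iM_b) > r(T_iM_a, T_jM_b)
\qquad\text{and}\qquad
r(T_iM_a, T_iM_b) > r(T_iM_a, T_jM_a)
\]

that is, the same construct assessed by different methods (say, writing proficiency assessed by portfolio and by on-demand essay, a hypothetical pairing) should correlate more strongly than different constructs, whether or not those constructs share a method. Failures of the second inequality flag shared method variance, a construct-irrelevant contaminant of score meaning.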
Consequences as Validity Evidence

Because performance assessments promise potential benefits for teaching and learning, it is important to accrue evidence of such positive consequences as well as evidence that adverse consequences are minimal. In this connection, the consequential aspect of construct validity includes evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use in both the short- and long-term, especially those associated with bias in scoring and interpretation or with unfairness in test use. However, this form of evidence should not be viewed in isolation as a separate type of validity, say, of consequential validity. Rather, because the values served in the intended and unintended outcomes of test interpretation and use both derive from and contribute to the meaning of the test scores, appraisal of social consequences of the testing is also seen to be subsumed as an aspect of construct validity (Messick, 1980).

The primary measurement concern with respect to adverse consequences is that any negative impact on individuals or groups should not derive from any source of test invalidity such as construct underrepresentation or construct-irrelevant variance (Messick, 1989, in press). That is, low scores should not occur because the assessment is missing something relevant to the focal construct that, if present, would have permitted the affected students to display their competence. Moreover, low scores should not occur because the measurement contains something irrelevant that interferes with the affected students' demonstration of competence. Positive and negative consequences of assessment, whether intended or unintended, are discussed in more depth in the subsequent article on fairness.

Validity as Integrative Summary

These six aspects of construct validity apply to all educational and psychological measurement, including performance assessments. Taken together, they provide a way of addressing the multiple and interrelated validity questions that need to be answered in justifying score interpretation and use. They are highlighted because most score-based interpretations and action inferences, as well as the elaborated rationales or arguments that attempt to legitimize them (Kane, 1992), either invoke these properties or assume them, explicitly or tacitly. That is, most score interpretations refer to relevant content and operative processes, presumed to be reflected in scores that concatenate responses in domain-appropriate ways and are generalizable across a range of tasks, settings, and occasions. Furthermore, score-based interpretations and actions are typically extrapolated beyond the test context on the basis of documented or presumed relationships with nontest behaviors and anticipated outcomes or consequences. The challenge in test validation is to link these inferences to convergent evidence supporting them as well as to discriminant evidence discounting plausible rival inferences. Evidence pertinent to all of these aspects needs to be integrated into an overall validity judgment to sustain score inferences and their action implications, or else provide compelling reasons why not, which is what is meant by validity as a unified concept.

References

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana, IL: University of Illinois Press.

Embretson (Whitely), S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Advanced meta-analysis: Quantitative methods of cumulating research findings across studies. San Francisco: Sage.

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.

Kirsch, I. S., Jungeblut, A., & Mosenthal, P. B. (in press). Moving toward the measurement of adult literacy (Technical report on the 1992 National Adult Literacy Survey). Washington, DC: U.S. Government Printing Office.

Lennon, R. T. (1956). Assumptions underlying the use of content validity. Educational and Psychological Measurement, 16, 294-304.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694 (Monograph Supplement 9).

Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.

Messick, S. (in press). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist.

Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: ETS Policy Information Center.

Peak, H. (1953). Problems of observation. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences (pp. 243-299). Hinsdale, IL: Dryden.

Shulman, L. S. (1970). Reconstruction of educational research. Review of Educational Research, 40, 371-396.

Wiggins, G. (1993). Assessment: Authenticity, context, and validity. Phi Delta Kappan, 75, 200-214.

Samuel Messick is a Vice President for Research at Educational Testing Service, Rosedale Rd., Princeton, NJ 08541. His specializations are validity and educational and psychological measurement.


