Can There Be Validity Without Reliability?
PAMELA A. MOSS

Reliability has traditionally been taken for granted as a necessary but insufficient condition for validity in assessment use. My purpose in this article is to illuminate and challenge this presumption by exploring a dialectic between psychometric and hermeneutic approaches to drawing and warranting interpretations of human products or performances. Reliability, as it is typically defined and operationalized in the measurement literature (e.g., American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 1985; Feldt & Brennan, 1989), privileges standardized forms of assessment. By considering hermeneutic alternatives for serving the important epistemological and ethical purposes that reliability serves, we expand the range of viable high-stakes assessment practices to include those that honor the purposes that students bring to their work and the contextualized judgments of teachers.

Educational Researcher, Vol. 23, No. 2, pp. 5-12.

Some time ago, I submitted to a journal a manuscript in which my coauthors and I argued for the value of teachers' contextualized judgments in making consequential decisions about individual students and educational programs. Drawing on epistemological strategies typically used by qualitative or interpretive researchers, we offered an example of how teachers' narrative evaluations of their students' collected work, which varied in substance from student to student and classroom to classroom, might be warranted and used for accountability purposes. We based our argument for the value of this sort of contextualized assessment on the unique quality of information it might provide when used in conjunction with more standardized forms of assessment and on the educational benefits it might have for teachers and students. We warranted the narrative evaluation, in part, in critical dialogue among readers about the multidimensional evidence contained in students' folders and, in part, in documentation of evidence allowing subsequent readers of the report to "audit" or confirm the conclusions for themselves.

Reviewer B thought our manuscript was a "superb and important article" and gave it her "highest endorsement." Reviewer A thought the manuscript should not be published in its current form because we had "confused the purpose of assessment with that of instruction" and had "failed to establish reliability" (in this case, adequate consistency among independent readings). She commented that our argument showed a lack of understanding of the essential function of reliability, not only in service of validity, but also "for fairness to the student to prevent the subjectivity and potential bias of an individual teacher." The editor, faced with the dilemma of divergent opinions, wrote that he would be willing to publish an article based on the manuscript if it dealt effectively with the concerns of Reviewer A. He commented, diplomatically, that he feared our position "might be misread as a rejection of a fundamental measurement principle." He noted that "any measurement should have adequate reliability for its purposes, otherwise it is not good measurement, regardless of its positive features."

There is an instructive irony embedded in this anecdote. The process by which a working decision was reached regarding our manuscript was based in an epistemology that more closely resembled the one we had proposed than the one against which our manuscript was evaluated. The editor's decision was not grounded in the consistency among independent readings, which diverged substantially; rather, he made a thoughtful judgment based upon a careful reading of both sets of comments and his own evaluation of the manuscript. I am confident that he was concerned with the validity and fairness of his decision. Of course, I didn't agree with his initial decision, but our dialogue continued through the mail, and the paper improved (and was published) as we strengthened our argument in response to his concerns. I am also confident that both the readers of the journal and I were well served by the written dialogue that accompanied what, for me, was a "high-stakes" decision.

My purpose in this article is to illuminate and challenge the presumption that reliability, as it's typically defined and operationalized in the professional measurement literature (e.g., AERA et al., 1985; Feldt & Brennan, 1989), is essential to sound assessment practice; in doing this, I give particular attention to the context of accountability in public education. I explore a dialectic between two diverse approaches to drawing and warranting interpretations of human products and performances--one based in psychometrics and one in hermeneutics. This task, I believe, honors Messick's (1989) proposed "Singerian" mode of inquiry in validity research, where one inquiring system is observed in terms of another inquiring system "to elucidate the distinctive technical and value assumptions underlying each system application and to integrate the scientific and ethical implications of the inquiry" (p. 32). My point is not to overturn a traditional criterion but rather to suggest that it be treated as only one of several possible strategies of serving important epistemological and ethical purposes. The choice among reliability and its alternatives has consequences for stakeholders in the educational system. That choice should not be taken for granted or treated as nonproblematic.

PAMELA A. MOSS is assistant professor, University of Michigan, 4220 School of Education, 610 East University, Ann Arbor, MI 48109-1259. She specializes in educational measurement and evaluation.

Concerns With the Traditional View of Reliability

"Without reliability, there is no validity." Many of us who develop and use educational assessments were taught to take this maxim for granted as a fundamental principle of sound measurement. The Standards for Educational and Psychological Testing (AERA et al., 1985), along with most major measurement texts (e.g., Crocker & Algina, 1986; Cronbach, 1990), present reliability as a necessary, albeit insufficient, condition for validity. Theoretically, reliability is defined as "the degree to which test scores are free from errors of measurement. ... Measurement errors reduce the reliability (and therefore the generalizability) of the score obtained for a person from a single measurement" (AERA et al., 1985, p. 19). Typically, reliability is operationalized by examining consistency, quantitatively defined, among independent observations or sets of observations that are intended as interchangeable--consistency among independent evaluations or readings of a performance, consistency among performances in response to independent tasks, and so on. In fact, Feldt and Brennan (1989) describe the "essence" of reliability analysis as the "quantification of the consistency and inconsistency in examinee performance" (p. 105). In this article, I focus primarily on issues of reliability or generalizability across tasks (products or performances by the person or persons about whom conclusions are drawn) and across readers (interpreters or evaluators of those performances).
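
To make the operational definition concrete, the sketch below is a minimal illustration (not drawn from the article) of how consistency among independent observations is typically quantified: once for two independent readers of the same performances and once for the same students across two tasks. The 1-6 scale and all ratings are invented for illustration only.

```python
# Minimal sketch: quantifying "consistency among independent observations".
# All ratings are hypothetical; requires Python 3.10+ for statistics.correlation.
from statistics import correlation

reader_a = [4, 5, 3, 6, 2, 4, 5, 3]  # Reader A's holistic scores, Task 1
reader_b = [4, 4, 3, 6, 3, 4, 5, 2]  # Reader B's independent scores, Task 1
task_2   = [3, 5, 2, 4, 2, 5, 3, 3]  # The same students' scores on a second task

# Inter-reader consistency: do independent readers order students similarly?
reader_r = correlation(reader_a, reader_b)

# Exact agreement: how often do the two readers assign the identical score?
agreement = sum(a == b for a, b in zip(reader_a, reader_b)) / len(reader_a)

# Across-task ("score") consistency: do students perform similarly across tasks?
task_r = correlation(reader_a, task_2)

print(f"inter-reader r = {reader_r:.2f}, exact agreement = {agreement:.0%}")
print(f"across-task r = {task_r:.2f}")
```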
Less standardized forms of assessment, such as performance assessments, present serious problems for reliability, in terms of generalizability across readers and tasks as well as across other facets of measurement. These less standardized assessments typically permit students substantial latitude in interpreting, responding to, and perhaps designing tasks; they result in fewer independent responses, each of which is more complex, reflecting integration of multiple skills and knowledge; and they require expert judgment for evaluation. Empirical studies of reliability or generalizability with performance assessments are quite consistent in their conclusions that (a) reader reliability, defined as consistency of evaluation across readers on a given task, can reach acceptable levels when carefully trained readers evaluate responses to one task at a time and (b) adequate task or "score" reliability, defined as consistency in performances across tasks intended to address the same capabilities, is far more difficult to achieve (e.g., Breland, Camp, Jones, Morris, & Rock, 1987; Dunbar, Koretz, & Hoover, 1991; Shavelson, Baxter, & Gao, 1993). In the case of portfolios, where the tasks may vary substantially from student to student and where multiple tasks may be evaluated simultaneously, inter-reader reliability may drop below acceptable levels for consequential decisions about individuals or programs (Koretz, McCaffrey, Klein, Bell, & Stecher, 1992; Nystrand, Cohen, & Martinez, 1993).¹
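
Note 1 reports coefficients from these studies after adjustment via the Spearman-Brown formula to reflect an assessment based on a single reader and a single sample of performance. A minimal sketch of that adjustment follows; the example coefficient is invented for illustration and is not taken from the studies cited.

```python
# Minimal sketch of the Spearman-Brown adjustment mentioned in Note 1:
# project a reliability observed for from_k parallel readers (or tasks)
# to the reliability expected for to_k of them.
def spearman_brown(rho: float, from_k: float, to_k: float) -> float:
    factor = to_k / from_k
    return factor * rho / (1 + (factor - 1) * rho)

# A hypothetical composite reliability of .80 based on two independent readers
# implies a single-reader reliability of about .67 ...
print(round(spearman_brown(0.80, from_k=2, to_k=1), 2))  # 0.67
# ... and a projected reliability of about .89 if four readers were averaged.
print(round(spearman_brown(0.80, from_k=2, to_k=4), 2))  # 0.89
```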
Validity researchers in performance assessment, building on the pioneering work of Messick (1964, 1975, 1980, 1989) and Cronbach (1980, 1988) that expanded the definition of validity to include consideration of social consequences, have stressed the importance of balancing concerns about reliability, replicability, or generalizability with additional criteria such as "authenticity" (Newmann, 1990), "directness" (Frederiksen & Collins, 1989), or "cognitive complexity" (Linn, Baker, & Dunbar, 1991). This balancing of often competing concerns has resulted in the sanctioning of lower levels of reliability, as long as "acceptable levels are achieved for particular purposes of assessment" (Linn et al., 1991, p. 11; see Messick, 1992, and Moss, 1992, for a review). Where acceptable levels have not been reached, recommendations for enhancing reliability without increasing the number of tasks or readers beyond cost-efficient levels have typically involved (a) increasing the specification of tasks or scoring procedures, thereby resulting in increased standardization, and (b), in the case of portfolios, disaggregating the contents so that tasks may be scored, independently, one task at a time. Wiley and Haertel (in press) offer a promising means of addressing task reliability without the constraining assumption of homogeneity of tasks. As part of a comprehensive assessment development process, they suggest carefully analyzing assessment tasks to describe the capabilities required for performance, scoring tasks separately for the relevant capabilities, and examining reliability within capability across tasks to which the capability applies. While this supports the use of complex and authentic tasks that may naturally vary in terms of the capabilities elicited, it still requires detailed specification of measurement intents, performance records, and scoring criteria. So although growing attention to the consequences of assessment use in validity research provides theoretical support for the move toward less standardized assessment, continued reliance on reliability, defined as quantification of consistency among independent observations, requires a significant level of standardization.

Given the growing body of evidence about the impact of high-stakes assessment on educational practice (Corbett & Wilson, 1991; Johnston, Weiss, & Afflerbach, 1990; Smith, 1991), this privileging of standardization is problematic. As Resnick and Resnick (1992) conclude, to the extent that assessment results "are made visible and have consequences" (p. 55), efforts to improve performance on a given assessment "seem to drive out most other educational concerns" (p. 58). There are certain intellectual activities that standardized assessments can neither document nor promote; these include encouraging students to find their own purposes for reading and writing, encouraging teachers to make informed instructional decisions consistent with the needs of individual students, and encouraging students and teachers to collaborate in developing criteria and standards to evaluate their work. A growing number of educators are calling for alternative approaches to assessment that support collaborative inquiry and foreground the development of purpose and meaning over skills and content in the intellectual work of students (Greene, 1992; Willinsky, 1990) and teachers (Darling-Hammond, 1989; Lieberman, 1992). If Resnick and Resnick (1992) are correct in their conclusion that what isn't assessed tends to disappear from the curriculum, then we need to find ways to document the validity of assessments that support a wider range of valued educational goals. And, as Wolf, Bixby, Glenn, and Gardner (1991) have suggested, we need to "revise our notions of high-agreement reliability as a cardinal symptom of a useful and viable approach to scoring student performance" and "seek other sorts of evidence that responsible judgment is unfolding" (p. 63).

Unquestionably, reliability serves an important purpose. Underlying our concerns about reliability are both epistemological and ethical issues. These include the extent to which we can generalize to the construct of interest from particular samples of behavior evaluated by particular readers and the extent to which those generalizations are fair. There are, however, alternative means of serving those purposes. The decision about which strategy to use should depend upon the aims and consequences of the assessment in question. In the sections that follow, I explore the potential of a hermeneutic approach to drawing and warranting interpretations of human products or performances.² Although the focus here is on reliability (consistency among independent measures intended as interchangeable), it should be clear that reliability is an aspect of construct validity (consonance among multiple lines of evidence supporting the intended interpretation over alternative interpretations). And as assessment becomes less standardized, distinctions between reliability and validity blur.

A Hermeneutic Approach to Interpretation

Hermeneutics characterizes a general approach to the interpretation of meaning reflected in any human product, expression, or action, often referred to as a text or "text analog." Although hermeneutics is not a unitary tradition, most hermeneutic philosophers share a holistic and integrative approach to interpretation of human phenomena that seeks to understand the whole in light of its parts, repeatedly testing interpretations against the available evidence until each of the parts can be accounted for in a coherent interpretation of the whole. (Edited volumes by Bleicher, 1980, and Ormiston and Schrift, 1990, provide excerpts from the works of the major philosophers in the hermeneutic tradition.) Recently, a number of philosophers of science have suggested that this "hermeneutic circle" of initial interpretation, validation, and revised interpretation characterizes much that occurs in the natural as well as the social sciences (Bernstein, 1983; Diesing, 1991; Kuhn, 1986).

Hermeneutic writings are often categorized into three major perspectives reflecting differences in the relative "authority" they give to text, context, and reader in building an interpretation (Bleicher, 1980; Ormiston & Schrift, 1990; Rabinow & Sullivan, 1987; Warnke, 1987). One perspective, reflected in the writings of Schleiermacher, Dilthey, Betti, and, more recently, Hirsch, treats "hermeneutical theory" as a methodology intended to produce relatively objective or correct interpretations that reflect the original intent of the author while bracketing the preconceptions of the reader. Here, the hermeneutic circle is conceived of in terms of a dialectical relationship between the parts of the text and the whole. A second perspective, "hermeneutic philosophy," which is reflected in the writings of Heidegger and Gadamer, recognizes that the reader's preconceptions, "enabling" prejudices, or foreknowledge are inevitable and valuable in interpreting a text. In fact, they make understanding possible. Here, the hermeneutic circle is viewed as including a dialectic between reader and text to develop practically relevant knowledge. A third perspective, often called "critical" or "depth hermeneutics," reflected in the writings of Habermas and Apel, highlights the importance of considering social dynamics that may distort meaning. Here the hermeneutic circle expands to include a critique of ideology from the ideal perspective of an unconstrained communication--one in which all parties concerned (including researcher and researched) approach each other as equals. These differing perspectives provide alternative resolutions to concerns about such issues as subjectivity, objectivity, and generalizability that psychometricians have confronted in building their interpretations.

Comparing Hermeneutic and Psychometric Approaches

Major differences between hermeneutic and psychometric approaches to validity research can be characterized largely in terms of how each treats the relationships between the parts of an assessment (individual products or performances) and the whole (entire collection of available evidence) and how human judgment is used to arrive at a well-warranted conclusion.

In a typical psychometric approach to assessment, each performance is scored independently by readers who have no additional knowledge about the student or about the judgments of other readers. Inferences about achievement, competence, or growth are based upon composite scores, aggregated from independent observations across readers and performances, and referenced to relevant criteria or norm groups. These scores, whose interpretability and validity rest largely on previous research, are provided to users with guidelines for interpretation. Users are typically (and appropriately) advised to consider scores in light of other information about the individual, although mainstream validity theory provides little guidance about how to combine such information to reach a well-warranted conclusion--a task to which hermeneutic analysis is well suited.

A hermeneutic approach to assessment would involve holistic, integrative interpretations of collected performances that seek to understand the whole in light of its parts, that privilege readers who are most knowledgeable about the context in which the assessment occurs, and that ground those interpretations not only in the textual and contextual evidence available, but also in a rational debate among the community of interpreters. Here, the interpretation might be warranted by criteria like a reader's extensive knowledge of the learning context; multiple and varied sources of evidence; an ethic of disciplined, collaborative inquiry that encourages challenges and revisions to initial interpretations; and the transparency of the trail of evidence leading to the interpretations, which allows users to evaluate the conclusions for themselves.

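
Before turning to illustrations, here is a minimal sketch (not drawn from the article) of the scoring-and-aggregation step in the typical psychometric approach described above: independent reader scores are combined into a composite and referenced to a norm group. All numbers, scales, and the norm group are hypothetical. A hermeneutic reading, by contrast, would treat the same performances as one body of evidence to be interpreted as a whole rather than averaged.

```python
# Minimal sketch: composite scoring and norm-referencing with invented data.
from statistics import NormalDist, mean, stdev

# Independent holistic scores (1-6) from two readers on each of three tasks
# for one student: scores[task][reader].
scores = [
    [4, 5],  # Task 1
    [3, 3],  # Task 2
    [5, 4],  # Task 3
]

# Aggregate across readers within each task, then across tasks.
task_means = [mean(readers) for readers in scores]
composite = mean(task_means)

# Reference the composite to a (hypothetical) norm group of composites.
norm_group = [2.8, 3.1, 3.3, 3.5, 3.7, 3.9, 4.0, 4.2, 4.4, 4.8]
z = (composite - mean(norm_group)) / stdev(norm_group)
percentile = NormalDist().cdf(z) * 100

print(f"composite = {composite:.2f}, z = {z:.2f}, about the {percentile:.0f}th percentile")
```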
Illustrations of More Hermeneutic Approaches to Assessment

In higher education, we have a number of models of high-stakes assessments that are not standardized. For instance, consider how we confer graduate degrees, grant tenure, or, as my introductory paragraphs illustrated, make decisions about which articles will be published and which will not. As an extended example, consider what, based on my experience, appears to be a typical process for making decisions about hiring faculty colleagues. Candidates submit portfolios of their work and evaluations by others. While there may be some minimal standardization (e.g., three representative publications, teaching evaluations, and a letter articulating their programs of research and teaching), candidates are expected to compile evidence that they believe best represents the substance and quality of their work. Search committee members are selected because of their areas of expertise and affiliation--to cover the knowledge and political bases that a thoughtful decision requires. They are not trained to agree on a common set of criteria and standards; rather, it is expected that all will bring expertise to bear in evaluating candidates' credentials. Candidates' portfolios, interviews, and presentations are not parceled out to be evaluated independently by different committee members. Rather, each member examines all the evidence available to reach and support an integrative judgment about the qualifications of the candidates. These judgments are not aggregated to arrive at a set of scores; rather, they are brought to the table for a sometimes contentious discussion. Disagreement is taken seriously and positions are sometimes changed as different perspectives are brought to bear on the evidence. The final decision represents a consensus or compromise based on that discussion. An ethic of fairness typically pervades these discussions, as credentials are viewed and reviewed to make sure no qualified candidates have been overlooked. The recommendation, rationale, and supporting evidence are passed on for review to other levels of the system, typically including executive committees, administrators, and affirmative action committees. Taken together, these procedures serve to warrant the validity and fairness of the decision. Would we really believe the process more fair and valid if we followed more traditional assessment techniques of evaluating the parts independently and aggregating the scores?

These models are not unheard of in K-12 education. Edited volumes by Berlak et al. (1992) and Perrone (1991) describe successful examples of more contextualized and dialogic forms of assessment. For instance, at the Walden School in Racine, Wisconsin, students prepare papers or exhibits that are evaluated by a committee of teachers, parents, other students, and members of the community. At the Prospect School in North Bennington, Vermont, teachers meet regularly to discuss progress of individual students or curriculum issues, in much the same way that physicians conduct case conferences. In pilot projects in England, committees of teachers, supervisors, and others at the school level engage in periodic audits of the individual portfolios, and committees at higher levels of the system review the procedures of the school level committees to ensure that appropriate standards are being followed. (Elsewhere, my colleagues and I have provided an extended illustration; Moss et al., 1992.)

While the above examples focus primarily on individual assessment, other examples more directly address the problems of providing system level information in more dialogic and contextualized forms. In Pittsburgh, Pennsylvania, the Arts PROPEL project has involved committees of teachers in designing a districtwide portfolio assessment system and has invited educators from outside the district to audit the portfolio evaluation process (LeMahieu, Eresh, & Wallace, 1992). At the Brooklyn New School in New York City, the staff has developed a "learner centered accountability system" in which a comprehensive set of structures and processes have been set up to support opportunities for guided student choice, collaborative learning and inquiry among teachers and administrators, active involvement of students and their families in educational decisions, and regular involvement by educators and researchers from outside the school in formative evaluation activities (Darling-Hammond & Snyder, 1992). These examples of assessment are all consistent with what Darling-Hammond and Snyder call a professional model of accountability, which seeks evidence that teachers are engaging in collaborative inquiry to make knowledge-based decisions that respond to individual students' needs.

Yes, But What About Generalizability and Fairness?

Regardless of whether one is using a hermeneutic or psychometric approach to drawing and evaluating interpretations and decisions, the activity involves inference from observable parts to an unobservable whole that is implicit in the purpose and intent of the assessment. The question is whether those generalizations are best made by limiting human judgment to single performances, the results of which are then aggregated and compared with performance standards, or by expanding the role of human judgment to develop integrative interpretations based on all the relevant evidence.

With a psychometric approach, generalizability is warranted in quantitative measures of consistency across independent observations (across tasks, readers, and so on). As I argued above, the nature of the warrant privileges more standardized forms of assessment. When operationalized in this way, inadequate consistency puts the validity of the assessment use in jeopardy. While consistency or consensus supports the validity of the interpretations in both psychometric and hermeneutic approaches, the difference rests in how it is addressed. Here I will consider the way generalizations may be constructed and warranted from more hermeneutic perspectives and how this, in turn, expands possibilities for assessment.

Generalization Across Tasks

With respect to generalization across tasks, the goal of a more hermeneutic approach is to construct a coherent interpretation of the collected performances, continually revising initial interpretations until they account for all of the available evidence. Inconsistency in students' performance across tasks does not invalidate the assessment. Rather, it becomes an empirical puzzle to be solved by searching for a more comprehensive or elaborated interpretation that explains the inconsistency or articulates the need for additional evidence. A well-documented report describes the evidence available to other readers so that they may judge its adequacy for themselves in supporting the desired generalization. Moreover, when the interpretation informs a subsequent action, such as a revised pedagogical strategy, the success of the action becomes another warrant of the validity of the working interpretation. This is consistent with the characterization of the hermeneutic circle by Packer and Addison (1989) as a dialectic between problem and solution that furthers the concern of the reader.

In terms of task selection, hermeneutic approaches to assessment can allow students and others being assessed substantial latitude in selecting the products by which they will be represented--a latitude that need not be constrained by concerns about quantitative measures of consistency across tasks. As my hiring illustration suggested, permitting those assessed to choose products that best represent their strengths and interests may, in some circumstances, enhance not only validity but also fairness.

With psychometric approaches to assessment, fairness in task selection has typically been addressed by requiring that all subjects respond to equivalent tasks, which have been investigated for bias against various groups of concern (Cole & Moss, 1989). Neither approach ensures fairness: With the psychometric approach, we may present students with tasks for which there is differential familiarity, and with the hermeneutic approach, students may not be prepared to choose the products that best represent their capabilities. However, both approaches to fairness in task selection are defensible and deserve discussion.

Generalization Across Readers

With respect to generalization across readers, a more hermeneutic approach to assessment would warrant interpretations in a critical dialogue among readers that challenged initial interpretations while privileging interpretations from readers most knowledgeable about the context of assessment. Initial disagreement among readers would not invalidate the assessment; rather, it would provide an impetus for dialogue, debate, and enriched understanding informed by multiple perspectives as interpretations are refined and as decisions or actions are justified. And again, if well documented, it would allow users of the assessment information, including students, parents, and others affected by the results, to become part of the dialogue by evaluating (and challenging) the conclusions for themselves.

Concerns about the objectivity (and hence the fairness) of such a process have been thoughtfully addressed by qualitative researchers from both hermeneutic and postpositivist empirical traditions of research. Phillips (1990), a persuasive defender of postpositivist empirical research, citing Scriven's (1972) distinction between quantitative and qualitative senses of objectivity, notes that consensus or agreement among independent observations is no guarantor of objectivity. Rather, he defines objectivity, procedurally, as acceptance of a critical tradition: "The community of inquirers must be a critical community, where dissent and reasoned disputation (and sustained efforts to overthrow even the most favored of viewpoints) are welcomed as being central to the process of inquiry" (pp. 30-31). Moreover, he notes, objectivity is no guarantor of "truth":

   A critical community might never reach agreement over, say, two viable alternative views, but if both of these views have been subjected to critical scrutiny, then both would have to be regarded as objective. ... And even if agreement is reached, it can still happen that the objective view reached within such a community will turn out to be wrong--this is the cross that all of us living in the new nonfoundationalist age have to learn to bear! (p. 31)

This dialogic perspective on the role of the critical community of interpreters in an age where no knowledge is viewed as certain is consistent with the recent writing of Cronbach (1988, 1989) and Messick (1989) on the philosophy of validity. It is also consistent with the writing of hermeneutic philosophers. Here, however, a comparison among the hermeneutic perspectives that I described earlier reflects instructive differences in the role of the readers' preconceptions and the role of the power dynamics within the social context when interpretations are formed. Proponents of hermeneutic philosophy and depth hermeneutics would question the possibility of "objective" knowledge that required readers to bracket their preconceptions. Bernstein (1983), citing Gadamer, argues that we cannot bracket all our prejudices because there is no knowledge or understanding without prejudice (foreknowledge). (Imagine trying to interpret a response written in an unknown foreign language.) The point is to discriminate between blind and enabling prejudices by critically testing them in the course of inquiry.

In a very real sense, attention to reliability actually works against critical dialogue, at least at one phase of inquiry. It leads to procedures that attempt to exclude, to the extent possible, the values and contextualized knowledge of the reader and that foreclose on dialogue among readers about the specific performances being evaluated. A hermeneutic approach to assessment encourages such dialogue at all phases of the assessment. As Bernstein (1983) argues, the absence of a sure foundation against which to test knowledge claims does not condemn us to relativism: Themes in the work of Gadamer, Habermas, and others writing in the hermeneutic and critical traditions look to "dialogue, conversation, undistorted communication, communal judgment, and the type of rational wooing that can take place when individuals confront each other as equals" (p. 223).

If interpretations are warranted through critical dialogue, then the question of who participates in the dialogue becomes an issue of power, as proponents of critical or depth hermeneutics would remind us. In articulating criteria for valid assessment in the service of accountability purposes, a number of assessment specialists have explicitly advised against using the judgments of classroom teachers (e.g., Mehrens, 1992; Resnick & Resnick, 1992). Resnick and Resnick, for instance, assert:

   A principal requirement of accountability and program evaluation tests is that they permit detached and impartial judgments of students' performance, that is, judgments by individuals other than the students' own teachers, using assessment instruments not of the teachers' devising. ... Like accountability tests, selection and certification tests must be impartial. The public function of certification would not be met if teachers were to grade the performance of their own students. (pp. 48-50)

In contrast, other educators raise concerns about the absence of teachers' voices in mechanisms of accountability that affect them and their students (e.g., Darling-Hammond & Snyder, 1992; Erickson, 1986; Lieberman, 1992). Erickson, for instance, laments the fact that teachers' accounts of their own practices typically have no place in the discourse of schooling. He notes that in other professions, including medicine, law, and social work, "it is routine for practitioners to characterize their own practice, both for purposes of basic clinical research and for the evaluation of their services" (p. 157) and that "the lack of these opportunities [for teachers] is indicative of the relative powerlessness of the profession outside the walls of the classroom" (p. 157). Similar concerns have been raised about the role of students in assessments that have consequences in their lives (e.g., Greene, 1992; Willinsky, 1990).

From a psychometric perspective, the call for "detached and impartial" high-stakes assessment reflects a profound concern for fairness to individual students and protection of stakeholders' interests by providing accurate information.

From a hermeneutic perspective, however, it can be criticized as arbitrarily authoritarian and counterproductive, because it silences the voices of those who are most knowledgeable about the context and most directly affected by the results. Quantitative definitions of reliability locate the authority for determining meaning with the assessment developers. In contrast, Gadamer (cited in Bernstein, 1983) argues that the point of philosophical hermeneutics is to correct "the peculiar falsehood of modern consciousness: the idolatry of scientific method and of the anonymous authority of the sciences" (p. 40) and to vindicate "the noblest task of the citizen--decision-making according to one's own responsibility--instead of conceding that task to the expert" (p. 40).

Of course, the validity of any consequential interpretation, including the extent to which it is free from inappropriate or "disabling" prejudices, must be carefully warranted through critical, evidence-based review and dialogue. The process proposed is not dissimilar from the way decisions are made and warranted in the law (see Ricoeur, 1981). Again, neither a psychometric nor a hermeneutic approach guarantees fairness; however, a consideration of the assumptions and consequences associated with both approaches leads to a better informed choice.

Implications

I now return to my title, "Can there be validity without reliability?" When reliability is defined as consistency among independent measures intended as interchangeable, the answer is, yes. Should there be? Here, the answer is, it depends on the context and purposes for assessment. My argument shares much with Mishler's (1990) views on reliability as a means of warranting knowledge claims:

   Reformulating validation as the social discourse through which trustworthiness is established elides such familiar shibboleths as reliability, falsifiability, and objectivity. These criteria are neither trivial nor irrelevant, but they must be understood as particular ways of warranting validity claims rather than as universal, abstract guarantors of truth. They are rhetorical strategies ... that fit only one model of science. (p. 420)

Like Mishler, I am not advocating the abandonment of reliability. Rather, I am advocating that we consider it one alternative for serving important epistemological and ethical purposes--an alternative that should always be justified in critical dialogue and in confrontation with other possible means of warranting knowledge claims. As Messick (1989) has advised, such confrontations between epistemologies illuminate assumptions, consequences, and the values implied therein. Ultimately, the purpose of educational assessment is to improve teaching and learning. If reliability is put on the table for discussion, if it becomes an option rather than a requirement, then the possibilities for designing assessment and accountability systems that reflect a full range of valued educational goals become greatly expanded.

I believe the dialogue I have proposed here is not only timely but urgent. We are at a crossroads in education: There is a crisis mentality accompanied by a flurry of activity to design assessment and accountability systems that both document and promote desired educational change. Current conceptions of reliability and validity in educational measurement constrain the kinds of assessment practices that are likely to find favor, and these in turn constrain educational opportunities for teachers and students. A more hermeneutic approach to assessment would lend theoretical support to new directions in assessment and accountability that honor the purposes and lived experiences of students and the professional, collaborative judgments of teachers. Exploring the dialectic between hermeneutics and psychometrics should provoke and inform a much needed debate among those who develop and use assessments about why particular methods of validity inquiry are privileged and what the effects of that privileging are on the community.

Epilogue

My friend and collaborator, Roberta Herter, who is a veteran English teacher in a large urban school district, tells the story of "Cory," one of her former 10th-grade students. Cory's experience puts a human face on the "detached and impartial" nature of psychometrically sound standardized assessments and illustrates the potential consequences of devaluing more contextualized and dialogic approaches to assessment:

   When he first took the competency exam mandated by the district in 1989, the writing proficiency component required Cory to produce a paragraph of at least five sentences about his experience with friendship. Using the language of the prompt to guide his opening sentence, Cory responded to the test prompt by relating a story about influencing friends to quit smoking while attempting to maintain his relationship with them. The eight sentences he wrote responded to the prompt as if they had been rehearsed, practiced in classroom exercises in preparation for the exam, conforming to the minimum response required to pass the test. The anonymous readers of his exam both rated him a 3.5 on a 5.0 holistic scale, a sufficient score to pass the paragraph portion of the exam.

   Cory failed the exam, however, because he did not pass the multiple choice portion of the writing test. Even though his writing demonstrated that he could apply mechanics appropriately, he also needed to demonstrate that he could recognize errors such as misplaced punctuation marks and lack of subject-verb agreement in decontextualized sentences. The decontextualized editing tasks required of the multiple choice portion of the exam failed him.

   When contrasted with the lively writing from his folder, work collected over a semester, the writing test distorted and underestimated Cory's capabilities. He wrote on racism, Malcolm X, teen pregnancy, drugs, issues important to him and the community in which he lives. He stood out among his peers as a good writer--a thoughtful, intelligent student who put his writing to real purposes--letters to pen friends, the eulogy he wrote and delivered at his uncle's funeral, and plays on current issues of interest to his classmates read and performed in class.

   His versatility as a writer, demonstrated by his ability to write for a variety of audiences in appropriate registers, set him apart from many of his peers who had not achieved Cory's degree of competence. Where Cory's test profile painted a picture of a formulaic writer who could not recognize errors in English usage, his folder showed evidence of a purposeful writer capable of producing controlled and coherent prose. His letters, raps, library reports for science and history, his journal documenting his personal growth and changing attitudes were powerful indicators of his potential for success both in and out of school. In an interview on his own learning he defined education as something ultimately, "you have to do for yourself." He showed himself to be a responsible student, a reflective, critical thinker, conscious of the choices afforded him by the school he attended.

   Cory's failure on the exam consigned him to a reading competency class and a class in writing improvement, a low-track English elective where students who fail the exam label themselves LD or learning disabled. Both of these tracked classes were designed to prepare students to pass the test so they might receive an endorsed diploma at graduation--classifying them as having achieved minimum competency in basic skills of reading, writing, and math, or reducing the value of their diploma to a certificate of attendance. But Cory didn't wait to take the test again; he dropped out of day school in his senior year. (adapted from Moss & Herter, 1993)

Notes

¹Dunbar, Koretz, and Hoover (1991), in a review of empirical research on performance assessment, describe reliability estimates based on the average of coefficients reported for each of nine studies, adjusted via the Spearman-Brown formula to reflect an assessment based on a single reader and sample of performance. For the seven studies that took place after 1984, reader reliability ranged from .59 to .91 and task or "score" reliability ranged from .27 to .60. Koretz (1993), describing inter-reader reliability on portfolios from Vermont's statewide assessment, reports that correlations between readers ranged from .33 to .43, with raters assigning the same score between about 50% and 60% of the time.

²Other articles have suggested the use of qualitative methods for validity research with less standardized forms of assessment. See, for example, Hipps (1992), Johnston (1992), and Moss et al. (1992). Hermeneutics provides a philosophical underpinning largely consistent with these authors' methodological suggestions. Cherryholmes (1988) suggests that other research discourses, including phenomenology, critical theory, interpretive analytics, and deconstructionism, can each contribute, in different ways, to validity research. Mishler (1990) and Johnston (1989) echo similar themes.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
Berlak, H., Newmann, F. M., Adams, E., Archbald, D. A., Burgess, T., Raven, J., & Romberg, T. A. (1992). Toward a new science of educational testing and assessment. Albany: State University of New York Press.
Bernstein, R. J. (1983). Beyond objectivism and relativism: Science, hermeneutics, and praxis. Philadelphia: University of Pennsylvania Press.
Bleicher, J. (1980). Contemporary hermeneutics: Hermeneutics as method, philosophy, and critique. London: Routledge and Kegan Paul.
Breland, H. M., Camp, R., Jones, R. J., Morris, M. M., & Rock, D. A. (1987). Assessing writing skill (Research Monograph No. 11). New York: College Entrance Examination Board.
Cherryholmes, C. H. (1988). Construct validity and the discourses of research. American Journal of Education, 96, 421-457.
Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-219). Washington, DC: The American Council on Education and the National Council on Measurement in Education.
Corbett, H. D., & Wilson, B. L. (1991). Testing, reform, and rebellion. Norwood, NJ: Ablex.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Holt, Rinehart, & Winston.
Cronbach, L. J. (1980). Validity on parole: How can we go straight? New directions for testing and measurement: Measuring achievement, progress over a decade, No. 5 (pp. 99-108). San Francisco: Jossey-Bass.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer (Ed.), Test validity. Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy. Urbana: University of Illinois Press.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper & Row.
Darling-Hammond, L. (1989). Accountability for professional practice. Teachers College Record, 91, 59-80.
Darling-Hammond, L., & Snyder, J. (1992). Reframing accountability: Creating learner-centered schools. In A. Lieberman (Ed.), The changing contexts of teaching (91st Yearbook of the National Society for the Study of Education). Chicago: University of Chicago Press.
Diesing, P. (1991). How does social science work? Reflections on practice. Pittsburgh: University of Pittsburgh Press.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289-303.
Erickson, F. (1986). Qualitative methods in research on teaching. In M. C. Wittrock (Ed.), Handbook of research on teaching (pp. 119-161). New York: Macmillan.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed.). Washington, DC: The American Council on Education and the National Council on Measurement in Education.
Frederiksen, J. R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18(9), 27-32.
Greene, M. (1992). Evaluation and dignity. Quarterly of the National Writing Project, 14, 10-13.
Hipps, J. A. (1992, April). New frameworks for judging alternative assessments. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco.
Johnston, P. (1989). Constructive evaluation and the improvement of teaching and learning. Teachers College Record, 90, 509-528.
Johnston, P. H. (1992). Constructive evaluation of literate activity. New York: Longman.
Johnston, P. H., Weiss, P., & Afflerbach, P. (1990). Teachers' evaluation of the teaching and learning in literacy and literature (Report Series 3.4). Albany: State University of New York at Albany, Center for the Learning and Teaching of Literature.
Koretz, D. (1993). New report on Vermont Portfolio Project documents challenges. National Council on Measurement in Education Quarterly Newsletter, 1(4), 1-2.
Koretz, D., McCaffrey, D., Klein, S., Bell, R., & Stecher, B. (1992). The reliability of scores from the 1992 Vermont Portfolio Assessment Program: Interim report. Santa Monica, CA: RAND Institute on Education and Training, National Center for Research on Evaluation, Standards, and Student Testing.
Kuhn, T. S. (1986). The essential tension: Selected studies in scientific tradition and change. Chicago: The University of Chicago Press.
LeMahieu, P. G., Eresh, J. T., & Wallace, R. C., Jr. (1992). Using student portfolios for public accounting. The School Administrator, 49(11), 8-15.
Lieberman, A. (1992). The meaning of scholarly activity and the building of community. Educational Researcher, 21(6), 5-12.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
Mehrens, W. A. (1992). Using performance assessment for accountability purposes. Educational Measurement: Issues and Practice, 11(1), 3-20.
Messick, S. (1964). Personality measurement and college performance. In Proceedings of the 1963 Invitational Conference on Testing Problems (pp. 110-129). Princeton, NJ: Educational Testing Service.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: The American Council on Education and the National Council on Measurement in Education.
Messick, S. (1992, April). The interplay of evidence and consequences in the validation of performance assessments. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco.
Mishler, E. G. (1990). Validation in inquiry-guided research. Harvard Educational Review, 60, 415-442.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229-258.

Moss, P. A., Beck, J. S., Ebbs, C., Herter, R., Matson, B., Muchmore, J., Steele, D., & Taylor, C. (1992). Portfolios, accountability, and an interpretive approach to validity. Educational Measurement: Issues and Practice, 11(3), 1-11.
Moss, P. A., & Herter, R. (1993). Assessment, accountability, and authority in urban schools. The Long Term View, 1(4), 68-75.
Newmann, F. M. (1990). Higher order thinking in teaching social studies: A rationale for the assessment of classroom thoughtfulness. Journal of Curriculum Studies, 22(1), 41-56.
Nystrand, M., Cohen, A. S., & Martinez, N. M. (1993). Addressing reliability problems in the portfolio assessment of college writing. Educational Assessment, 1(1), 53-70.
Ormiston, G. L., & Schrift, A. D. (Eds.). (1990). The hermeneutic tradition: From Ast to Ricoeur. Albany: State University of New York Press.
Packer, M. J., & Addison, R. B. (1989). Entering the circle: Hermeneutic investigation in psychology. Albany: State University of New York Press.
Perrone, V. (Ed.). (1991). Expanding student assessment. Washington, DC: Association for Supervision and Curriculum Development.
Phillips, D. C. (1990). Subjectivity and objectivity: An objective inquiry. In E. W. Eisner & A. Peshkin (Eds.), Qualitative inquiry in education: The continuing debate (pp. 19-37). New York: Teachers College Press.
Rabinow, P., & Sullivan, W. M. (Eds.). (1987). Interpretive social science: A second look. Berkeley: University of California Press.
Resnick, L. B., & Resnick, D. (1992). Assessing the thinking curriculum: New tools for educational reform. In B. Gifford & M. C. O'Connor (Eds.), Cognitive approaches to assessment. Boston: Kluwer-Nijhoff.
Ricoeur, P. (1981). The model of the text: Meaningful action considered as a text. In P. Ricoeur (J. B. Thompson, Ed. and Trans.), Hermeneutics and the human sciences. New York: Cambridge University Press.
Scriven, M. (1972). Objectivity and subjectivity in educational research. In L. G. Thomas (Ed.), Philosophical redirection of educational research (71st Yearbook of the National Society for the Study of Education, Part 1). Chicago: The University of Chicago Press.
Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30, 215-232.
Smith, M. L. (1991). Put to the test: The effects of external testing on teachers. Educational Researcher, 20(5), 8-11.
Warnke, G. (1987). Gadamer: Hermeneutics, tradition, and reason. Stanford, CA: Stanford University Press.
Wiley, D. E., & Haertel, E. H. (in press). Extended assessment tasks: Purposes, definitions, scoring, and accuracy. In R. Mitchell & M. Kane (Eds.), Implementing performance assessment: Promises, problems, and challenges. Washington, DC: Pelavin Associates.
Willinsky, J. (1990). The new literacy: Redefining reading and writing in the schools. New York: Routledge.
Wolf, D., Bixby, J., Glenn, J., III, & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. Review of Research in Education, 17, 31-74.

Received April 19, 1993
Revision received August 10, 1993
Accepted August 30, 1993
