Wiggins Atruetest Kappan89 PDF
Phi Delta Kappa International is collaborating with JSTOR to digitize, preserve and extend access to The Phi
Delta Kappan.
Toward More Authentic and Equitable Assessment
704 PHI DELTA KAPPAN
tory. The tests grew out of the "school efficiency" movement in the years between 1911 and 1916, a time oddly similar to our own. The movement, spearheaded by the work of Franklin Bobbitt, was driven by crude and harmful analogies drawn from Frederick Taylor's management principles, which were used to improve factory production. Raymond Callahan notes that the reformers, then as now, were far too anxious to satisfy external critics and to reduce complex intellectual standards and teacher behaviors to simple numbers and traits.4 Implicitly, there were signs of hereditarian and social-class-based views of intelligence; the tests were used as sorting mechanisms at least partly in response to the increased heterogeneity of the school population as a result of the influx of immigrants.5

The "standards" were usually cast in terms of the increased amount of work to be demanded of teachers and students. As George Strayer, head of the National Education Association (NEA) Committee on Tests and Standards for School Efficiency, reported, "We may not hope to achieve progress except as such measuring sticks are available." A school superintendent put it more bluntly: "The results of a few well-planned tests would carry more weight with the businessman and parent than all the psychology in the world."6

Even with unionization and the insights gained from better education, modern teachers still fall prey to the insistent claims of noneducation interests. The wishes of college admissions officers, of employers, of budget makers, of schedulers, and even of the secretaries who enter grades on computers often take precedence over the needs of students to be properly examined and the needs of teachers to deliberate and confer about effective test design and grading.

Thus, when teachers regard tests as something to be done as quickly as possible after "teaching" has ended in order to shake out a final grade, they succumb to the same flawed logic employed by the test companies (with far less statistical justification). Such acquiescence is possible only when the essential ideas and priorities in education are unclear or have been lost. If tests serve only as administrative monitors, then short-answer, "objective" tests - an ironic misnomer7 - will suffice (particularly if one teaches 128 students and has only a single day in which to grade final exams). However, if a test is seen as the heart and soul of the educational enterprise, such reductionist shortcuts, such high student/teacher ratios, and such dysfunctional allocation of time and resources will be seen as intolerable.

Schools and teachers do not tolerate the same kind of thinking in athletics, the arts, and clubs. The requirements of the game, recital, play, debate, or science fair are clear, and those requirements determine the use of time, the assignment of personnel, and the allocation of money. Far more time - often one's spare time - is devoted to insuring adequate practice and success. Even in the poorest schools, the ratio of players to interscholastic coaches is about 12 to 1.8 The test demands such dedication of time; coaching requires one-to-one interaction. And no one complains about teaching to the test in athletic competition.

We need to begin anew, from the premise that a testing program must address questions about the inevitable impact of tests (and scoring methods) on students and their learning. We must ask different questions. What kinds of challenges would be of most educational value to students? What kinds of challenges would give teachers useful information about the abilities of their students? How will the results of a test help students know their strengths and weaknesses on essential tasks? How can a school adequately communicate its standards to interested outsiders and justify them, so that standardized tests become less necessary and less influential?

AUTHENTIC TESTS

Tests should be central experiences in learning. The problems of administration, scoring, and between-school comparisons should come only after an authentic test has been devised - a reversal of the current practice of test design. If we wish to design an authentic test, we must first decide what are the actual performances that we want students to be good at. We must design those performances first and worry about a fair and thorough method of grading them later. Do we judge our students to be deficient in writing, speaking, listening, artistic creation, finding and citing evidence, and problem solving? Then let the tests ask them to write, speak, listen, create, do original research, and solve problems. Only then need we worry about scoring the performances, training the judges, and adapting the school calendar to insure thorough analysis and useful feedback to students about results.

This reversal in thinking will make us pay more attention to what we mean by evidence of knowing. Mastery is more than producing verbal answers on cue; it involves thoughtful understanding, as well. And thoughtful understanding implies being able to do something effective, transformative, or novel with a problem or complex situation. An authentic test enables us to watch a learner pose, tackle, and solve slightly ambiguous problems. It allows us to watch a student marshal evidence, arrange arguments, and take purposeful action to address the problems.9 Understanding is often best seen in the ability to criticize or extend knowledge, to explain and explore the limits and assumptions on which a theory rests. Knowledge is thus displayed as thoughtful know-how - a blend of good judgment, sound habits, responsiveness to the problem at hand, and control over the appropriate information and context. Indeed, genuine mastery usually involves even more: doing something with grace and style.

To prove that an answer was not an accident or a thoughtless (if correct) response, multiple and varied tests are required. In performance-based areas we do not assess competence on the basis of one performance. We repeatedly assess a student's work - through a portfolio or a season of games. Over time and in the context of numerous performances, we observe the patterns of success and failure and the reasons behind them. Traditional tests - as arbitrarily timed, superficial exercises (more like drills on the practice field than like a game) that
ers.16 You must complete an oral history based on interviews and written sources and present your findings orally in class. The choice of subject matter will be up to you. Some examples of possible topics include: your family, running a small business, substance abuse, a labor union, teenage parents, or recent immigrants. You are to create three workable hypotheses based on your preliminary investigations and come up with four questions you will ask to test each hypothesis.

To meet the criteria for evaluating the oral history project described above, you must:
* investigate three hypotheses;
* describe at least one change over time;
* demonstrate that you have done background research;
* interview four appropriate people as sources;
* prepare at least four questions related to each hypothesis;
* ask questions that are not leading or biased;
* ask follow-up questions when appropriate;
* note important differences between fact and opinion in answers that you receive;
* use evidence to support your choice of the best hypothesis; and
* organize your writing and your class presentation.

A course-ending simulation/exam in economics.17 You are the chief executive officer of an established firm. Your firm has always captured a major share of the market, because of good use of technology, understanding of the natural laws of constraint, understanding of market systems, and the maintenance of a high standard for your product. However, in recent months your product has become part of a new trend in public tastes. Several new firms have entered the market and have captured part of your sales. Your product's proportional share of total aggregate demand is continuing to fall. When demand returns to normal, you will be controlling less of the market than before.

Your board of directors has given you less than a month to prepare a report that solves the problem in the short run and in the long run. In preparing the report, you should: 1) define the problem, 2) prepare data to illustrate the current situation, 3) prepare data to illustrate conditions one year in the future, 4) recommend action for today, 5) recommend action over the next year, and 6) discuss where your company will be in the market six months from today and one year from today.

[Cartoon caption: "Ms. Kelsor says I'd do better with a team of teachers. She thinks six or eight would be about right."]

The tasks that must be completed in the course of this project include:
* deriving formulas for supply, demand, elasticity, and equilibrium;
* preparing schedules for supply, demand, costs, and revenues;
* graphing all work;
* preparing a written evaluation of the current and future situation for the market in general and for your company in particular;
* preparing a written recommendation for your board of directors;
* showing aggregate demand today and predicting what it will be one year hence; and
* showing the demand for your firm's product today and predicting what it will be one year hence.

Connecticut has implemented a range of performance-based assessments in science, foreign languages, drafting, and small-engine repair, using experts in the field to help develop apt performance criteria and test protocols. Here is an excerpt from the Connecticut manual describing the performance criteria for foreign languages; these criteria have been derived from the guidelines of the American Council on the Teaching of Foreign Languages (ACTFL).18 On the written test, students are asked to draft a letter to a pen pal. The four levels used for scoring are novice, intermediate, intermediate high, and advanced; they are differentiated as follows:
* Novice. Students use high-frequency words, memorized phrases, and formulaic sentences on familiar topics. Students show little or no creativity with the language beyond the memorized patterns.
* Intermediate. Students recombine the learned vocabulary and structures into simple sentences. Sentences are choppy, with frequent errors in grammar, vocabulary, and spelling. Sentences will be very simple at the low end of the intermediate range and will often read very much like a direct translation of English.
* Intermediate high. Students can write creative sentences, sometimes fairly complex ones, but not consistently. Structural forms reflecting time, tense, or aspect are attempted, but the result is not always successful. Students show an emerging ability to describe and narrate in paragraphs, but papers often read like academic exercises.
* Advanced. Students are able to join sentences in simple discourse and have sufficient writing vocabulary to express themselves simply, although the language may not be idiomatic. Students show good control of the most frequently used syntactic structures and a sense that they are comfortable with the target language and can go beyond the academic task.

Of course, using such an approach is time-consuming, but it is not impractical or inapplicable to all subject areas on a large scale. The MAP (Monitoring Achievement in Pittsburgh) testing program offers tests of critical thinking and writing that rely on essay questions and are specifically designed to provide diagnostic information to teachers and
and correctly answered "no one," since "all-around" could mean "winner of all events"? If looked at in this way, couldn't it be that the child was more thoughtful than most by deliberately not taking the bait of part B (which presumably would have caused the child to pause and consider his or her answer)? The full-sentence answer in part B - remember, this is a 9-year-old - is revealing to me. It is more emphatic than the answer to part A, as if to say, "Your question suggests I should have found one all-around winner, but I won't be fooled. I stick to my answer that no one was the all-around winner." (Note, by the way, that in the scorer's manual the word all-around has been changed to overall.) The student did not, of course, explain the answer, but it is conceivable that the instruction was confusing, given that there was no "work" needed to determine that "no one" was the all-around winner. One quick follow-up question could have settled the matter.

A moral question with intellectual ramifications is at issue here: Who is responsible for insuring that an answer has been fully explored or understood, the tester or the student? One reason to safeguard the teacher's role as primary assessor is that the most accurate and equitable evaluation depends on relationships that have developed over time between examiner and student. The teacher is the only one who knows what the student can or cannot do consistently, and the teacher can always follow up on confusing, glib, or ambiguous answers.

In this country we have been so enamored of efficient testing that we have overlooked feasible in-class alternatives to such impersonal testing, which are already in use around the world. The German abitur (containing essay and oral questions) is designed and scored by classroom teachers, who submit two possible tests to a state board for approval. The APU in Great Britain has for more than a decade developed tests that are designed for classroom use and that involve interaction between assessor and student.

What is so striking about many of the APU test protocols is that the assessor is meant to probe, prompt, and even teach, if necessary, to be sure of the student's actual ability and to enable the learner to learn from the assessment. In many of these tests the first answer (or lack of one) is not deemed a sufficient insight into the student's knowledge.23 Consider, for example, the following sections from the assessor's manual for a mathematics test for British 15-year-olds covering the ideas of perimeter, area, and circumference.

1. Ask: "What is the perimeter of a rectangle?" [Write student answer.]
2. Present sheet with rectangle ABCD. Ask: "Could you show me the perimeter of this rectangle?" If necessary, teach.
3. Ask: "How would you measure the perimeter of the rectangle?" If necessary, prompt for full procedure. If necessary, teach. . . .
10. "Estimate the length of the circumference of this circle."
11. Ask: "What would you do to check your estimate?" [String is on the table.] If no response, prompt for string.
13. Ask: "Is there any other method?" If student does not suggest using C = πd, prompt with, "Would it help to measure the diameter of the circle?"

The scoring system works as follows: 1) unaided success; 2) success following one prompt from the tester; 3) success following a series of prompts; 4) teaching by the tester, prompts unsuccessful; 5) an unsuccessful response, and tester did not prompt or teach; 6) an unsuccessful response despite prompting and teaching; 7) question not given; and 8) unaided success where student corrected an unsuccessful attempt without help. The "successful" responses were combined into two larger categories called "unaided success" and "aided success," with percentages given for each.24
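
As an illustration only, the eight-code scoring scheme just described can be sketched as a small tally. The text does not say exactly which codes were folded into "unaided success" and "aided success," nor what denominator the percentages used, so the grouping below (codes 1 and 8 as unaided, codes 2 and 3 as aided, code 4 counted as not successful) and the choice to compute percentages over the questions actually given are assumptions. The names `CODES`, `UNAIDED`, `AIDED`, and `summarize` are hypothetical.

```python
from collections import Counter

# The eight scoring codes, as listed in the assessor's manual excerpt.
CODES = {
    1: "unaided success",
    2: "success following one prompt",
    3: "success following a series of prompts",
    4: "teaching by the tester, prompts unsuccessful",
    5: "unsuccessful response, no prompt or teaching",
    6: "unsuccessful response despite prompting and teaching",
    7: "question not given",
    8: "unaided success after self-correction",
}

# ASSUMPTION (not stated in the source): which codes feed each category.
UNAIDED = {1, 8}
AIDED = {2, 3}
NOT_GIVEN = 7


def summarize(codes):
    """Return the percentage of administered questions in each success category."""
    unknown = [c for c in codes if c not in CODES]
    if unknown:
        raise ValueError(f"unknown scoring codes: {unknown}")
    # ASSUMPTION: questions that were never given are left out of the denominator.
    given = [c for c in codes if c != NOT_GIVEN]
    if not given:
        return {"unaided success": 0.0, "aided success": 0.0}
    counts = Counter(given)
    unaided = sum(counts[c] for c in UNAIDED)
    aided = sum(counts[c] for c in AIDED)
    return {
        "unaided success": 100 * unaided / len(given),
        "aided success": 100 * aided / len(given),
    }


if __name__ == "__main__":
    # One student's codes across five questions (invented for illustration).
    print(summarize([1, 1, 2, 5, 7]))
```

Keeping the two success categories separate, rather than collapsing them into a single "correct" count, preserves the distinction the protocol is built around: what a student can do alone versus what the student can do with a prompt.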
The Australians for years have used similar tasks and similarly trained teachers to conduct district- and statewide assessments in academic subject areas (much as we do in this country with the Advanced Placement exams). Teachers give tests made up of questions drawn from banks of agreed-upon items and then mark them. Reliability is achieved through a process called "moderation," in which teachers of the same subjects gather to compare results and to set criteria for grading.

To insure that professionalization is aided, not undermined, by national testing, the process of "group moderation" has been made a central feature of the proposed new national assessment system in Great Britain. The tests will be both teacher-given and standardized. But what is so admirable - and equitable - is that
MAY 1989 709
the process of group moderation requires collective judgments about any discrepancies between grade patterns in different schools and between results in a given school and on the nationally standardized criterion-referenced test. Significantly, the process of moderation can, on occasion, override the results of the nationally standardized test:

A first task of a moderation group would be to examine how well the patterns of the two matched for each group of pupils [comparing percentages of students assigned to each level]. . . . The meeting could then go on to explore discrepancies in the pattern of particular schools or groups, using samples of pupils' work and knowledge of the circumstances of schools. The group moderation would first explore any general lack of matching between the overall teacher rating distribution and the overall distribution of results on the national tests. The general aim would be to adjust the overall teacher rating results to match the overall results of the national tests; if the group were to have clear and agreed reasons for not doing this, these should be reported . . . [and] departures could be approved if the group as a whole could be convinced that they were justified in particular cases.25 (Emphasis added)

development that teachers need and desire. Both equity in testing and reform of schooling ultimately depend on a more open and consensual process of establishing and upholding schoolwide standards.

A number of reasons are often cited for retaining "objective" tests (the design of which is usually quite "subjective"), among them: the unreliability of teacher-created tests and the subjectivity of human judgment. However, reliability is only a problem when judges operate in private and without shared criteria. In fact, multiple judges, when properly trained to assess actual student performance using agreed-upon criteria, display a high degree of inter-rater reliability. In the Connecticut foreign language test described above, on the thousands of student tests given, two judges using a four-point scoring system agreed on a student's score 85% of the time.26 Criticisms of Advanced Placement exams that contain essay questions usually focus on the cost of scoring, not on problems of inter-rater reliability. Inadequate testing technology is a red herring. The real problem standing in the way of developing more authentic assessment with collaborative standard-setting is the lack of will to invest the necessary time and money.

True criterion-referenced tests and diploma requirements, though difficult to frame in performance standards, are essential for establishing an effective and just education system. We must overcome the lazy habit of grading and scoring "on the curve" as a cheap way of setting and upholding standards. Such a practice is unrelated to any agreed-upon intellectual standards and can reveal only where students stand in relation to one another. It tells us nothing about where they ought to be. Moreover, students are left with only a letter or number - with nothing to learn from.

Consider, too, that the bell-shaped curve is an intended result in designing a means of scoring a test, not some coincidental statistical result of a mass testing. Norm-referenced tests, be they locally or nationally normed, operate under the assumption that teachers have no effect - or only a random effect - on students.

There is nothing sacred about the normal curve. It is the distribution most appropriate to chance and random activity. Education is a purposeful activity, and we seek to have the students learn what we have to teach. . . . [W]e may even insist that our efforts are unsuccessful to the extent that the distribution of achievement approximates the normal distribution.27

In addition, such scoring insures that, by design, at least half of the student population is always made to feel inept and discouraged about their work, while the other half often has a feeling of achievement that is illusory.

Grading on a curve in the classroom is even less justifiable. There is no statistical validity to the practice, and it allows teachers to continually bypass the harder but more fruitful work of setting and teaching performance criteria from which better learning would follow.

To let students show off what they
know and are able to do is a very different business from the fatalism induced by counting errors on contrived questions. Since standardized tests are designed to highlight differences, they often end up exaggerating them (e.g., by throwing out pilot questions that everyone answers correctly in order to gain a useful "spread" of scores).28 And since the tasks are designed around hidden and often arbitrary questions, we should not be surprised if the test results end up too dependent on the native language ability or cultural background of the students, instead of on the fruit of their best efforts.

Tracking is the inevitable result of grading on a curve and thinking of standards only in terms of drawing exaggerated comparisons between students. Schools end up institutionalizing these differences, and, as the very word track implies, the standards for different tracks never converge. Students in the lower tracks are not taught and assessed in such a way that they become better enabled to close the gap between their current competence and ideal standards of performance.29 Tracking simply enables students in the lower tracks to get higher grades.

In the performance areas, by contrast, high standards and the incentives for students are clear.30 Musicians and athletes have expert performers constantly before them from which to learn. We set up different weight classes for wrestling competition, different rating classes for chess tournaments, and separate varsity and junior varsity athletic teams to nurture students' confidence as they slowly grow and develop their skills. We assume that progress toward higher levels is not only possible but is aided by such groupings.

The tangible sense of efficacy (aided by the desire to do well publicly and the power of positive peer pressure) that these extracurricular activities provide is a powerful incentive. Notice how often some students will try to sneak back into school after cutting class to submit themselves to the rigors of athletics, debate, or band practice - even when they are not the stars or when their team has an abysmal record.31

CRITERIA OF AUTHENTICITY

From the arguments and examples above, let me move to a consideration of a set of criteria by which we might distinguish authentic from inauthentic forms of testing.32

Structure and logistics. Authentic tests are more appropriately public, involving an actual audience, client, panel, and so on. The evaluation is typically based on judgment that involves multiple criteria (and sometimes multiple judges), and the judging is made reliable by agreed-upon standards and prior training.

Authentic tests do not rely on unrealistic and arbitrary time constraints, nor do they rely on secret questions or tasks. They tend to be like portfolios or a full season's schedule of games, and they emphasize student progress toward mastery.

Authentic tests require some collaboration with others. Most professional challenges faced by adults involve the capacity to balance individual and group achievement. Authentic tests recur, and they are worth practicing, rehearsing, and retaking. We become better educated by taking the test over and over. Feedback to students is central, and so authentic tests are more intimately connected with the aims, structures, schedules, and policies of schooling.

Intellectual design features. Authentic tests are not needlessly intrusive, arbitrary, or contrived merely for the sake of shaking out a single score or grade. Instead, they are "enabling" - constructed to point the student toward more sophisticated and effective ways to use knowledge. The characteristics of competent performance by which we might sort nonenabling from enabling tests might include: "The coherence of [the student's] knowledge, principled [as opposed to merely algorithmic] problem solving, usable knowledge, attention-free and efficient performance, and self-regulatory skills."33

Authentic tests are contextualized, complex intellectual challenges, not fragmented and static bits or tasks. They culminate in the student's own research or product, for which "content" is to be mastered as a means, not as an end. Authentic tests assess student habits and repertoires; they are not simply restricted to recall and do not reflect lucky or unlucky one-shot responses. The portfolio is the appropriate model; the general task is to assess longitudinal control over the essentials.34

Authentic tests are representative challenges within a given discipline. They are designed to emphasize realistic (but fair) complexity; they stress depth more than breadth. In doing so, they must necessarily involve somewhat ambiguous, ill-structured tasks or problems, and so they make student judgment central in posing, clarifying, and tackling problems.

Standards of grading and scoring. Authentic tests measure essentials, not easily counted (but relatively unimportant) errors. Thus the criteria for scoring them must be equally complex, as in the cases of the primary-trait scoring of essays or the scoring of ACTFL tests of foreign languages. Nor can authentic tests be scored on a curve. They must be scored with reference to authentic standards of performance, which students must understand to be inherent to successful performance.

Authentic tests use multifaceted scoring systems instead of a single aggregate grade. The many variables of complex performance are disaggregated in judging. Moreover, self-assessment becomes more central.35

Authentic tests exist in harmony with schoolwide aims; they embody standards to which everyone in the school can aspire. This implies the need for schoolwide policy-making bodies (other than academic departments) that cross disciplinary boundaries and safeguard the essential aims of the school. At Alverno College in Milwaukee, all faculty members are both members of disciplinary departments and of "competency groups" that span all departments.

Fairness and equity. Rather than rely on right/wrong answers, unfair "distractors," and other statistical artifices to widen the spread of scores, authentic tests ferret out and identify (perhaps hidden) strengths. The aim is to enable the students to show off what they can do. Authentic tests strike a constantly examined balance between honoring achievement, progress, native language skill, and prior fortunate training. In doing so, they can better reflect our intellectual values.

Authentic tests minimize needless, unfair, and demoralizing comparisons and do away with fatalistic thinking about results. They also allow appropriate room to accommodate students' learning styles, aptitudes, and interests. There is room for the quiet "techie" and the show-off prima donna in plays; there is room for the slow, heavy lineman and for the small, fleet pass receiver in football. In professional work, too, there is room for choice and style in tasks, topics, and methodologies. Why must all students be tested in the same way and at the same time? Why should speed of recall be so well-rewarded and slow answering be so heavily penalized in conventional testing?36

Authentic tests can be - indeed, should be - attempted by all students, with the tests "scaffolded up," not "dumbed down" as necessary to compensate for poor skill, inexperience, or weak training. Those who use authentic tests should welcome student input and feedback. The model here is the oral exam for graduate students, insuring that the student is given ample opportunity to explain his or her work and respond to criticism as integral parts of the assessment.

In authentic testing, typical procedures of test design are reversed, and accountability serves student learning. A model task is first specified. Then a fair and incentive-building plan for scoring is devised. Only then would reliability be considered. (Far greater attention is paid throughout to the test's "face" and "ecological" validity.)

As I said at the outset, we need a new philosophy of assessment in this country that never loses sight of the student. To build such an assessment, we need to return to the roots of authentic assessment, the assessment of performance of exemplary tasks. We might start by adopting the manifesto in the introduction of the new national assessment report in Great Britain, a plan that places the interests of students and teachers first:

Any system of assessment should satisfy general criteria. For the purpose of national assessment we give priority to the following four criteria:
* the assessment results should give direct information about pupils' achievement in relation to objectives: they should be criterion-referenced;
* the results should provide a basis for decisions about pupils' further learning needs: they should be formative;
* the grades should be capable of comparison across classes and schools . . . so the assessments should be calibrated or moderated;
* the ways in which criteria are set up and used should relate to expected routes of educational development, giving some continuity to a pupil's assessment at different ages: the assessments should relate to progression.37

teaching begins with the freedom and responsibility to set and uphold clear, appropriate standards - a feat that is impossible when tests are seen as onerous add-ons for "accountability" and are designed externally (and in secret) or administered internally in the last few days of a semester or year.

The redesign of testing is thus linked to the restructuring of schools. The restructuring must be built around intellectual standards, however, not just around issues involving governance, as has too often been the case so far. Authentic restructuring depends on continually asking a series of questions: What new methods, materials, and schedules are required to test and teach habits of mind? What structures, incentives, and policies will insure that a school's standards will be known, reflected in teaching and test design, coherent schoolwide, and high enough but still reachable by most students? Who will monitor for teachers' failure to comply? And what response to such failure is appropriate? How schools frame diploma requirements, how the schedule supports a school's aims, how job descriptions are written, how hiring is carried out, how syllabi and exams are designed, how the grading system reinforces standards, and how teachers police themselves are all inseparable from the reform of assessment.

Authentic tests must come to be seen as so essential that they justify disrupting the habits and spending practices of conventional schoolkeeping. Otherwise standards will simply be idealized, not made tangible. Nor is it "soft-hearted" to worry primarily about the interests of students and teachers: reform has little to do with pandering and everything to do with the requirements for effective learning and self-betterment. There are, of course, legitimate reasons for taking the intellectual pulse of students, schools, or school systems through standardized tests, particularly when the results are used as an "anchor" for school-based assessment (as the British propose). But testing through matrix sampling and other less intrusive methods can and should be more often used.

Only such a humane and intellectually valid approach to evaluation can help us
The task is to define reliable assess insureprogress towardnational intellec
ment in a differentway, committingor tual fitness.As long aswe hold simplis
reallocatingthe time andmoney needed ticmonitoring teststobe adequatemodels
to obtainmore authentic and equitable of and incentives for reaching our in
testswithin schools. As theBritish pro tellectualstandards,studentperformance,
posals imply, theprofessionalizationof teaching,andour thiinking anddiscussion
about assessment will remain flaccid and uninspired.

1. For an explanation of the State reports of above-average test scores, see Daniel Koretz, "Arriving in Lake Wobegon: Are Standardized Tests Exaggerating Achievement and Distorting Instruction?," American Educator, Summer 1988, pp. 8-15, 46-52; and Edward Fiske, "Questioning an American Rite of Passage: How Valuable Is the SAT?," New York Times, 1 January 1989.
2. Norman Frederiksen, "The Real Test Bias: Influences of Testing on Teaching and Learning," American Psychologist, vol. 39, 1984, p. 200.
3. Grant Wiggins, "Rational Numbers: Scoring and Grading That Helps Rather Than Hurts Learning," American Educator, Winter 1988, pp. 20, 25, 45, 48.
4. Raymond Callahan, Education and the Cult of Efficiency (Chicago: University of Chicago Press, 1962), pp. 80-84.
5. David Tyack, The One Best System: A History of American Urban Education (Cambridge, Mass.: Harvard University Press, 1974), pp. 140-46.
6. Callahan, pp. 100-101.
7. Richard J. Stiggins, "Revitalizing Classroom Assessment: The Highest Instructional Priority," Phi Delta Kappan, January 1988, pp. 363-68.
8. Peter Elbow points out that in all performance-based education, the teacher goes from being the student's adversary to being the student's ally. See Peter Elbow, Embracing Contraries: Explorations in Teaching and Learning (New York: Oxford University Press, 1986).
9. For more on content as knowledge in use and on the design of curricula and tests around "essential questions," see Grant Wiggins, "Creating a Thought-Provoking Curriculum," American Educator, Winter 1987, pp. 10-17.
10. Gilbert Ryle, The Concept of Mind (London: Hutchinson Press, 1949).
11. M. McCloskey, A. Carramaza, and B. Green, "Naive Beliefs in 'Sophisticated' Subjects: Misconceptions About Trajectories of Objects," Cognition, vol. 9, 1981, pp. 117-23.
12. See also Walter Haney, "Making Testing More Educational," Educational Leadership, October 1985, pp. 4-13.
13. Robert Glaser, "Cognitive and Environmental Perspectives on Assessing Achievement," in Eileen Freeman, ed., Assessment in the Service of Learning: Proceedings of the 1987 ETS Invitational Conference (Princeton, N.J.: Educational Testing Service, 1988), pp. 40-42; and idem, "The Integration of Instruction and Testing," in Eileen Freeman, ed., The Redesign of Testing for the 21st Century: Proceedings of the 1985 ETS Invitational Conference (Princeton, N.J.: Educational Testing Service, 1986).
14. Frederiksen, p. 199.
15. For a complete account of the nine "Common Principles," see Theodore R. Sizer, Horace's Compromise: The Dilemma of the American High School, updated ed. (Boston: Houghton Mifflin, 1984), Afterword. For a summary of the idea of "exhibitions," see Grant Wiggins, "Teaching to the (Authentic) Test," Educational Leadership, April 1989.
16. I wish to thank Albin Moser of Hope High School in Providence, R.I., for this example. For an account of a performance-based history course, including the lessons used and pitfalls encountered, write to David Kobrin, Department of Education, Brown University, Providence, RI 02912.
17. I wish to thank Dick Esner of Brighton High School in Rochester, N.Y., for this example. Details on the ground rules, the information supplied for the simulation, the logistics, and the evaluation can be obtained by writing to Esner.
18. Manuals are available from the Office of Research and Evaluation, Connecticut Department of Education, P.O. Box 2219, Hartford, CT 06115. For further information on the ACTFL guidelines and their use, see ACTFL Provisional Proficiency Guidelines (Hastings-on-Hudson, N.Y.: American Council on the Teaching of Foreign Languages, 1982); and Theodore Higgs, ed., Teaching for Proficiency, the Organizing Principle (Lincolnwood, Ill.: National Textbook Co. and ACTFL, 1984).
19. See Paul LeMahieu and Richard Wallace, "Up Against the Wall: Psychometrics Meets Praxis," Educational Measurement: Issues and Practice, vol. 5, 1986, pp. 12-16; and Richard Wallace, "Redirecting a School District Based on the Measurement of Learning Through Examination," in Freeman, The Redesign of Testing . . . , pp. 59-68.
20. Daniel P. Resnick and Lauren B. Resnick, "Standards, Curriculum, and Performance: A Historical and Comparative Perspective," Educational Researcher, vol. 14, 1985, pp. 5-21.
21. Aristotle Nicomachean Ethics 1137b25-30.
22. Learning by Doing: A Manual for Teaching and Assessing Higher-Order Thinking in Science and Mathematics (Princeton, N.J.: Educational Testing Service, Report No. 17-HOS-80, 1987).
23. Similar work on a research scale is being done in the U.S. as part of what is called "diagnostic achievement assessment." See Richard Snow, "Progress in Measurement, Cognitive Science, and Technology That Can Change the Relation Between Instruction and Assessment," in Freeman, Assessment in the Service of Learning . . . , pp. 9-25; and J. S. Brown and R. R. Burton, "Diagnostic Models for Procedural Bugs in Basic Mathematical Skills," Cognitive Science, vol. 2, 1978, pp. 155-92.
24. Mathematical Development, Secondary Survey Report No. 1 (London: Assessment of Performance Unit, Department of Education and Science, 1980), pp. 98-108.
25. Task Group on Assessment and Testing, (TGAT) Report (London: Department of Education and Science, 1988), Paragraphs 73-75.
26. Personal communication from Joan Baron, director of the Connecticut Assessment of Educational Progress.
27. Benjamin Bloom, George Madaus, and J. Thomas Hastings, Evaluation to Improve Learning (New York: McGraw-Hill, 1981), pp. 52-53.
28. Jeannie Oakes, Keeping Track: How Schools Structure Inequality (New Haven, Conn.: Yale University Press, 1985), pp. 10-13.
29. Ibid.
30. On the engaging quality of "exhibitions" of mastery, see Sizer, pp. 62-68.
31. For various group testing and grading strategies, see Robert Slavin, Using Student Team Learning, 3rd ed. (Baltimore: Johns Hopkins Team Learning Project Press, 1986).
32. Credit for some of these criteria are due to Arthur Powell, Theodore Sizer, Fred Newmann, and Doug Archbald and to the writings of Peter Elbow and Robert Glaser.
33. Glaser, "Cognitive and Environmental . . . ," pp. 38-40.
34. See the work of the ARTS Propel project, headed by Howard Gardner, in which the portfolio idea is described as it is used in pilot schools in Pittsburgh. ARTS Propel is described in "Learning from the Arts," Harvard Education Letter, September/October 1988, p. 3. Many members of the Coalition of Essential Schools use portfolios to assess students' readiness to graduate.
35. Alverno College has prepared material on the hows and whys of self-assessment. See Faculty of Alverno College, Assessment at Alverno, rev. ed. (Milwaukee: Alverno College, 1985).
36. For a lively discussion of the research results on the special ETS testing conditions for dyslectics, who are given unlimited time, see "Testing, Equality, and Handicapped People," ETS Focus, no. 21, 1988.
37. Task Group on Assessment and Testing, (TGAT) Report (London: Department of Education and Science, 1988), Paragraph 5.

[Cartoon caption: "You changed an F to a B and a C to an A. Nice work, son!"]