Review of Research in Education, Vol. 27 (http://rre.aera.net)
Chapter 2
LAURA HAMILTON
RAND Corporation
This chapter focuses on the use of tests or assessments1 as instruments for promot-
ing educational change. Recent large-scale education reform efforts, along with
state and federal legislation, illustrate the growing importance that policymakers
and education reformers are attaching to accountability for student performance. The
emphasis on testing and accountability may be attributable to a number of factors,
including the relatively low cost of these reforms; the fact that they can be externally
mandated and, to some extent, controlled; and the relative speed with which these
policies can be implemented (Linn, 2000). Despite their popularity, in most cases these
reforms are not guided by a careful investigation of the probable consequences of using
tests as accountability tools.
In this chapter, I provide some brief background on recent and current uses of large-
scale achievement tests and discuss what is currently known about the effects of large-
scale testing on several outcomes. I provide some recommendations to foster more
effective use of tests and discuss some of the limitations of relying heavily on tests as
policy tools. The focus of the chapter is on tests used at the K–12 level in the United
States, but much of the discussion is applicable to postsecondary testing and to test use
in other countries. Also, the chapter focuses primarily on large-scale, externally man-
dated tests, such as those administered as part of district and state testing programs, but
it also addresses classroom-based assessments as a potentially important component of
any testing policy initiative. Finally, the chapter addresses tests that are used to support
accountability systems and to provide instructional feedback to teachers; it does not
address tests intended primarily for selecting students into programs or institutions or
for diagnosing individual learning needs (e.g., college admissions tests or IQ tests; for
a recent review of the history and uses of these types of tests, see Gipps, 1999).
When examining the use of tests, it is critical to take into consideration the purpose
of the testing system. The National Research Council Committee on the Foundations
of Assessment (2001) noted three broad purposes for large-scale achievement tests:
(a) assessment to assist learning, also called formative assessment; (b) assessment of indi-
vidual achievement, also called summative assessment; and (c) assessment to evaluate
the quality and effectiveness of educational programs. Most uses of tests fall into one
of these categories, and most emphasize the use of tests as tools for measuring a con-
struct such as mathematics achievement. Increasingly, however, tests are being adopted
as components of broader reform efforts and are being designed not only to produce
information but to create improvements in the educational system by signaling what
students are expected to learn and by creating motivational conditions that promote cer-
tain outcomes. Even when tests are not explicitly intended to serve as levers to change
instruction, their use may lead to consequences that are serious and unanticipated by
their developers or users. The importance of examining consequences of test use has
been recognized by professional organizations that publish guidelines for appropriate
use of tests. Most notably, the Standards for Educational and Psychological Testing
(American Educational Research Association [AERA], American Psychological Asso-
ciation [APA], & National Council on Measurement in Education [NCME], 1999)
specify that consequences should be examined as part of any validity investigation (see
also Cronbach, 1988; Messick, 1989). An examination of consequences is especially
important for testing programs that are intended to serve as policy tools.
Before discussing what is known about the effects of testing, it is important to keep
in mind what a test is, and what kinds of inferences it can support. By their nature,
tests cannot tell us everything we would like to know about a student’s competencies.
Performance on a test provides information about a sample of examinee behavior under
certain, very specific conditions. For certain kinds of tests, such as essay tests or port-
folios, test performance may represent a direct sample of the behaviors in which we are
interested, whereas for the majority of large-scale testing programs, the sample of
behavior that the test provides is an indicator of a much broader set of proficiencies,
most of which cannot be tested directly (Haertel, 1999). From the sample, users make
inferences about how scores generalize to the broader domain in which they are inter-
ested, and the degree to which these inferences are warranted is in part a function of
the adequacy with which the test samples the domain. When users of test results inter-
pret scores on a standardized, multiple-choice mathematics achievement test, for exam-
ple, it is likely that their inferences about what these scores mean extend beyond the
ability to answer certain types of closed-ended questions; in other words, the users’
target of inference is broader than the specific behaviors elicited by the test (Koretz,
McCaffrey, & Hamilton, 2001). To the extent that test performance serves as an accu-
rate indicator of the broader skills in which users are interested, users’ inferences will
be appropriate. However, if test performance fails to provide a reasonable indication of
students’ mastery of the knowledge and skills in which users are interested, the validity
of users’ inferences will be compromised. Some of the unintended negative conse-
quences of testing discussed in this chapter contribute to a mismatch between the infer-
ences warranted on the basis of test scores and the inferences users make about them.
Large-scale testing in the United States has a long history. As early as the mid-19th century, a written examination designed to compare
schools’ effectiveness was implemented in Boston (Resnick, 1982), and this test had
many of the features associated with today’s large-scale tests. In particular, it was
intended to provide efficient measurement for large numbers of students and to facil-
itate comparisons across classrooms and schools. Testing took on a new role during the
years of World War I, that of selecting individuals into programs or institutions. The
first large-scale group intelligence test, the Army Alpha, was published in 1917, and
the first standardized achievement test battery, the Stanford Achievement Tests, was
published in 1923 (Resnick, 1982). The achievement tests that followed in the sub-
sequent three decades were primarily intended to assess the competencies of individual
students and evaluate the effectiveness of specific curriculum programs (Goslin, 1963).
Over the following years, the use of testing expanded dramatically, in terms of both the
numbers of students affected and the purposes for which tests were used (Haney, 1981).
The creation of the National Assessment of Educational Progress (NAEP) and the
enactment of the original Title I legislation led to the first formal uses of tests as mon-
itoring devices (Hamilton & Koretz, 2002; Koretz, 1992) and may be considered the
precursors to today’s widespread use of tests as tools for holding educators accountable
for student performance (Roeber, 1988).
High-stakes uses of tests for individuals for purposes other than selection were rare
after World War II but began to increase in the 1970s, when minimum competency
testing became widespread (Jaeger, 1982). The minimum competency testing move-
ment emphasized the need to ensure that students demonstrated a grasp of basic skills
and, in many instances, led states or districts to prohibit failing students from gradu-
ating or from being promoted to the next grade. Thus, this movement represents
the first formal use of tests as tools to hold students and teachers accountable for
performance in recent decades (Hamilton & Koretz, 2002). In addition, minimum-
competency tests were intended to serve as signals to students and teachers of what
should be taught and learned, and this movement marked a shift toward what Popham
and others (Popham, 1987; Popham, Cruse, Rankin, Sandifer, & Williams, 1985)
called measurement-driven instruction, reflecting a belief that instruction should be
shaped by tests.
This emphasis on using tests to change instruction continued into the 1980s, a
decade characterized by deep concern over what was perceived to be poor performance
on the part of American students. These concerns are expressed vividly in A Nation at
Risk (National Commission on Excellence in Education, 1983), which led to a nation-
wide reform effort that included an increased reliance on testing (Pipho, 1985) as well
as an expansion of the kinds of stakes attached to scores, including school-level incen-
tives such as financial rewards or interventions (Koretz, 1992). The presumed positive
effects of testing on both student and teacher motivation have represented a primary
rationale for expanding the stakes attached to scores (National Council on Education
Standards and Testing, 1992). In addition, the focus on minimum competency shifted
to a call for high, rigorous standards for all students, and for tests that would be aligned
with those standards and would encourage teachers to teach to them (National Coun-
cil on Education Standards and Testing, 1992; Resnick & Resnick, 1992; Smith &
O’Day, 1990).2 Clear links among testing, standards, and curriculum, in addition to
formal stakes, were believed to enhance motivation (Smith, O’Day, & Cohen, 1990).
At the same time, testing has continued to be used as a mechanism to document the
performance of American students and thereby justify the need for reforms (Linn, 1993;
U.S. Congress, Office of Technology Assessment, 1992). All of these trends reflect
a gradual but steady shift from the use of tests as measurement instruments designed
to produce information to a reliance on tests to influence policy and instruction. This
dual use of tests has continued to the present day.
Performance assessments have been defined broadly as any test that does not use a selected-response (multiple-choice, true–false) format
(Madaus & O’Dwyer, 1999, p. 689). Performance assessments require the subjective
judgment of a rater, unlike selected-response items that are typically machine scored.
Advocates of performance assessments have claimed that this form of assessment will
improve instruction and facilitate more valid inferences about examinees’ capabilities,
particularly with respect to problem solving and higher order thinking skills (Madaus,
1993). However, validity studies have demonstrated that performance tests frequently
fail to tap the processes and skills their developers intended (Baxter & Glaser, 1998;
Hamilton et al., 1997; Linn, Baker, & Dunbar, 1991), and concerns about technical
quality and costs are likely to prohibit states from using performance assessments in
their accountability systems (Linn, 1993; Mehrens, 1998; National Center for Educa-
tion Statistics [NCES], 1996). To illustrate the latter point, the U.S. General Account-
ing Office (2003) recently estimated the cost to states of implementing NCLB using
only multiple-choice tests as approximately $1.9 billion, whereas the cost if states also
include a small number of hand-scored open-response items such as essays would be
about $5.3 billion. The magnitude of this difference suggests that many states will be
reluctant to adopt non-multiple-choice formats, at least until the testing technology
becomes less expensive.
As a result of the technical and cost constraints inherent in testing large num-
bers of students across multiple classrooms, schools, and districts, most large-scale
testing programs to date assess only a limited range of desired outcomes. Tests’
failure to measure the full range of achievement outcomes that
schools are believed to influence limits their utility for instructional purposes and
threatens the validity of inferences users make from test results. These threats to
validity are of particular concern when results have stakes attached for students or
educators. The fact that tests typically assess only a subset of important skills and
knowledge has implications for the effects of large-scale testing programs, as I dis-
cuss subsequently.
In the next section, I return to the topic of large-scale testing and examine what is
known about the effects of large-scale testing on student achievement and other out-
comes. At the end of that section, I discuss the evidence on effects of classroom-based
assessment. The chapter concludes with a set of suggestions for the design of testing
programs that will promote beneficial outcomes.
Of particular concern is the way in which educators may reallocate their efforts away
from content that is not tested and toward content that is tested. The extent to which
the validity of inferences may be compromised by such reallocation depends on the
nature of the reallocation and on the specific inferences that users make. If teachers
focus more effort on their states’ content standards, and if users interpret scores as indi-
cators of proficiency on those standards, it is likely that users will make reasonably valid
inferences from any changes in test scores that result. However, if teachers emphasize
aspects of the test that are incidental to the construct the test is designed to measure
(e.g., certain item styles or formats), or if they emphasize the specific skills included in
the test without addressing skills or knowledge that are not tested but that are part of
users’ targets of inference, validity may be compromised. Koretz et al. (2001) identi-
fied seven types of teacher responses to high-stakes testing, some of which they called
positive (providing more instructional time, working harder to cover more material,
working more effectively), one of which they labeled negative (cheating), and three of
which they said were ambiguous—that is, their impact may be positive or negative
depending on the nature of the response and the context in which it takes place. These
three ambiguous responses—reallocating instructional time, aligning instruction with
standards,6 and coaching by focusing on incidental aspects of the test7—are addressed
by some of the studies reviewed subsequently.
Before discussing the evidence on the effects of testing on practice, it is worth point-
ing out that testing is likely to serve as one influence among many, and the effects of
testing will in most cases interact with other factors such as teachers’ own beliefs and
knowledge about pedagogy and subject matter, their professional development experi-
ences, their access to necessary resources including appropriate curriculum and instruc-
tional materials, and the responses of their colleagues and supervisors (Cimbricz, 2003;
Cohen, 1995; Haertel, 1999; O’Day, Goertz, & Floden, 1995). Therefore, the effects
of testing may be limited to surface-level responses rather than deeper changes in the
nature of instruction (Firestone, Mayrowetz, & Fairman, 1998). Also, the effects of test-
ing on instruction are likely to vary by stakes. If tests do not have clear consequences
attached to them, teachers may pay little attention. However, when test scores are asso-
ciated with consequences that are important or meaningful to teachers, it is likely that
instruction will be affected. The empirical evidence, though not extensive, supports
this distinction (Mehrens, 1998), particularly a recent national survey of teachers con-
ducted by Pedulla et al. (2003). Most of the studies of effects on practice, particularly
those involving surveys, report average responses that mask some of these important
interactions and influences.
Classroom-Level Effects
There is some evidence that large-scale testing has led to some of the outcomes that
Koretz et al. (2001) classified as beneficial responses. Research suggests that teachers
may respond by focusing their efforts more strongly on achievement than they had in
the past (Wolf, Borko, McIver, & Elliott, 1999) and by working harder (Bishop &
Mane, 1999). Additional evidence of positive effects was reported by Shepard and
Dougherty (1991), whose research indicated that many teachers in two districts with
high-stakes tests believed that the test results were helpful for identifying student
strengths and weaknesses and for attracting resources for students who needed them.
Other studies indicate neutral or negative effects. Firestone et al. (1998) found little
evidence of significant changes among teachers facing a high-stakes testing program in
Maryland, and Jones et al. (1999) reported roughly equal percentages of teachers who
increased and decreased their use of various practices. Using classroom artifacts and
interviews with teachers, McDonnell and Choisser (1997) found that although assess-
ment programs in North Carolina and Kentucky led to changes in instructional
approaches, the depth and complexity of instruction were not affected.
Much of the literature that examines reallocation suggests that tests may exert a neg-
ative effect on curriculum and instructional practice by narrowing teachers’ focus exces-
sively. Several studies demonstrate that teachers tend to deemphasize subjects such as
science, social studies, art, and writing that are not part of many testing programs
(Jones et al., 1999; Koretz, Barron, Mitchell, & Stecher, 1996; Shepard & Dougherty,
1991; Smith, Edelsky, Draper, Rottenberg, & Cherland, 1991; Stecher, Barron,
Chun, & Ross, 2000). In Kentucky in the late 1990s, for example, where some sub-
jects were tested in fourth grade and others were tested in fifth grade, teachers in the
grades at which a subject was tested reported spending more time teaching that subject
than teachers in the grades at which the subject was not tested (Stecher & Barron,
1999). Teachers in Colorado reported that their practices were positively influenced by
the state standards but that the state’s high-stakes testing program tended to have a
negative influence, particularly by causing teachers to reduce time spent on social stud-
ies and science and to eliminate projects, lab work, and other activities not represented
in the test. Moreover, these changes were more frequent at low-performing than at
high-performing schools (Taylor, Shepard, Kinner, & Rosenthal, 2003). These Col-
orado teachers also believed that the scores were not very useful for instructional feed-
back and expressed concern that the state’s accountability system would result in funds
being taken away from needy schools (Taylor et al., 2003).
Reallocation also occurs within subjects, across tested and untested topics and skills.
Some teachers report altering the sequence in which they present topics to accommo-
date the testing schedule, making sure that tested topics are presented before the test-
ing date and saving other topics for the end of the school year (Corbett & Wilson, 1988;
Darling-Hammond & Wise, 1985). Perhaps more important, several studies indi-
cate that teachers have increased coverage on tested topics and skills while decreasing
emphasis on topics and skills not included in the test. In a study of two districts con-
ducted by Shepard and Dougherty (1991), substantial majorities of teachers reported
increasing the amount of time allocated to basic skills, vocabulary, and computation in
mathematics. Romberg, Zarinnia, and Williams (1989) found that among a national
sample of eighth-grade mathematics teachers, a majority had increased coverage of basic
skills and computation while decreasing emphasis on extended projects and other activ-
ities not emphasized by most tests. In language arts, teachers in Arizona reported
neglecting nontested parts of the curriculum, including certain types of writing (Smith
et al., 1991).
Some responses focus heavily on the specific format in which test questions are pre-
sented: Teachers in two Arizona schools studied by Smith and Rottenberg (1991)
reported having students solve only the types of mathematics word problems that
appeared on the state test, and in a study conducted by Darling-Hammond and Wise
(1985), teachers reported reducing their reliance on essay tests and administering more
quizzes designed to mirror the format of items on the standardized tests given in their
schools. In the Shepard and Dougherty (1991) study of two districts, teachers of writ-
ing said they had begun to emphasize having students look for mistakes in written
work rather than produce their own writing, as a result of the format of the writing test
used in those districts. The practice of adopting instructional materials (including class-
room assessments) that are designed to mirror the format of the state test is more com-
monly reported among teachers in states with high-stakes testing programs than in
other states (Pedulla et al., 2003).
Studies of test preparation, which generally address activities such as practicing on
released forms of the test, indicate that in at least some schools, substantial amounts of
instructional time are consumed by these activities. Tepper (2002) found that teachers
in Chicago increased time spent on test preparation substantially when a high-stakes
testing program was introduced. Smith (1994) reported up to 100 hours per course
among teachers in Arizona, and Jones et al. (1999) found that a majority of teachers
in North Carolina said they devoted more than 20% of instructional time to test prac-
tice. In Colorado, teachers reported spending large amounts of time on practice tests
and teaching of test-taking strategies, particularly at lower performing schools (Taylor
et al., 2003). The Pedulla et al. (2003) comparison of high- and low-stakes states
indicates that teachers in high-stakes states spend more time on test preparation,
begin it earlier in the year, and are more likely to use specific types of materials—
those that resemble the state test, are prepared by the state or commercial publishers,
or include released items—than are teachers in low-stakes states (see also Corbett &
Wilson, 1991).
The sole unambiguously negative response identified by Koretz et al. (2001) was
cheating, which can take many forms including failing to follow test-administration
instructions, inappropriately exposing students to copies of the test, and changing stu-
dent responses before submitting answer sheets for scoring. Although there are no
comprehensive studies of the frequency of cheating, sizable minorities of teachers in
two states reported that inappropriate test-administration practices such as rephrasing
questions during testing occurred in their schools (Koretz, Barron, et al., 1996; Koretz,
Mitchell, Barron, & Keith, 1996). Jacob and Levitt (2002) found that instances of
cheating increased when high-stakes testing was introduced in Chicago, but they also
noted that this increase was responsible for only a small portion of the test-score gains
observed there.
Together, these findings suggest that high-stakes testing does influence instruction,
at times in significant ways. Indeed, tests seem to exert a more powerful influence than
standards, even though the latter are intended to be the primary vehicle through which
information about instructional goals is conveyed (see, e.g., Clarke et al., 2003). Some
of these responses may be a result of teachers’ inability to distinguish between ethical
and unethical test-preparation practices; the boundaries are not always clear, particu-
larly with regard to practices such as focusing on the specific objectives measured by a
test (Mehrens & Kaminski, 1989). Because of the weaknesses inherent in most efforts
to measure instructional practice, it is not always possible to determine whether these
effects are beneficial or harmful or to determine with certainty which features of the
testing programs contributed to the effects. However, the frequency and extent of real-
location, both within and across subjects, should certainly be kept in mind by users
who are interpreting results from high-stakes tests and by policymakers who wish to
use testing as a means to shape curriculum and instruction.
Performance assessments carry their own risks of narrowing: Teachers may emphasize the particular types of problem solving tapped by the test and ignore broader conceptions of problem solving
(Stecher & Mitchell, 1995). Not only do teachers pay attention to the test items, but
they also shift their instruction and evaluation strategies to match the rubrics that are
used to score the assessments (Mabry, 1999). As Firestone et al. (1998) point out, teach-
ers’ beliefs and knowledge of pedagogy and subject matter are likely to exert a greater
influence on practice than can be achieved with any testing program, and because most
teachers are accustomed to focusing on small, discrete problems and covering many
topics in a somewhat shallow way, the implementation of performance assessment alone
is unlikely to affect practice beyond fairly small, surface-level changes.
The limited generalizability of scores on performance assessments is another prob-
lem that is partly related to the possible effects of performance assessment on curricu-
lum narrowing. Several studies have documented that scores from various types of
performance assessments tend to have low levels of reliability and do not generalize well
to other kinds of assessments measuring similar constructs (Baker, O’Neil, & Linn,
1993; Dunbar, Koretz, & Hoover, 1991; Miller & Seraphine, 1993; Shavelson,
Baxter, & Pine, 1992). Although raters are a commonly cited source of variability, for
most performance assessment applications, task sampling is a larger problem than rater
sampling (Dunbar et al., 1991). The specific format that is used has a large effect on
performance and can reduce generalizability. Baxter, Shavelson, Goldman, and Pine
(1992), for example, used several methods to measure students’ science achievement
and found that changes in scores on a notebook-based task did not correlate well with
changes in scores on tasks that involved direct observation of students’ performance.
This failure to generalize not only reduces the utility of performance assessments as a
method for gathering information about student skills and knowledge, but may exac-
erbate the problem of curriculum narrowing by encouraging educators to focus on a
specific task type as the most promising means of raising scores. Some of the problems
that have been identified are attributable in part to the specific features of performance
assessments that are used for external accountability purposes (see Haertel, 1999) but
also to the limitations of tests as policy instruments, a topic to which I return in the
final section of the chapter.
School-Level Effects
The instructional effects of testing can extend beyond the classroom. Research has
addressed school-level responses that are likely to lead to changes in the nature and
quality of curriculum and instruction provided to students. Some studies that included
principal surveys found that many principals reported increasing teacher profes-
sional development opportunities, adding summer or after-school sessions to provide
additional instruction to low-performing students, and revising curriculum programs
(Stecher et al., 2000; Stecher & Chun, 2001). On the other hand, some kinds of
responses reported by principals may be problematic in that they appear to be designed
to improve test scores without necessarily improving the overall quality of education
offered to students in the school. For example, approximately one third of principals
in Maryland reported reassigning teachers between tested and untested grades to im-
prove the quality of instruction in the grades for which the test was administered
(Koretz, Mitchell, et al., 1996). In general, there is evidence that, when faced with
strong incentive systems, school personnel tend to focus more on short-term test-score
gains than on long-term instructional improvement (O’Day, in press). Other studies
have reported frequent use of student-level incentives such as field trips or parties as
rewards for good test performance (Stecher et al., 2000); whether this represents a
desirable or harmful practice obviously depends on how it is implemented and how
students respond.
Teachers’ perceptions of testing also shape school climate: Many teachers believe that the
tests are poor measures of students’ skills and knowledge, particularly in the case of
special education students, minority students, and English language learners (Pedulla
et al., 2003).
Students’ emotional responses to testing represent another critical component of
school climate. Majorities of teachers believe that testing increases students’ stress lev-
els and negatively affects student morale and self-confidence, according to several
studies that have surveyed teachers (Koretz, Barron, et al., 1996; Mabry et al., 2003;
Miller, 1998). Comparing states with high-stakes tests and those without such tests,
Pedulla et al. (2003) found that teachers in high-stakes states were more likely to
report that students felt intense pressure to do well on tests, and although a majority
in both types of states believed student morale was generally high, teachers in low-
stakes states were more likely to rate student morale as high than were teachers in
high-stakes states. Although these studies are suggestive, they rely on teacher percep-
tions, and there is little direct evidence of how testing actually affects student morale
(Mehrens, 1998). A few studies suggest that high-stakes testing may improve stu-
dents’ motivation to learn (Betts & Costrell, 2001; Roderick & Engel, 2001), but the
motivational effects are likely to vary according to the age of the student (Phillips &
Chin, 2001) and the specific features of the incentive system. In particular, to the
extent that students view the tests as too difficult or the goals as unrealistic, the moti-
vational effects may be negligible or even negative (Linn, 1993). The overall lack of
evidence regarding student morale, stress, and motivation is due in part to the diffi-
culty that researchers have in gaining access to students and measuring their levels of
these constructs (Stecher, 2002).
Parent involvement and support for testing may influence the climate of the school
as well. Nationwide, parental support for some aspects of today’s testing policies has
been fairly high in recent years (Public Agenda, 2000), but in certain settings, particu-
larly among parents of children in high-scoring suburban schools, testing has been met
with intense parental resistance (see, e.g., Zernike, 2001) and has led to the formation
of organized parent groups such as the Parents’ Coalition to Stop High-Stakes Testing
in New York (McDonnell, 2002). It is not clear how these reactions affect school cli-
mate, but as the provisions of NCLB begin to affect families it will be important to
monitor parents’ responses.
Isolating the achievement effects of test-based accountability is difficult: High-stakes
testing is typically implemented alongside many other education reform efforts, and
achievement trends may be confounded with
characteristics of states and districts that influence both achievement and the likelihood
that accountability policies will be adopted (Carnoy & Loeb, 2002). However, as with
the research on effects on practice, there is a growing body of evidence that may pro-
vide some indication of what we can expect as high-stakes testing and accountability
become more widespread. When reviewing this research, it is important to distinguish
between studies that examine gains on the high-stakes test itself and those that incor-
porate information from another test of the same subject. Gains that are observed only
on the test used in the accountability system may not provide sufficient evidence of
true gains in achievement, because scores on accountability tests may become inflated,
a problem I discuss subsequently.
Some research suggests a link between accountability and student achievement on
the high-stakes test, particularly when accountability systems include high school exit
exams (Fredericksen, 1994; Winfield, 1990), though one study that examined high
school exit exams and that controlled for individual student characteristics (unlike
most of the research on this topic) found no such relationship (Jacob, 2001). At the
lower grades, Roderick, Jacob, and Bryk (2002) examined achievement trends among
students in Chicago who were subjected to test-based grade promotion policies.
Increases in gains in the grades at which students were held accountable were substan-
tial in both reading and mathematics and tended to be larger for students attending
low-performing schools than for students of similar ability attending higher perform-
ing schools. A study of the Dallas school accountability program by Ladd (1999) indi-
cated that the performance of seventh-grade students in Dallas improved more rapidly
than the performance of seventh graders in surrounding districts after the accountabil-
ity system was introduced. She found no such effect for third graders but attributed
this in part to problems with the testing system and exclusion rates at that grade.
Studies examining changes in test scores using measures other than the primary
accountability test have produced mixed results. In a study of states’ NAEP trends,
Grissmer, Flanagan, Kawata, and Williamson (2000) attributed improvements on
NAEP to the creation of test-based accountability systems in certain states, though
this study did not empirically test the link between accountability systems and per-
formance on NAEP. A comparison of countries and Canadian provinces showed
that those with exit exams had higher average achievement among middle-school-
aged students on external tests (including the Third International Mathematics and
Science Study [TIMSS]) than those without exit exams (Bishop, 1998); however, it
was not possible to control for all of the factors that may have contributed to those
differences. More recently, Carnoy and Loeb (2002) conducted what is to date the
most comprehensive analysis of the relationship between state-level accountability
policies and NAEP gains. These researchers controlled for contextual factors at the
state level, including demographics and funding, as well as for prior student achieve-
ment. Their results suggest a link between state accountability policies and gains on
NAEP mathematics scores. Analyses conducted by Amrein-Beardsley and Berliner
(2003), however, attributed much of the apparent gain in NAEP scores among states
with high-stakes tests to increased rates of exclusion in those states; in other words,
the increases in the high-stakes states appear to be due in large part to changes in the
tested population rather than real effects of the testing policies. Overall, then, the
evidence, though not extensive, points to some positive relationships between high-
stakes testing and student achievement but also raises questions about the sources of
those relationships, and the evidence is not sufficient to infer a causal relationship
between stakes and NAEP scores.
Despite the evidence of improved achievement on non-accountability tests, the increases on NAEP have been several times smaller than the increases on state
tests (Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz & Barron, 1998). Most
of these studies examined aggregate achievement trends. Jacob (2002), in contrast, was
able to track individual students over time using data from the Chicago Public Schools.
In this study, scores on the Iowa Test of Basic Skills, which had been improving over
time, increased more rapidly after the district introduced an accountability system that
imposed stakes on schools (probation, reassignment of staff) and on students (reten-
tion in grade). Scores on a low-stakes state test also improved during this time, but the
rate of improvement on the low-stakes test did not change when the accountability sys-
tem was implemented. Together, these results indicate that gains on accountability
tests tend not to generalize to other measures of the same constructs, although small
gains are sometimes observed on those other measures, as discussed in the previous
paragraph. Jacob also found that scores on tests of science and social studies, which
were not part of the accountability system, increased much less than scores on reading
and mathematics. A study by Deere and Strayer (2003) revealed a similar pattern in
Texas. The failure of gains to generalize to tests other than the one used for account-
ability purposes raises concerns about the validity of interpretations of scores on state
tests, discussed subsequently.
Effects on Equity
One of the rationales for many test-based accountability policies, and one that is
inherent in the title of the new federal legislation, is that such policies will increase
equity by ensuring that all students achieve at some predesignated level of perfor-
mance. At the same time, critics of these policies worry that certain groups of students
will suffer adverse consequences. Several aspects of equity have been addressed by
research, though in most cases the evidence base is limited.
State accountability systems such as the one implemented in Texas have achieved
recognition not only for raising average student achievement but also for reducing differences among racial/ethnic and socioeconomic groups. Few studies have empirically
examined effects of high-stakes testing on achievement among different groups of
students. The Chicago study by Roderick et al. (2002) revealed some differences
in achievement gains across student ability groups, with lower achieving students
appearing to gain more than higher achieving students in reading but less in math.
Furthermore, this study revealed that reading scores among the highest-performing
third-grade students actually declined after the accountability system was introduced
(relative to expected trends), especially in the low-performing schools that were most
strongly affected by the accountability policies. In the Dallas study, Ladd (1999)
reported positive effects of accountability for Hispanic and White students but no
effects for Black students; there was no clear explanation for this difference. Examin-
ing differences across racial/ethnic groups on NAEP, Carnoy and Loeb (2002) found
that gains attributable to accountability were greater for minority students than for
White students, a result that may be attributable in part to the lower average starting
scores for minority students. However, the reduction in gaps in performance between
minority and White students on NAEP is typically much smaller than the reduction
in gaps on the state tests (see Haney, 2000; Klein et al., 2000).
The source of these differences is not clear, but some of them may be attributable to
differences in how teachers target instruction across student groups. Students at the
cusp of passing state tests may receive targeted instruction to improve their perfor-
mance; many educators refer to these students as “bubble kids” (Taylor et al., 2003).
This practice is more common in states with high-stakes tests than in those without
(Pedulla et al., 2003). In addition, teachers and administrators in schools serving poor
and minority students may be especially likely to engage in practices designed to raise
test scores, including providing extensive test preparation and narrowing the curricu-
lum to focus on tested topics (McNeil, 2000; Shepard, 1991; Smith et al., 1997). With-
out better evidence on the source of score gains among different groups of students, it
is impossible to determine whether they represent improved quality of instruction;
however, the conflicting NAEP results suggest that at least some of the effects are due
to targeted test-preparation efforts.
Many testing critics claim that high-stakes testing has led to disproportionate num-
bers of minority, low-income, and special needs students being retained in grade, drop-
ping out of school, or being prevented from graduating, as well as to a loss of resources
at schools serving these students (Darling-Hammond, 2003; McNeil, 2000). Teachers
in the Pedulla et al. (2003) study reported increased retention and dropout rates as a
result of testing requirements, particularly those teaching in states with high-stakes
programs. Of course, it is impossible to determine whether these teachers’ impressions
are consistent with what is actually happening. Empirical evidence of the effects of
accountability on retention and dropout rates is, as with most of the topics discussed
so far, mixed. Carnoy and Loeb’s (2002) study of state accountability systems found
no relationship with retention or dropout rates except for Hispanic students, for whom
the results were inconclusive. Haney’s (2000) analysis of Texas data suggested an
increase in retention after Texas implemented its high-stakes testing program. Carnoy,
Loeb, and Smith (2001) reanalyzed Haney’s data and found that the timing of the
increase was not consistent with the implementation of the most recent high-stakes
testing program, but could be attributable to earlier waves of testing and accountabil-
ity. Studies conducted by Jacob (2001) and Lillard and DeCicca (2001) provide evi-
dence that increased high school dropout rates may be associated with high school exit
exam policies. In contrast to these findings, Ladd (1999) reported that high school
dropout rates in Dallas fell relative to those in other districts after Dallas introduced
its test-based accountability system, though it is difficult to determine with certainty
whether the relationship was causal.
Some students may be subjected to inappropriate placements or exclusion from test-
ing as schools strategize to maximize performance. Figlio and Getzler (2002) examined
data from six counties in Florida and found that administrators frequently reclassified
students as disabled to exempt them from testing. This practice was especially common
among low-performing schools (see also Darling-Hammond, 1991; Haney, 2000).
Jacob’s (2002) study of Chicago’s accountability system revealed a similar phenome-
non—the data showed a large increase in the proportions of students placed in special
education classes or excluded from testing after the district’s accountability system was
introduced, and retention in grade also increased, even in grades that were not explic-
itly subject to the accountability system’s test-based promotion and retention policies.
Similarly, Deere and Strayer’s (2003) study suggests that administrators in Texas en-
gaged in efforts to exempt certain students from high-stakes testing in a strategic man-
ner. These actions threaten the validity of inferences about score changes and have
increasingly been the subject of media attention (see, e.g., Schemo, 2003).
School-level labeling, rewards, and sanctions may affect some types of schools more
than others, which in turn may disproportionately affect students from certain lan-
guage, socioeconomic, or racial/ethnic groups. Parkes and Stevens (2003) illustrate the
problem in the context of Florida’s accountability system. In one recent year, approx-
imately 17% of schools with high percentages of English language learners (defined as
greater than 30% of the school’s student population) received an “A” rating in the state’s
accountability system, whereas 33% of schools with lower percentages of English
language learners (less than 30% of the school’s population) received this rating. As a
result, students in schools with high percentages of English language learners were
denied reward money that was distributed to other schools. The effects of such dis-
crepancies on the educational experiences of individual students are unknown, but the
rewards and penalties, as well as other consequences of being labeled as a success or fail-
ure, could in fact exert a significant influence on students. As illustrated by Kane and
Staiger (2002), owing to provisions that require each subgroup to meet a specific tar-
get, schools serving diverse populations of students are more likely than homogeneous
schools to be labeled as making insufficient progress purely as a result of statistical
error. Of course, the subgroup provisions may ultimately lead to positive effects as well,
and it is too early to say whether the effects on balance will be beneficial or harmful.
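The subgroup-target problem that Kane and Staiger describe can be illustrated with a minimal simulation. All numbers below (subgroup size, true proficiency rate, target) are hypothetical and are not drawn from their study; the point is simply that the more independent targets a school must meet, the more likely it is to miss at least one through sampling error alone.

```python
from math import ceil, comb

def p_subgroup_below(n, true_rate, target):
    """P(observed proficiency rate < target) for a subgroup of n students,
    each proficient independently with probability true_rate."""
    cutoff = ceil(target * n)  # fewest proficient students needed to meet the target
    return sum(comb(n, k) * true_rate**k * (1 - true_rate)**(n - k)
               for k in range(cutoff))

def p_school_flagged(n_subgroups, n=30, true_rate=0.60, target=0.55):
    """P(at least one subgroup misses the target) when all subgroups share
    the same true proficiency rate, so every miss is pure sampling error."""
    p_miss = p_subgroup_below(n, true_rate, target)
    return 1 - (1 - p_miss) ** n_subgroups

for k in (1, 3, 6):
    print(f"{k} subgroups: P(flagged) = {p_school_flagged(k):.2f}")
```

With these illustrative numbers, every subgroup genuinely exceeds the target, yet a school with six reported subgroups is far more likely than a homogeneous school to be flagged as making insufficient progress.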
Aside from the effects of specific accountability policies such as exit exams and
school reconstitution, it is important to examine whether the tests themselves are
characterized by bias against one group or another, which could negatively affect
equity. While a thorough exploration of fairness in testing is beyond the scope of this
chapter and has been addressed elsewhere (see, e.g., Cole & Moss, 1989), it is worth
stating that users of tests must be vigilant about possible threats to fairness in the tests
as well as in the broader accountability systems, and validation efforts must seek to
ensure that inferences made from test scores have comparable degrees of validity across groups. Among the many features of test-based accountability systems that may influence their effects, I focus on two: the effects of the high stakes that are attached to performance and the effects of the specific kinds of reporting strategies used in most systems.
Non-multiple-choice tests have not been shown to solve the problem of score infla-
tion. Koretz and Barron (1998) found extensive score inflation on Kentucky’s open-
response test items, with gains on the state test approximately four times the magnitude
of the state’s gains on NAEP. As discussed earlier in the section on curriculum nar-
rowing, it is likely that educators adapt their instruction to specific features of perfor-
mance tasks, which leads to improvements in performance on that type of task but not
on the broader construct the task was designed to measure.
There is much we do not know about score inflation, including what kinds of tests
are most susceptible and what kinds of students or schools are most likely to exhibit
inflated gains. We do know that high stakes and unrealistic goals are likely to exacer-
bate score inflation, as teachers take shortcuts when they believe goals are unattain-
able (Koretz, 2003a). There is insufficient information to determine the extent to
which the discrepancies in trends that have been reported threaten the validity of
inferences that users make. It is likely that most parents, policymakers, and other users
assume that performance on a given test generalizes beyond that specific measure, but
we do not know how broadly most users wish to generalize or how they would react
to test scores if they understood the problem of inflation. Research is needed to deter-
mine what kinds of inferences are being made on the basis of today’s high-stakes tests;
this is a critical part of any validation effort and will contribute to the development of
approaches for identifying score inflation (Koretz et al., 2001).
When a school raises the scores of students just below the cut score and moves them above it, that school will register a
much larger gain on the percentage-proficient metric than a school that moves stu-
dents within the not-proficient or proficient categories rather than across the bound-
ary. This type of reporting therefore creates incentives for teachers to focus on the
“bubble kids” described earlier and perhaps to shortchange students who are less likely
to move from one performance level to the next (Hamilton & Koretz, 2002).
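The incentive can be illustrated with a toy calculation (all scores and the cut score below are hypothetical): the same raw-score gain for every student moves the percentage-proficient metric dramatically in a school with many students near the cut and not at all in a school whose low scorers are far from it.

```python
# Hypothetical illustration: identical raw-score gains, very different
# changes on the percentage-proficient metric.
CUT = 50  # hypothetical proficiency cut score

def pct_proficient(scores):
    return 100 * sum(s >= CUT for s in scores) / len(scores)

bubble_school = [48, 49, 49, 60, 70]   # several students just below the cut
spread_school = [30, 35, 40, 60, 70]   # low scorers far from the cut

gain = 3  # identical raw-score improvement for every student
for before in (bubble_school, spread_school):
    after = [s + gain for s in before]
    print(pct_proficient(after) - pct_proficient(before))
```

Both schools improve every student by the same amount, but only the first school's gains cross the boundary and register on the percentage-proficient metric, which is precisely the incentive to focus on "bubble kids."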
Another potential drawback of criterion-referenced reporting is that users of test
results do not always know how to interpret performance-level information. Koretz
and Deibert (1993, 1996) found widespread misinterpretation of NAEP achievement
levels in print media articles. This included oversimplification of the meanings of the
levels and a failure to recognize that the underlying performance was continuous
rather than discrete. Hambleton and Slater (1995) documented additional problems
with the interpretation of the NAEP achievement levels among policymakers and
educators. A combination of norm-referenced and criterion-referenced reporting may
be most informative for many users, including the media, policymakers, and parents.
The popularity of state and country rankings on national and international assess-
ments such as NAEP and TIMSS, even when the individual-level scores on those tests
are reported according to performance levels, illustrates the desire users have for some
sort of normative information (Hamilton & Koretz, 2002). Newspaper reports of
school rankings within states provide another example and suggest that there is a
desire for normative information among consumers of test scores.
A related reporting issue is the method used for aggregating scores across students
to obtain a measure of performance (sometimes called an accountability index) for a
group (e.g., a school). The way that an accountability index is constructed—for exam-
ple, whether it involves a single-year average, a gain score from one year to the next,
or a score adjusted for student background characteristics—also affects the kinds of
inferences supported and the validity of decisions made based on the index. NCLB
relies on a single, statewide target for all schools to meet, rather than rewarding or
penalizing schools based on the magnitude of gains or losses. However, many states
and districts continue to operate dual systems that involve the latter approach, and
some are experimenting with more sophisticated methods of measuring school perfor-
mance, such as value-added modeling (Sanders & Horn, 1998; Webster & Mendro,
1997). Inherent in each of these different approaches is some notion of what schools
should be held accountable for—whether, for example, a school serving low-income
or low-performing students should be expected to achieve the same level of performance
as one serving higher income or higher performing students, or whether the former
school should only be held accountable for promoting a similar amount of progress.
The choice of approach can dramatically affect rankings of schools (Clotfelter &
Ladd, 1996), and therefore the decisions made on the basis of the accountability
index, so it is a matter that must be considered seriously by those who are responsible
for designing accountability systems. It is also important to communicate to users that
even the most statistically sophisticated approaches can lead to inappropriate infer-
ences. McCaffrey, Lockwood, Koretz, Hamilton, and Barney (in press), for example,
show that the value-added models that are currently used may confound teacher
effects with the effects of student background characteristics under certain condi-
tions, though users of results from applications of value-added modeling may be led
to believe that such models are able to isolate teacher effects from other influences
(see also Kupermintz, 2002).
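The sensitivity of rankings to the choice of index can be shown with a toy two-school example (the school names and scores are hypothetical): a status index based on current-year scores and a growth index based on year-over-year gains rank the same two schools in opposite order.

```python
# Hypothetical example: a status index and a growth index reverse the
# ranking of the same two schools, so the choice of accountability index
# drives the decision.
schools = {
    "School A": {"y1": 70.0, "y2": 72.0},  # high-scoring, slow growth
    "School B": {"y1": 50.0, "y2": 58.0},  # low-scoring, fast growth
}

status = {name: s["y2"] for name, s in schools.items()}
growth = {name: s["y2"] - s["y1"] for name, s in schools.items()}

def rank(index):
    """Return school names ordered from best to worst on the given index."""
    return sorted(index, key=index.get, reverse=True)

print(rank(status))  # School A first
print(rank(growth))  # School B first
```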
Similarly, decisions about whether to base judgments on a single-year, spring-to-
spring change; on a multiyear change; or on a fall-to-spring gain will affect inferences
about schools’ achievement growth. Fall-to-spring changes may be corrupted by prob-
lems related to scale conversion, practice effects, and administration dates (Linn,
Dunbar, Harnisch, & Hastings, 1982), whereas spring-to-spring changes will con-
found academic year learning with summer trends and may create a bias as a result of
socioeconomic differences in summer learning growth (Alexander, Entwisle, & Olson,
2001). Some researchers have suggested that any single-year measure of change is inher-
ently unstable owing to changes in the composition of student cohorts and other
factors contributing to error (Kane & Staiger, 2002; Linn & Haug, 2002) and have
recommended the use of multiyear changes to reduce this instability (but see Rogosa,
2003, for an alternative perspective on the problem of instability of change). There is
no clear consensus on which approach is most desirable; again, the benefits and limi-
tations of each should be carefully weighed in light of the goals of the specific system
being implemented.
Other features of test-based accountability systems, in addition to the two discussed
here, are likely to affect the validity and utility of information from those tests. For
example, NCLB requires states to use a single test form for students at different levels
of proficiency, a requirement that is likely to compromise the utility of information for
some students: Tests designed to discriminate well among average students will pro-
vide imprecise information at the extremes of the distribution. In particular, requiring
students with moderate or severe cognitive disabilities to take the same test that other
students take will result in a test that fails to measure the performance of these students
accurately (Koretz, 2003a). Although many of the provisions may be intended to pro-
mote effective instruction and more equitable results, the ways in which these provi-
sions affect scores must be taken into consideration when interpreting performance
results.
There are several documents that are widely accessible and that provide guidelines
for appropriate test use. Most important, perhaps, are the Standards for Educational
and Psychological Testing (AERA, APA, & NCME, 1999), which delineate specific
guidelines for ensuring test validity, reliability, and fairness. In addition, CRESST
has published standards for accountability systems (Baker, Linn, Herman, & Koretz,
2002) that provide suggestions for designing test-based accountability systems that
promote desired outcomes. The American Educational Research Association’s posi-
tion statement on high-stakes testing (AERA, 2000) also offers guidelines for appro-
priate test use. This section draws from these and other sources to suggest ways to
improve the use of large-scale assessments, with an emphasis on policy-related uses
of such tests. The discussion focuses on practices that have been demonstrated empir-
ically to improve the interpretation or use of test results or that have been repeatedly
advocated by testing and accountability experts. It is not a complete list, but provides
a broad set of suggestions that can serve as a starting point in the design of an effec-
tive test-based accountability system.
to students with disabilities and English language learners, particularly with respect to
the validity of scores that result from accommodated administrations (Abedi, 2003;
Koretz & Hamilton, 2000; Tindal et al., 1998).
Design tests to reduce score inflation. Test scores that are influenced by score infla-
tion are of limited utility for helping parents understand their children’s performance
and for contributing to effective decision making on the part of educators and poli-
cymakers. Those responsible for selecting or developing large-scale tests for account-
ability systems should design these systems in ways that minimize the likelihood of
score inflation. Approaches to reducing score inflation include (but are not limited to)
changing the test items each year and varying the formats of the items rather than
relying on a single format such as multiple choice. Teacher and principal professional
development, as discussed subsequently, may also help avoid the kind of inappropri-
ate narrowing that leads to score inflation.
Utilize technology to improve testing. Information technology is increasingly being
used to enhance testing programs at all levels of the education system. Although there
may be significant logistical barriers to overcome for states or districts to adopt large-
scale, computer-based testing programs, technology has the potential to improve test-
ing systems in a number of ways. Perhaps most important, computers can provide a
cost-effective way to assess skills and knowledge that are difficult or even impossible
to measure with paper and pencil; the architecture test described by Katz, Martinez,
Sheehan, and Tatsuoka (1993) and the examples provided by Bennett (1998) illustrate
this potential. To the extent that instruction involves technology, the use of technol-
ogy for testing may improve alignment between instruction and assessment, and there-
fore produce more valid information about student proficiency. For example, Russell
and Haney (1997) examined the writing performance of students who were accustomed
to composing essays on computers in their writing classes. These students received
higher scores on a writing test that used computers than on a paper-and-pencil writing
test; they wrote more and displayed better organizational skills. This study suggests that
the validity of the test for evaluating instructional effects was enhanced when com-
puters were used. Finally, computers may facilitate a shift away from the single-form
standardized test, which may be especially subject to score inflation—the use of com-
puterized adaptive testing (CAT) has particular appeal in this regard. Although certain
forms of CAT appear to be at odds with NCLB’s prohibition against out-of-level test-
ing (Olson, 2003), it is likely that a solution will be worked out to allow some form of
such testing in state assessment programs.
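The adaptive logic behind CAT can be conveyed with a rough sketch. The item bank, the examinee model, and the step-size ability update below are all simplifications invented for illustration (a real CAT would use an IRT-based estimator), but the selection principle is the same: after each response, administer the unused item best matched to the current ability estimate.

```python
# Minimal CAT sketch: pick the unused item whose difficulty is closest to
# the current ability estimate, then nudge the estimate up or down.
def run_cat(item_bank, answers, n_items=5, theta=0.0, step=1.0):
    used = set()
    for _ in range(n_items):
        item = min((i for i in item_bank if i not in used),
                   key=lambda d: abs(d - theta))
        used.add(item)
        theta += step if answers(item) else -step
        step *= 0.7  # shrink adjustments as information accumulates
    return theta

bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # hypothetical item difficulties
# Hypothetical examinee who answers items at or below difficulty 0.5 correctly.
estimate = run_cat(bank, answers=lambda d: d <= 0.5)
```

Because each examinee sees a different, ability-matched sequence of items, there is no fixed form for instruction to be narrowed toward, which is the source of CAT's appeal for reducing score inflation.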
To determine whether score inflation has occurred based on discrepancies in trends, a necessary first step is
to collect the information necessary to investigate the phenomenon. An audit test such
as NAEP can provide such information and can reveal score inflation in its early stages
so that additional steps can be taken to reduce it. NCLB requires samples of students
in each state to participate in NAEP “in order to help the U.S. Department of Educa-
tion verify the results of statewide assessments” (U.S. Department of Education, 2002).
Interpretation of discrepancies between NAEP and state test score trends is not
straightforward. A number of factors other than inflation can contribute to differences
in how the two tests function; these include differences in student motivation, defini-
tions of performance levels, content sampled by the tests, and test administration
rules (National Assessment Governing Board, 2002). In addition, there are a number
of ways to display the data from the two tests, including reporting scale score aver-
ages, tallying the percentages of students meeting each performance level, examining
the magnitudes of racial/ethnic gaps in performance, and displaying the entire distri-
bution of scores on each test, and these have implications for the interpretation of dis-
crepancies (National Assessment Governing Board, 2002). If an audit test such as
NAEP is to be useful, educators and policymakers will need clear guidance on how to
examine and interpret the results.
Present test results in a format useful for teachers. Interviews with teachers suggest that
clearer and more timely reporting of results would improve the utility of scores for
instructional purposes (see, e.g., Clarke et al., 2003). Reports should include informa-
tion about individual students’ performance as well as information about the accuracy
of scores. In addition, parents and educators need to be given clear, accessible guidance
to help them interpret and use information from large-scale tests, and this information
must be relevant to the ways in which test scores are used. A standard internal consis-
tency reliability coefficient, for example, may be interpreted by users as an indicator of
the trustworthiness and stability of scores, but it is not useful for understanding rates
of error in a testing program that uses performance levels or cut scores; instead, mis-
classification rates would be more informative. The utility of score reports would also
be enhanced by a reporting strategy that breaks down student performance according
to specific standards or topic areas, particularly those that are deemed to be most impor-
tant, rather than reporting only a global score (Commission on Instructionally Sup-
portive Assessment, 2001). Of course, because subscores are likely to have significant
measurement error associated with them, reporting information about reliability is
important at the subscore level as well as at the total score level.
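The contrast between a reliability coefficient and a misclassification rate can be sketched as follows. The cut score and standard error of measurement (sem) are hypothetical, and the sketch assumes normally distributed measurement error; the point is that the chance of landing on the wrong side of the cut depends on how close a student's true score is to it, which no single reliability coefficient conveys.

```python
from math import erf, sqrt

def p_misclassified(true_score, cut, sem):
    """P(the observed score falls on the wrong side of the cut), assuming
    normally distributed measurement error with standard deviation sem."""
    p_below = 0.5 * (1 + erf((cut - true_score) / (sem * sqrt(2))))
    return p_below if true_score >= cut else 1 - p_below

# A student just above the cut is misclassified far more often than one
# well above it, even though the test's reliability is identical for both.
print(round(p_misclassified(true_score=51, cut=50, sem=3), 2))  # near the cut
print(round(p_misclassified(true_score=65, cut=50, sem=3), 2))  # far from it
```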
Measure instructional practices and other responses of educators in addition to student
achievement. Clearly, the reactions of educators to the accountability system are criti-
cal determinants of whether the system raises student achievement and of the validity
of information the system produces. Systems should monitor practices, ideally through
a variety of formal and informal methods such as interviews or surveys administered to
samples of teachers across schools (Hamilton & Stecher, 2002). Efforts should be made
to triangulate data from multiple sources, such as by surveying or interviewing students
in addition to teachers (Lane, Parke, & Stone, 1998). Information on how teachers
change their practices, what topics do and do not get taught, and whether the incen-
tive system has affected morale would help administrators and policymakers gauge the
effectiveness of the system and modify it if it is not working as intended. Efforts to dis-
tinguish between the practices used by successful and unsuccessful schools, or to link
specific practices with score gains (see, e.g., Lane et al., in press; Stecher et al., 2000;
Stone & Lane, 2003), could be particularly valuable. Measurement of practice is not
straightforward; addressing the diversity of practice across classrooms, the possibility of
response bias, the likely political backlash, and the high costs involved are all impedi-
ments to implementing a large-scale system that measures practice (Koretz, 2003b).
Despite these problems, the feasibility of monitoring classroom practices and other
responses, even among a sample of teachers and schools, should be explored.
Conduct ongoing studies of alignment among standards, assessment, curriculum, and
instruction. Although most states claim to have tests that are aligned with standards,
a 2001 review of 10 states conducted by Achieve, Inc., indicated that only one,
Massachusetts, had attained an acceptable degree of alignment (Achieve, Inc., 2002).
In particular, Achieve’s studies of alignment have shown that while individual items
are well aligned to the specific standard to which they are mapped, many standards are
omitted from the test, especially the more challenging standards that focus on higher
level cognitive processes (Rothman, Slattery, Vranek, & Resnick, 2002). As the Achieve
authors point out, this pattern is likely to lead to instruction focused on lower level
skills as teachers become more familiar with the content of the tests. Efforts to ensure
alignment need to take into consideration the cognitive complexity and level of chal-
lenge associated with test items and standards, in addition to examining content match.
The rubric used by Achieve in the study just mentioned provides a good example of a
methodology for doing so. Examinations of alignment with curriculum and instruc-
tion must also address the cognitive complexity of tasks in addition to their content.
In addition to Achieve’s approach, methods for evaluating multiple dimensions of
alignment among tests, standards, curriculum, and instruction have been developed
by Porter (2002) and Webb (1997). Some of the difficulties inherent in alignment
studies are described by Bhola, Impara, and Buckendahl (2003) and by Koretz and
Hamilton (in press); these include differences in the specificity of standards across
states and the difficulty of determining what cognitive processes an item requires, par-
ticularly in light of the fact that the cognitive demands of an item may vary as a func-
tion of examinee proficiency. Despite these difficulties, the importance of alignment
to the proper functioning of accountability systems requires efforts to evaluate its
multiple dimensions.
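At their core, the alignment methods cited above represent a test and a set of standards as content matrices and quantify their overlap. The following is a rough sketch of the cell-by-cell comparison underlying Porter's (2002) alignment index; the matrices and counts are invented for illustration, and actual studies use much finer-grained topic-by-cognitive-demand matrices.

```python
# Sketch of a Porter-style alignment index. A test and a set of standards are
# each represented as a content matrix (rows = topics, columns = levels of
# cognitive demand), and the index is 1 minus half the sum of absolute
# differences between corresponding cell proportions (1.0 = perfect alignment).

def alignment_index(test_matrix, standards_matrix):
    def proportions(matrix):
        # Flatten the matrix and convert cell counts to proportions.
        cells = [cell for row in matrix for cell in row]
        total = sum(cells)
        return [c / total for c in cells]

    x = proportions(test_matrix)
    y = proportions(standards_matrix)
    return 1 - sum(abs(a - b) for a, b in zip(x, y)) / 2

# Hypothetical 2x2 matrices of item/standard counts.
test = [[10, 2],   # topic A: mostly low-demand items
        [8, 0]]    # topic B: higher-demand standards left untested
standards = [[5, 5],
             [5, 5]]

print(round(alignment_index(test, standards), 3))  # 0.6
```

The low index here reflects the pattern Achieve's studies identified: the test over-samples low-demand cells and omits the more challenging standards entirely.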
Provide professional development to help teachers implement the standards and engage
in appropriate test preparation. A study by Smith and colleagues (1997) revealed that
fewer than one fifth of responding teachers in Arizona believed that they had received
adequate professional development to respond to the state’s testing program. Teach-
ers typically receive little exposure to techniques of assessment development and inter-
pretation during their preservice training (Cizek, 2000; Ward, 1980), and until this
changes, in-service professional development will be vital. Few teachers have the
resources and skills necessary to produce classroom assessment systems that have
scoring and analysis tools designed to provide useful information; teachers often lack
the knowledge needed to develop effective assessments, and even if they possess this
knowledge, time constraints are likely to prevent them from producing systems that
provide the kind of summary information that is likely to be useful (Cizek, 2000;
Dwyer, 1998).
Utilize technology to improve the administration, scoring, and reporting of results from
classroom assessments. Information technology has been used in a wide range of class-
room assessment applications. Some of these are “benchmark” or “interim” assessment
systems designed to resemble the large-scale test and to provide ongoing feedback to
teachers on students’ success at meeting state standards. Other applications of tech-
nology provide rich, diagnostic information that is unavailable through traditional,
paper-and-pencil forms of testing (Koedinger, Anderson, Hadley, & Mark, 1997;
National Research Council, 2001; Wenger, 1987). Technology can provide oppor-
tunities for individualizing assessment and feedback and can maintain records of
students’ strategies as they progress through a problem or task, providing important
information that is unavailable when only final responses are observed. This type of
application serves a somewhat different role from the interim assessments that are
designed to look like the state test, and can facilitate the integration of instruction and
assessment (Bennett, 1998).
Provide professional development to help educators make better use of classroom-based
assessment. As with large-scale assessment, teachers and principals need to be offered
professional development designed to help them use and interpret results from assess-
ments. In addition, educators should be given training to help them improve their own
assessment-development skills (see, e.g., Commission on Instructionally Supportive
Assessment, 2001).
These guidelines are not intended to be exhaustive, but they do provide a summary
of advice that has been given by a wide variety of researchers and educators and can be
used as input to the development and evaluation of accountability systems that rely on
tests to shape policy and practice. It is important to recognize, however, that any orga-
nization or institution responsible for implementing a large-scale testing program faces
inevitable constraints in regard to time, personnel, and financial resources. These con-
straints require a prioritization of actions to promote high-quality testing in the con-
text of limited resources. Some of the suggestions just presented, such as incorporation
of information technology into testing programs, represent potentially promising
approaches that may improve test quality in the long run but may not be immediately
necessary. Others, such as the need for better validity investigations, deserve higher pri-
ority. Policymakers, administrators, and others who are charged with the task of devel-
oping or modifying a large-scale testing program need to weigh the options and their
likely effects on the quality of the program.
This list of guidelines illustrates the degree to which the effects of any large-scale
testing system depend on the details of implementation—what is
measured, how the measures are validated, what support is given to educators, and so
on. At the same time, testing policies should not be viewed as panaceas for education
reform. There are limitations inherent in this approach to improving education policy
and practice. I turn to some of these limitations in the final section of this chapter.
One limitation concerns the use of test scores for high-stakes decisions. Professional
organizations and measurement experts have repeatedly asserted the need to use
multiple measures and avoid reliance on a single test score for any high-stakes decision
(AERA, APA, & NCME, 1999; Baker et al., 2002; National Association for the
Education of Young Children, 1988). Despite this
consensus regarding the need for multiple measures, many state and district testing sys-
tems violate this admonition by using tests to deny diplomas, retain students in grade,
or reward or penalize educators. One of the problems is a lack of agreement on the
meaning of “multiple measures”—that is, if a math score and a reading score are com-
bined, or if students get five chances to pass an exit exam, does this constitute relying
on a single test score? Even within the professional education and measurement com-
munities, there is disagreement about how to define “multiple measure” or “single test
score” (Mehrens, 2000; Phillips, 2000).
A second reason for the limited utility of testing as a policy tool is the wide range of
other factors influencing the actions of participants in the system. The success of any
test-based accountability system will depend to a large degree on the capacity (includ-
ing human, financial, and material resources) of teachers, administrators, parents, stu-
dents, and policymakers to respond to the system in effective ways. Thus, factors such
as amount and quality of curriculum materials, availability and appropriateness of
training, and quality of collegial relationships within a school will affect what teachers
do, but teachers’ actions will also be influenced by their own prior knowledge and
belief systems about how students learn and what they should learn. Parents’ capacity
to use test-score data to make good decisions about homework help, tutoring, or school
choice will depend in part on how the system makes data accessible to them, but also
on parents’ own skill levels at interpreting data, their prior experiences with testing, and
their beliefs about what test scores mean. Accountability systems can be designed to
maximize the likelihood of desirable responses, but ultimately they cannot guarantee
any specific outcome.
Finally, there is simply too much that we currently do not know about how to
design testing policies that promote desirable outcomes and prevent undesirable ones.
The knowledge base on this topic grows every year and will surely increase as states,
districts, and schools learn from their attempts to meet the requirements of NCLB,
but the search for answers to questions about how to minimize score inflation and
promote effective instruction is likely to continue for many years. In the meantime,
it will be important to continue to gather evidence, both from large-scale studies and
from the individual experiences of teachers, administrators, and others affected by
test-based accountability, and to make that evidence available so that it can inform
efforts to improve accountability systems.
NOTES
1. Although some writers distinguish between the terms test and assessment, subsuming “test”
under the broader category of “assessment” (see, e.g., Cizek, 2000), these terms are used inter-
changeably throughout this chapter.
2. Although the topic of standards is closely intertwined with that of large-scale testing, the
former topic is not addressed here because it is the subject of another chapter in this volume.
3. Although this chapter focuses on testing in the United States, it is worth noting that high-
stakes testing, particularly in the form of exit exams, is prevalent in other nations (Phelps, 2000).
Some of the research discussed here was conducted in other countries, and most of the issues
presented in this chapter are relevant beyond the borders of the United States.
4. As several measurement experts have pointed out, criterion-referenced reporting does not
necessarily require the use of cut scores (Glaser, 1963; Glass, 1976; Linn, 1994), though the
need to identify students achieving minimum competency and, more recently, the emphasis
on standards-based reporting have led to a heavy reliance on cut scores in large-scale testing
programs.
5. Classroom-based assessments are sometimes referred to as formative assessments, but both
classroom-based and large-scale assessments may be used for both formative and summative
purposes, so I rely on the term classroom-based assessments throughout the chapter.
6. Although aligning instruction with standards is generally believed to be a desirable response,
the quality and breadth of the standards, the nature of the alignment (e.g., whether instruction
focuses only on a subset of standards), and the correspondence between the standards and users’
inferences influence whether aligning has positive or negative effects on validity.
7. Excessive coaching of this sort would generally be considered a negative response, but to the
extent that a moderate amount of such coaching is necessary to ensure the validity of the test—
for example, in situations in which some students may have had practice with a particular for-
mat and others have not—it may be considered a desirable activity if it does not unduly detract
from the overall quality and content of instruction.
REFERENCES
Abedi, J. (2003). Impact of student language background on content-based performance: Analyses
of extant data (CSE Tech. Rep. 603). Los Angeles: Center for Research on Evaluation, Stan-
dards, and Student Testing.
Achieve, Inc. (2000). Setting the record straight (Achieve Policy Brief No. 1). Washington, DC:
Author.
Achieve, Inc. (2002). Aiming higher: The next decade of education reform in Maryland. Wash-
ington, DC: Author.
Airasian, P. W. (1988). Measurement-driven instruction: A closer look. Educational Measure-
ment: Issues and Practice, 7(4), 6–11.
Alexander, K. L., Entwisle, D. R., & Olson, L. S. (2001). Schools, achievement, and inequal-
ity: A seasonal perspective. Educational Evaluation and Policy Analysis, 23, 171–191.
American Educational Research Association. (2000). Position statement of the American
Educational Research Association concerning high-stakes testing in pre-K–12 education.
Educational Researcher, 29(8), 24–25.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing. Washington, DC: Author.
Amrein-Beardsley, A. A., & Berliner, D. C. (2003). Re-analysis of NAEP math and reading
scores in states with and without high-stakes tests: Response to Rosenshine. Education Policy
Analysis Archives, 11(25). Retrieved from http://epaa.asu.edu/epaa/v11n25/
Baker, E. L., Linn, R. L., Herman, J. L., & Koretz, D. M. (2002, Winter). Standards for edu-
cational accountability systems. CRESST Line, pp. 1–4.
Baker, E. L., & O’Neil, H. (1995). Diversity, assessment, and equity in educational reform.
In M. Nettles & A. Nettles (Eds.), Equity and excellence in educational testing and assessment
(pp. 69–87). Boston: Kluwer.
Baker, E. L., O’Neil, H. F., & Linn, R. L. (1993). Policy and validity prospects for performance-
based assessment. American Psychologist, 48, 1210–1218.
Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C. C. (1991). Effects of frequent classroom
testing. Journal of Educational Research, 85, 89–99.
Baxter, G. P., & Glaser, R. (1998). Investigating the cognitive complexity of science assess-
ments. Educational Measurement: Issues and Practice, 17(3), 37–45.
Baxter, G. P., Shavelson, R. J., Goldman, S. R., & Pine, J. (1992). Evaluation of procedure-based
scoring for hands-on science achievement. Journal of Educational Measurement, 29, 1–17.
Bennett, R. E. (1998). Reinventing assessment: Speculations on the future of large-scale educa-
tional testing. Princeton, NJ: Educational Testing Service.
Bennett, S. N., Wragg, E. C., Carre, C. G., & Carter, D. G. S. (1992). A longitudinal study
of primary teachers’ perceived competence in, and concerns about, national curriculum
implementation. Research Papers in Education, 7, 53–78.
Betts, J. R., & Costrell, R. M. (2001). Incentives and equity under standards-based reform.
In D. Ravitch (Ed.), Brookings papers on education policy: 2001 (pp. 9–74). Washington,
DC: Brookings Institution Press.
Bhola, D. S., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with states’ content
standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29.
Bishop, J. (1998). Do curriculum-based external exit exam systems enhance student achievement?
(CPRE Research Rep. RR-40). Philadelphia: Consortium for Policy Research in Education.
Bishop, J. H., & Mane, F. (1999). The New York state reform strategy: The incentive effects of
minimum competency exams. Philadelphia: National Center on Education in Inner Cities.
Black, P. J. (1993). Formative and summative assessment by teachers. Studies in Science Edu-
cation, 21, 49–97.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education,
5, 7–73.
Borko, H., & Elliott, R. (1999). Hands-on pedagogy versus hands-off accountability. Phi
Delta Kappan, 80, 394–400.
Borko, H., Elliott, R., & Uchiyama, K. (1999, April). Professional development: A key to Ken-
tucky’s educational reform effort. Paper presented at the annual meeting of the American
Educational Research Association, Montreal.
Camilli, G., & Shepard, L. A. (1994). Methods for detecting biased test items. Thousand Oaks,
CA: Sage.
Cannell, J. J. (1988). Nationally normed elementary achievement testing in America’s public
schools: How all 50 states are above the national average. Educational Measurement: Issues
and Practice, 7(2), 5–9.
Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A
cross-state analysis. Educational Evaluation and Policy Analysis, 24, 305–331.
Carnoy, M., Loeb, S., & Smith, T. (2001). Do higher state test scores in Texas make for better
high school outcomes? (CPRE Research Rep. RR-047). Philadelphia: Consortium for Policy
Research in Education.
Cimbricz, S. (2003). State-mandated testing and teachers’ beliefs and practice. Education Pol-
icy Analysis Archives, 10(2). Retrieved from http://epaa.asu.edu/epaa/v10n2.html
Cizek, G. J. (2000). Pockets of resistance in the education revolution. Educational Measure-
ment: Issues and Practice, 19(1), 16–23, 33.
Clarke, M., Shore, A., Rhoades, K., Abrams, L., Miao, J., & Li, J. (2003). Perceived effects of
state-mandated testing programs on teaching and learning: Findings from interviews with edu-
cators in low-, medium-, and high-stakes states. Boston: National Board on Educational Test-
ing and Public Policy.
Clotfelter, C. T., & Ladd, H. F. (1996). Recognizing and rewarding success in public schools.
In H. F. Ladd (Ed.), Holding schools accountable: Performance-based reform in education
(pp. 23–63). Washington, DC: Brookings Institution Press.
Cohen, D. K. (1995). What is the system in systemic reform? Educational Researcher, 24(9),
11–17.
Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measure-
ment (3rd ed., pp. 201–219). New York: Macmillan.
Commission on Instructionally Supportive Assessment. (2001). Building tests to support
instruction and accountability (report prepared for American Association of School Admin-
istrators, National Association of Elementary School Principals, National Association of
Secondary School Principals, National Education Association, and National Middle School
Association). Washington, DC: Author.
Corbett, H. D., & Wilson, B. L. (1988). Raising the stakes in statewide mandatory minimum
competency testing. In W. L. Boyd & C. T. Kerchner (Eds.), The politics of excellence and
choice in education: The 1987 Politics of Education Association yearbook (pp. 27–39). New
York: Falmer Press.
Corbett, H. D., & Wilson, B. L. (1991). Two state minimum competency testing programs and
their effects on curriculum and instruction. In R. E. Stake (Ed.), Advances in program evalu-
ation: Vol. I. Effects of mandated assessment on teaching (pp. 7–40). Greenwich, CT: JAI Press.
Cronbach, L. J. (1988). Five perspectives on the validity argument. In H. Wainer & H. I.
Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Erlbaum.
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of
Educational Research, 58, 438–481.
Darling-Hammond, L. (1991). The implications of testing policy for quality and equality.
Phi Delta Kappan, 73, 220–225.
Darling-Hammond, L. (2003). Standards and assessments: Where we are and what we need.
Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentID=
11109
Darling-Hammond, L., & Wise, A. E. (1985). Beyond standardization: State standards and
school improvement. Elementary School Journal, 85, 315–336.
Deere, D., & Strayer, W. (2003). Competitive incentives: School accountability and student out-
comes in Texas. Unpublished working paper.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development
and use of performance assessments. Applied Measurement in Education, 4, 289–303.
Dwyer, C. A. (1998). Assessment and classroom learning: Theory and practice. Assessment in
Education, 5, 131–137.
Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data. Cambridge,
MA: MIT Press.
Figlio, D. N., & Getzler, L. S. (2002). Accountability, ability and disability: Gaming the sys-
tem (NBER Working Paper W9307). Cambridge, MA: National Bureau of Economic
Research.
Firestone, W., Mayrowetz, D., & Fairman, J. (1998). Performance-based assessment and
instructional change: The effects of testing in Maine and Maryland. Educational Evaluation
and Policy Analysis, 20, 95–113.
Frederiksen, N. (1994). The influence of minimum competency tests on teaching and learning.
Princeton, NJ: Educational Testing Service, Policy Information Center.
Gifford, B. R., & O’Connor, M. C. (1992). Changing assessments: Alternative views of apti-
tude, achievement, and instruction. Boston: Kluwer.
Gipps, C. (1999). Socio-cultural aspects of assessment. In A. Iran-Nejad & P. D. Pearson
(Eds.), Review of research in education (Vol. 24, pp. 355–392). Washington, DC: American
Educational Research Association.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some
questions. American Psychologist, 18, 519–521.
Glaser, R., Linn, R., & Bohrnstedt, G. (1997). Assessment in transition: Monitoring the nation’s
educational progress. New York: National Academy of Education.
Glass, G. V. (1976). Standards and criteria. Journal of Educational Measurement, 15, 237–261.
Goldhaber, D., & Hannaway, J. (2001, November). Accountability with a kicker: Observations
on the Florida A+ Accountability Plan. Paper presented at the annual meeting of the Associ-
ation of Public Policy and Management, Washington, DC.
Goslin, D. A. (1963). Teachers and testing. New York: Russell Sage Foundation.
Greene, J. P., Winters, M. A., & Forster, G. (2003). Testing high-stakes tests: Can we believe
the results of accountability tests? (Civic Rep. 33). New York: Manhattan Institute for Policy
Research.
Grissmer, D. W., & Flanagan, A. (1998). Exploring rapid score gains in Texas and North
Carolina. Washington, DC: National Education Goals Panel.
Grissmer, D. W., Flanagan, A., Kawata, J., & Williamson, S. (2000). Improving student
achievement: What state NAEP scores tell us (Publication MR-924-EDU). Santa Monica,
CA: RAND.
Haertel, E. H. (1999). Performance assessment and education reform. Phi Delta Kappan, 80,
662–666.
Hambleton, R. K., & Murphy, E. (1992). A psychometric perspective on authentic measure-
ment. Applied Measurement in Education, 5, 1–16.
Hambleton, R. K., & Slater, S. C. (1995). Do policymakers and educators understand NAEP
reports? Washington, DC: National Center for Education Statistics.
Hamilton, L. S. (1998). Gender differences on high school science achievement tests: Do for-
mat and content matter? Educational Evaluation and Policy Analysis, 20, 179–195.
Hamilton, L. S. (1999). Detecting gender-based differential item functioning on a constructed-
response science test. Applied Measurement in Education, 12, 211–235.
Hamilton, L. S., Klein, S. P., & Lorie, W. (2000). Using Web-based testing for large-scale assess-
ments. Santa Monica, CA: RAND.
Hamilton, L. S., & Koretz, D. M. (2002). Tests and their use in test-based accountability sys-
tems. In L. S. Hamilton, B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-based
accountability in education (pp. 13–49). Santa Monica, CA: RAND.
Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). Interview procedures for validat-
ing science assessments. Applied Measurement in Education, 10, 181–200.
Hamilton, L. S., & Stecher, B. M. (2002). Improving test-based accountability. In L. S.
Hamilton, B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-based accountability in
education (pp. 121–144). Santa Monica, CA: RAND.
Haney, W. (1981). Validity, Vaudeville, and values: A short history of social concerns over
standardized testing. American Psychologist, 36, 1021–1034.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis
Archives, 8(41). Retrieved from http://epaa.asu.edu/epaa/v8n41/
Hannaway, J., & McKay, S. (2001). Taking measure. Education Next, 1(3). Retrieved from
http://www.educationnext.org/20013/6hannaway.html
Jacob, B. A. (2001). Getting tough? The impact of high school graduation exams. Educational
Evaluation and Policy Analysis, 23, 99–122.
Jacob, B. A. (2002). Accountability, incentives, and behavior: The impact of high-stakes testing
in the Chicago Public Schools (NBER Working Paper 8968). Cambridge, MA: National
Bureau of Economic Research.
Jacob, B. A., & Levitt, S. D. (2002). Rotten apples: An investigation of the prevalence and pre-
dictors of teacher cheating (NBER Working Paper 9413). Cambridge, MA: National Bureau
of Economic Research.
Jaeger, R. M. (1982). The final hurdle: Minimum competency achievement testing. In G. R.
Austin & H. Garber (Eds.), The rise and fall of national test scores. New York: Academic Press.
Jones, G., Jones, B. D., Hardin, B., Chapman, L., Yarbrough, T., & Davis, M. (1999). The
impact of high-stakes testing on teachers and students in North Carolina. Phi Delta Kap-
pan, 81, 199–203.
Kane, T. J., & Staiger, D. O. (2002). Volatility in school test scores: Implications for test-based
accountability systems. In D. Ravitch (Ed.), Brookings papers on education policy (pp. 235–269).
Washington, DC: Brookings Institution Press.
Katz, I. R., Martinez, M. E., Sheehan, K. M., & Tatsuoka, K. K. (1993). Extending the rule
space model to a semantically-rich domain: Diagnostic assessment in architecture. Princeton,
NJ: Educational Testing Service.
Kelley, C., Odden, A., Milanowski, A., & Heneman, H. (2000). The motivational effects of
school-based performance awards (CPRE Policy Brief RB-29). Philadelphia: Consortium for
Policy Research in Education.
Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test scores
in Texas tell us? Santa Monica, CA: RAND.
Klein, S. P., Jovanovic, J., Stecher, B. M., McCaffrey, D., Shavelson, R. J., Haertel, E., Solano-
Flores, G., & Comfort, K. (1997). Gender and racial/ethnic differences on performance
assessments in science. Educational Evaluation and Policy Analysis, 19, 83–97.
Kluger, A. N., & DeNisi, A. (1996). The effect of feedback interventions on performance: A
historical review, a meta-analysis, and a preliminary feedback intervention theory. Psycho-
logical Bulletin, 119, 254–284.
Koedinger, K. R., Anderson, J. R., Hadley, W. H., & Mark, M. A. (1997). Intelligent tutor-
ing goes to school in the big city. International Journal of Artificial Intelligence in Education,
8, 30–43.
Koretz, D. (1988). Arriving at Lake Wobegon: Are standardized tests exaggerating achieve-
ment and distorting instruction? American Educator, 12(2), 8–15, 46–52.
Koretz, D. (1992). State and national assessment. In M. C. Alkin (Ed.), Encyclopedia of edu-
cational research (6th ed., pp. 1262–1267). Washington, DC: American Educational
Research Association.
Koretz, D. (1997). The assessment of students with disabilities in Kentucky (CSE Tech. Rep.
431). Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Koretz, D. (2003a, April). Attempting to discern the effects of the NCLB accountability provisions
on learning. Paper presented at the annual meeting of the American Educational Research
Association, Chicago.
Koretz, D. (2003b). Using multiple measures to address perverse incentives and score infla-
tion. Educational Measurement: Issues and Practice, 22(2), 18–26.
Koretz, D. M., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional
Results Information System (KIRIS). Santa Monica, CA: RAND.
Koretz, D., Barron, S., Mitchell, K., & Stecher, B. (1996). The perceived effects of the Kentucky
Instructional Results Information System (KIRIS) (Publication MR-792-PCT/FF). Santa
Monica, CA: RAND.
Koretz, D. M., & Deibert, E. (1993). Interpretations of National Assessment of Educational
Progress (NAEP) anchor points and achievement levels by the print media in 1991. Santa Mon-
ica, CA: RAND.
Koretz, D., & Deibert, E. (1996). Setting standards and interpreting achievement: A cau-
tionary tale from the National Assessment of Educational Progress. Educational Assessment,
3, 53–81.
Koretz, D. M., & Hamilton, L. S. (2000). Assessment of students with disabilities in Ken-
tucky: Inclusion, student performance, and validity. Educational Evaluation and Policy
Analysis, 22, 255–272.
Koretz, D. M., & Hamilton, L. S. (2003). Teachers’ responses to high-stakes testing and the
validity of gains: A pilot study (CSE Tech. Rep. 610). Los Angeles: Center for Research on
Evaluation, Standards, and Student Testing.
Koretz, D. M., & Hamilton, L. S. (in press). K–12 group testing. In R. Brennan (Ed.), Edu-
cational measurement (4th ed.). Westport, CT: American Council on Education/Praeger.
Koretz, D. M., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991, April). The effects of high-
stakes testing on achievement: Preliminary findings about generalization across tests. Paper pre-
sented at the annual meeting of the American Educational Research Association, Chicago.
Koretz, D. M., McCaffrey, D. F., & Hamilton, L. S. (2001). Toward a framework for vali-
dating gains under high-stakes conditions (CSE Tech. Rep. 551). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Koretz, D., Mitchell, K., Barron, S., & Keith, S. (1996). The perceived effects of the Maryland
School Performance Assessment Program (CSE Tech. Rep. 409). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). The Vermont portfolio assessment
program: Findings and implications. Educational Measurement: Issues and Practice, 13(3), 5–16.
Kupermintz, H. (2002). Teacher effects as a measure of teacher effectiveness: Construct validity
considerations in TVAAS (Tennessee Value-Added Assessment System) (CSE Tech. Rep. 563).
Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Ladd, H. F. (1999). The Dallas school accountability and incentive program: An evaluation
of its impacts on student outcomes. Economics of Education Review, 18, 1–16.
Lane, S., Parke, C. S., & Stone, C. A. (1998). A framework for evaluating the consequences
of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24–28.
Lane, S., Parke, C. S., & Stone, C. A. (in press). The impact of a state performance-based assess-
ment and accountability program on mathematics instruction and student learning: Evidence
from survey data and school performance. Educational Assessment.
Lane, S., Wang, N., & Magone, M. (1996). Gender related differential item functioning on
a middle-school mathematics performance assessment. Educational Measurement: Issues and
Practice, 15(4), 21–27, 31.
Lillard, D. R., & DeCicca, P. P. (2001). Higher standards, more dropouts? Evidence within
and across time. Economics of Education Review, 20, 459–473.
Linn, R. L. (1993). Educational assessment: Expanded expectations and challenges. Educa-
tional Evaluation and Policy Analysis, 15, 1–16.
Linn, R. L. (1994). Criterion-referenced measurement: A valuable perspective clouded by sur-
plus meaning. Educational Measurement: Issues and Practice, 13(4), 12–14.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4–16.
Linn, R. L. (2001). The design and evaluation of educational assessment and accountability systems
(CSE Tech. Rep. 539). Los Angeles: Center for Research on Evaluation, Standards, and Stu-
dent Testing.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance-based assessment:
Expectations and validation criteria. Educational Researcher, 20(8), 15–21.
Linn, R. L., Dunbar, S. B., Harnisch, D. L., & Hastings, C. N. (1982). The validity of the Title
I Evaluation and Reporting System. In E. R. House, S. Mathison, J. Pearsol, & H. Preskill
(Eds.), Evaluation studies review annual (Vol. 7, pp. 427–442). Beverly Hills, CA: Sage.
Linn, R. L., Graue, M. E., & Sanders, N. M. (1990). Comparing state and district test results
to national norms: The validity of claims that “everyone is above average.” Educational
Measurement: Issues and Practice, 9, 5–14.
Linn, R. L., & Haug, C. (2002). Stability of school-building accountability scores and gains.
Educational Evaluation and Policy Analysis, 24, 29–36.
Linn, R. L., Koretz, D. M., Baker, E. L., & Burstein, L. (1992). The validity and credibility of
the achievement levels for the 1990 National Assessment of Educational Progress in mathemat-
ics (CSE Tech. Rep. 330). Los Angeles: Center for the Study of Evaluation.
Mabry, L. (1999). Writing to the rubric: Lingering effects of traditional standardized testing
on direct writing assessment. Phi Delta Kappan, 80, 673–679.
Mabry, L., Poole, J., Redmond, L., & Schultz, A. (2003). Local impact of state testing in south-
west Washington. Education Policy Analysis Archives, 11(22). Retrieved from http://epaa.
asu.edu/epaa/v11n22/
Madaus, G. (1988). The influence of testing on the curriculum. In L. Tanner (Ed.), Critical
issues in curriculum: 87th yearbook of the NSSE, Part 1 (pp. 83–121). Chicago: University
of Chicago Press.
Madaus, G. (1993). A national testing system: Manna from above? A historical/technological
perspective. Educational Assessment, 1, 9–26.
Madaus, G. F., & O’Dwyer, L. M. (1999). A short history of performance assessment:
Lessons learned. Phi Delta Kappan, 80, 688–695.
Massell, D., Kirst, M., & Hoppe, M. (1997). Persistence and change: Standards-based systemic
reform in nine states (CPRE Policy Brief RB-21). Philadelphia: Consortium for Policy Research
in Education.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating
value-added models for teacher accountability. Santa Monica, CA: RAND.
McDonnell, L. M. (2002). Accountability as seen through a political lens. In L. S. Hamilton,
B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-based accountability in education
(pp. 101–120). Santa Monica, CA: RAND.
McDonnell, L. M., & Choisser, C. (1997). Testing and teaching: Local implementation of new
state assessments (CSE Tech. Rep. 442). Los Angeles: Center for Research on Evaluation,
Standards, and Student Testing.
McNeil, L. M. (2000). Creating new inequalities: Contradictions of reform. Phi Delta Kap-
pan, 81, 729–734.
Mehrens, W. A. (1998). Consequences of assessment: What is the evidence? Education Policy
Analysis Archives, 6(13). Retrieved from http://epaa.asu.edu/epaa/v6n13.html
Mehrens, W. A. (2000). Defending a state graduation test: GI Forum v. Texas Education
Agency. Measurement perspectives from an external evaluator. Applied Measurement in
Education, 13, 387–401.
Mehrens, W. A., & Kaminski, J. (1989). Methods for improving standardized test scores:
Fruitful, fruitless or fraudulent? Educational Measurement: Issues and Practice, 8(1), 14–22.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–104).
New York: Macmillan.
Meyer, L., Orlofsky, G. F., Skinner, R. A., & Spicer, S. (2002). The state of the states.
Education Week, 21(17), 68–92.
Miller, M. D. (1998). Teacher uses and perceptions of the impact of statewide performance-based
assessments. Washington, DC: Council of Chief State School Officers, State Education
Assessment Center.
Miller, M. D., & Seraphine, A. E. (1993). Can test scores remain authentic when teaching to
the test? Educational Assessment, 1, 119–129.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement
and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335–366). New
York: Macmillan.
National Assessment Governing Board. (2002). Using the National Assessment of Educational
Progress to confirm state test results. Washington, DC: Author.
National Association for the Education of Young Children. (1988). NAEYC position
statement on standardized testing of young children 3 through 8 years of age, adopted
November 1987. Young Children, 43(3), 42–47.
National Center for Education Statistics. (1996). Technical issues in large-scale performance
assessment. Washington, DC: U.S. Department of Education.
National Commission on Excellence in Education. (1983). A nation at risk. Washington, DC:
U.S. Department of Education.
National Council on Education Standards and Testing. (1992). Raising standards for
American education. Washington, DC: Author.
National Research Council. (1999a). Grading the nation’s report card: Evaluating NAEP and
transforming the assessment of educational progress. Washington, DC: National Academy Press.
National Research Council. (1999b). High stakes: Testing for tracking, promotion, and
graduation. Washington, DC: National Academy Press.
National Research Council. (2001). Knowing what students know: The science and design of
educational assessment. Washington, DC: National Academy Press.
O’Day, J. (2002). Complexity, accountability, and school improvement. Harvard Educational
Review, 72(3), 293–329.
O’Day, J., Goertz, M. E., & Floden, R. E. (1995). Building capacity for education reform
(CPRE Policy Brief RB-18). Philadelphia: Consortium for Policy Research in Education.
Olson, L. (2003). Legal twists, digital turns. Education Week, 22(35), 11–14, 16.
Parkes, J., & Stevens, J. J. (2003). Legal issues in school accountability systems. Applied
Measurement in Education, 16, 141–158.
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J. (2003).
Perceived effects of state-mandated testing programs on teaching and learning: Findings from a
national survey of teachers. Boston: National Board on Educational Testing and Public Policy.
Phelps, R. P. (2000). Trends in large-scale testing outside the United States. Educational
Measurement: Issues and Practice, 19(1), 11–21.
Phillips, M., & Chin, T. (2001). Comment on Betts & Costrell. In D. Ravitch (Ed.), Brookings
papers on education policy: 2001 (pp. 61–66). Washington, DC: Brookings Institution Press.
Phillips, S. E. (2000). GI Forum v. Texas Education Agency: Psychometric evidence. Applied
Measurement in Education, 13, 343–385.
Pipho, C. (1985). Tracking the reforms, Part 5: Testing—Can it measure the success of the
reform movement? Education Week, 4(35), 19.
Popham, W. J. (1987). The merits of measurement-driven instruction. Phi Delta Kappan, 68,
679–682.
Popham, W. J., Cruse, K. L., Rankin, S. C., Sandifer, P. D., & Williams, P. L. (1985).
Measurement-driven instruction: It's on the road. Phi Delta Kappan, 66, 628–634.
Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice.
Educational Researcher, 31(7), 3–14.
Public Agenda. (2000). Survey finds little sign of backlash against academic standards or
standardized tests. Retrieved from http://www.publicagenda.org/issues/pcc_detail
Quality counts. (2002). Education Week, 21(17). Retrieved from http://www.edweek.com/
sreports/qc02/
Reckase, M. D. (1997, March). Consequential validity from the test developers’ perspective. Paper
presented at the annual meeting of the National Council on Measurement in Education,
Chicago.
Resnick, D. P. (1982). History of educational testing. In A. K. Wigdor & W. R. Garner
(Eds.), Ability testing: Uses, consequences, and controversies, Part II (pp. 173–194).
Washington, DC: National Academy Press.
Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New tools for
educational reform. In B. R. Gifford & M. C. O’Connor (Eds.), Changing assessment:
Alternative views of aptitude, achievement, and instruction (pp. 37–75). Boston: Kluwer.
Roderick, M., & Engel, M. (2001). The grasshopper and the ant: Motivational responses of
low-achieving students to high-stakes testing. Educational Evaluation and Policy Analysis,
23, 197–227.
Roderick, M., Jacob, B. A., & Bryk, A. S. (2002). The impact of high-stakes testing in Chicago
on student achievement in promotional gate grades. Educational Evaluation and Policy
Analysis, 24, 333–357.
Roeber, E. (1988, February). A history of large-scale testing activities at the state level. Paper
presented at the Indiana Governor's Symposium on ISTEP, Madison, IN.
Rogosa, D. (2003). Confusions about consistency in improvement. Retrieved from
http://www-stat.stanford.edu/~rag/api/consist.pdf
Romberg, T. A., Zarinia, E. A., & Williams, S. R. (1989). The influence of mandated testing
on mathematics instruction: Grade 8 teachers’ perceptions. Madison: National Center for
Research in Mathematical Science Education, University of Wisconsin–Madison.
Rothman, R., Slattery, J. B., Vranek, J. L., & Resnick, L. B. (2002). Benchmarking and
alignment of standards and testing (CSE Tech. Rep. 566). Los Angeles: Center for Research
on Evaluation, Standards, and Student Testing.
Russell, M., & Haney, W. (1997). Testing writing on computers: An experiment comparing
student performance on tests conducted via computer and via paper-and-pencil. Education
Policy Analysis Archives, 5(3). Retrieved from http://epaa.asu.edu/epaa/v5n3.html
Sadler, R. (1989). Formative assessment and the design of instructional assessments.
Instructional Science, 18, 119–144.
Sanders, W., & Horn, S. (1998). Research findings from the Tennessee Value-Added
Assessment System (TVAAS) database: Implications for educational evaluation and
research. Journal of Personnel Evaluation in Education, 12, 247–256.
Schemo, D. J. (2003, July 11). Questions on data cloud luster of Houston schools. New York
Times, p. A1.
Shavelson, R. J., Baxter, G. P., & Pine, J. (1992). Performance assessments: Political rhetoric
and measurement reality. Educational Researcher, 21(4), 22–27.
Shepard, L. (1991). Will national tests improve student learning? (CSE Tech. Rep. 342). Los
Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Shepard, L. A., & Dougherty, K. C. (1991, April). Effects of high-stakes testing on instruction.
Paper presented at the annual meeting of the American Educational Research Association
and National Council on Measurement in Education, Chicago.
Smith, M. L. (1994). Old and new beliefs about measurement-driven instruction: “The more
things change, the more they stay the same” (CSE Tech. Rep. 373). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Smith, M. L., Edelsky, C., Draper, K., Rottenberg, C., & Cherland, M. (1991). The role of
testing in elementary schools (CSE Tech. Rep. 321). Los Angeles: Center for Research on
Evaluation, Standards, and Student Testing.
Smith, M. L., Noble, A., Heinecke, W., Seck, M., Parish, C., Cabay, M., Junker, S., Haag, S.,
Tayler, K., Safran, Y., Penley, Y., & Bradshaw, A. (1997). Reforming schools by reforming
assessment: Consequences of the Arizona Student Assessment Program (ASAP): Equity and
teacher capacity building (CSE Tech. Rep. 425). Los Angeles: Center for Research on
Evaluation, Standards, and Student Testing.
Smith, M. L., & Rottenberg, C. (1991). Unintended consequences of external testing in
elementary schools. Educational Measurement: Issues and Practice, 10(4), 7–11.
Smith, M. S., & O’Day, J. (1990). Systemic school reform. In Politics of Education
Association yearbook 1990 (pp. 233–267). London: Taylor & Francis.
Smith, M. S., O’Day, J., & Cohen, D. K. (1990). National curriculum American style: Can
it be done? What might it look like? American Educator, 14(4), 10–17, 40–47.
Solano-Flores, G., & Nelson-Barber, S. (2001). On the cultural validity of science
assessments. Journal of Research in Science Teaching, 38, 553–573.
Spillane, J. P., & Thompson, C. L. (1997). Reconstructing conceptions of local capacity: The
local education agency’s capacity for ambitious instructional reform. Educational
Evaluation and Policy Analysis, 19, 185–203.
Stake, R. (1999). The goods on American education. Phi Delta Kappan, 80, 668–670.
Stecher, B. M. (2002). Consequences of large-scale, high-stakes testing on school and
classroom practices. In L. S. Hamilton, B. M. Stecher, & S. P. Klein (Eds.), Making sense of
test-based accountability in education (pp. 79–100). Santa Monica, CA: RAND.
Stecher, B. M., & Barron, S. I. (1999). Quadrennial milepost accountability testing in Kentucky
(CSE Tech. Rep. 505). Los Angeles: Center for Research on Evaluation, Standards, and
Student Testing.
Stecher, B. M., Barron, S. I., Chun, T., & Ross, K. (2000). The effects of the Washington state
education reform on schools and classrooms (CSE Tech. Rep. 525). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Stecher, B. M., Barron, S. I., Kaganoff, T., & Goodwin, J. (1998). The effects of standards-
based assessment on classroom practices: Results of the 1996–97 RAND survey of Kentucky
teachers of mathematics and writing (CSE Tech. Rep. 482). Los Angeles: Center for Research
on Evaluation, Standards, and Student Testing.
Stecher, B. M., & Chun, T. (2001). School and classroom practices during two years of
education reform in Washington state (CSE Tech. Rep. 550). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Stecher, B. M., Hamilton, L. S., & Gonzalez, G. (2003). Working smarter to leave no child
behind. Santa Monica, CA: RAND.
Stecher, B. M., & Mitchell, K. J. (1995). Portfolio driven reform: Vermont teachers’
understanding of mathematical problem solving (CSE Tech. Rep. 400). Los Angeles: Center
for Research on Evaluation, Standards, and Student Testing.
Stone, C. A., & Lane, S. (2003). Consequences of a state accountability program: Examining
relationships between school performance gains and teacher, student, and school variables.
Applied Measurement in Education, 16, 1–26.
Stufflebeam, D. L., Jaeger, R. M., & Scriven, M. (1991). Summative evaluation of the
National Assessment Governing Board’s inaugural 1990–91 effort to set achievement levels on
the National Assessment of Educational Progress. Kalamazoo: Western Michigan University
Evaluation Center.
Swanson, C., & Stevenson, D. L. (2002). Standards-based reform in practice: Evidence on
state policy and classroom instruction from the NAEP state assessments. Educational
Evaluation and Policy Analysis, 24, 1–27.
Taylor, G., Shepard, L., Kinner, F., & Rosenthal, J. (2003). A survey of teachers’ perspectives
on high-stakes testing in Colorado: What gets taught, what gets lost (CSE Tech. Rep. 588). Los
Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Tepper, R. L. (2002). The influence of high-stakes testing on instructional practice in Chicago.
Unpublished doctoral dissertation, Harris Graduate School of Public Policy, University of
Chicago.
Texas Education Agency. (2000, May 17). Texas TAAS passing rates hit seven-year high; four
out of every five students pass exam (press release). Austin, TX: Author.
Tindal, G., Heath, B., Hollenbeck, K., Almond, P., & Harniss, M. (1998). Accommodating
students with disabilities on large-scale tests: An experimental study. Exceptional Children,
64, 439–450.
Tittle, C. (1989). Validity: Whose construction is it in the teaching and learning context?
Educational Measurement: Issues and Practice, 8, 5–13.
U.S. Congress, Office of Technology Assessment. (1992). Testing in America’s schools: Asking
the right questions (Publication OTA-SET-519). Washington, DC: U.S. Government
Printing Office.
U.S. Department of Education. (2002). No Child Left Behind fact sheet. Washington, DC:
U.S. Government Printing Office.
U.S. General Accounting Office. (2003). Title I: Characteristics of tests will influence expenses;
information sharing may help states realize efficiencies (Publication GAO-03-389).
Washington, DC: Author.
Ward, J. (1980). Teachers and testing: A survey of knowledge and attitudes. In L. M. Rudner
(Ed.), Testing in our schools (pp. 48–72). Washington, DC: National Institute of Education.
Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and
science education (Research Monograph 8). Washington, DC: Council of Chief State
School Officers.
Webster, W., & Mendro, R. (1997). The Dallas Value-Added Accountability System. In
J. Millman (Ed.), Grading teachers, grading schools: Is student achievement a valid evaluation
measure? (pp. 81–99). Thousand Oaks, CA: Corwin Press.
Wenger, E. (1987). Artificial intelligence and tutoring systems. Los Altos, CA: Morgan Kaufmann.
Wiggins, G. (1992). Creating tests worth taking. Educational Leadership, 49, 26–33.
Winfield, L. F. (1990). School competency testing reforms and student achievement:
Exploring a national perspective. Educational Evaluation and Policy Analysis, 12, 157–173.
Wolf, S. A., Borko, H., McIver, M. C., & Elliott, R. (1999). “No excuses”: School reform efforts
in exemplary schools of Kentucky (CSE Tech. Rep. 514). Los Angeles: Center for Research
on Evaluation, Standards, and Student Testing.
Wolf, S. A., & McIver, M. C. (1999). When process becomes policy. Phi Delta Kappan, 80,
401–406.
Zernike, K. (2001, May 4). Suburban mothers succeed in their boycott of an 8th-grade test.
New York Times, p. A19.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP
history assessment. Journal of Educational Measurement, 26, 55–66.