Review of Research in Education
http://rre.aera.net

Chapter 2: Assessment as a Policy Tool
Laura Hamilton

Review of Research in Education, 2003, 27: 25
DOI: 10.3102/0091732X027001025
The online version of this article can be found at: http://rre.sagepub.com/content/27/1/25

Published on behalf of the American Educational Research Association by SAGE Publications (http://www.sagepublications.com)

Chapter 2

Assessment as a Policy Tool

LAURA HAMILTON
RAND Corporation

This chapter focuses on the use of tests or assessments1 as instruments for promot-
ing educational change. Recent large-scale education reform efforts, along with
state and federal legislation, illustrate the growing importance that policymakers
and education reformers are attaching to accountability for student performance. The
emphasis on testing and accountability may be attributable to a number of factors,
including the relatively low cost of these reforms; the fact that they can be externally
mandated and, to some extent, controlled; and the relative speed with which these
policies can be implemented (Linn, 2000). Despite their popularity, in most cases these
reforms are not guided by a careful investigation of the probable consequences of using
tests as accountability tools.
In this chapter, I provide some brief background on recent and current uses of large-
scale achievement tests and discuss what is currently known about the effects of large-
scale testing on several outcomes. I provide some recommendations to foster more
effective use of tests and discuss some of the limitations of relying heavily on tests as
policy tools. The focus of the chapter is on tests used at the K–12 level in the United
States, but much of the discussion is applicable to postsecondary testing and to test use
in other countries. Also, the chapter focuses primarily on large-scale, externally man-
dated tests, such as those administered as part of district and state testing programs, but
it also addresses classroom-based assessments as a potentially important component of
any testing policy initiative. Finally, the chapter addresses tests that are used to support
accountability systems and to provide instructional feedback to teachers; it does not
address tests intended primarily for selecting students into programs or institutions or
for diagnosing individual learning needs (e.g., college admissions tests or IQ tests; for
a recent review of the history and uses of these types of tests, see Gipps, 1999).
When examining the use of tests, it is critical to take into consideration the purpose
of the testing system. The National Research Council Committee on the Foundations
of Assessment (2001) noted three broad purposes for large-scale achievement tests:
(a) assessment to assist learning, also called formative assessment; (b) assessment of indi-
vidual achievement, also called summative assessment; and (c) assessment to evaluate
the quality and effectiveness of educational programs. Most uses of tests fall into one
of these categories, and most emphasize the use of tests as tools for measuring a con-
struct such as mathematics achievement. Increasingly, however, tests are being adopted
as components of broader reform efforts and are being designed not only to produce
information but to create improvements in the educational system by signaling what
students are expected to learn and by creating motivational conditions that promote cer-
tain outcomes. Even when tests are not explicitly intended to serve as levers to change
instruction, their use may lead to consequences that are serious and unanticipated by
their developers or users. The importance of examining consequences of test use has
been recognized by professional organizations that publish guidelines for appropriate
use of tests. Most notably, the Standards for Educational and Psychological Testing
(American Educational Research Association [AERA], American Psychological Asso-
ciation [APA], & National Council on Measurement in Education [NCME], 1999)
specify that consequences should be examined as part of any validity investigation (see
also Cronbach, 1988; Messick, 1989). An examination of consequences is especially
important for testing programs that are intended to serve as policy tools.
Before discussing what is known about the effects of testing, it is important to keep
in mind what a test is, and what kinds of inferences it can support. By their nature,
tests cannot tell us everything we would like to know about a student’s competencies.
Performance on a test provides information about a sample of examinee behavior under
certain, very specific conditions. For certain kinds of tests, such as essay tests or port-
folios, test performance may represent a direct sample of the behaviors in which we are
interested, whereas for the majority of large-scale testing programs, the sample of
behavior that the test provides is an indicator of a much broader set of proficiencies,
most of which cannot be tested directly (Haertel, 1999). From the sample, users make
inferences about how scores generalize to the broader domain in which they are inter-
ested, and the degree to which these inferences are warranted is in part a function of
the adequacy with which the test samples the domain. When users of test results inter-
pret scores on a standardized, multiple-choice mathematics achievement test, for exam-
ple, it is likely that their inferences about what these scores mean extend beyond the
ability to answer certain types of closed-ended questions; in other words, the users’
target of inference is broader than the specific behaviors elicited by the test (Koretz,
McCaffrey, & Hamilton, 2001). To the extent that test performance serves as an accu-
rate indicator of the broader skills in which users are interested, users’ inferences will
be appropriate. However, if test performance fails to provide a reasonable indication of
students’ mastery of the knowledge and skills in which users are interested, the validity
of users’ inferences will be compromised. Some of the unintended negative conse-
quences of testing discussed in this chapter contribute to a mismatch between the infer-
ences warranted on the basis of test scores and the inferences users make about them.

BRIEF HISTORY OF THE USE OF LARGE-SCALE TESTS AS POLICY TOOLS

Large-scale, standardized testing as we tend to think of it originated in the United
States in the mid-19th century. In the 1840s, an examination designed to monitor
schools’ effectiveness was implemented in Boston (Resnick, 1982), and this test had
many of the features associated with today’s large-scale tests. In particular, it was
intended to provide efficient measurement for large numbers of students and to facil-
itate comparisons across classrooms and schools. Testing took on a new role during the
years of World War I, that of selecting individuals into programs or institutions. The
first large-scale group intelligence test, the Army Alpha, was published in 1917, and
the first standardized achievement test battery, the Stanford Achievement Tests, was
published in 1923 (Resnick, 1982). The achievement tests that followed in the sub-
sequent three decades were primarily intended to assess the competencies of individual
students and evaluate the effectiveness of specific curriculum programs (Goslin, 1963).
Over the following years, the use of testing expanded dramatically, in terms of both the
numbers of students affected and the purposes for which tests were used (Haney, 1981).
The creation of the National Assessment of Educational Progress (NAEP) and the
enactment of the original Title I legislation led to the first formal uses of tests as mon-
itoring devices (Hamilton & Koretz, 2002; Koretz, 1992) and may be considered the
precursors to today’s widespread use of tests as tools for holding educators accountable
for student performance (Roeber, 1988).
High-stakes uses of tests for individuals for purposes other than selection were rare
after World War II but began to increase in the 1970s, when minimum competency
testing became widespread (Jaeger, 1982). The minimum competency testing move-
ment emphasized the need to ensure that students demonstrated a grasp of basic skills
and, in many instances, led states or districts to prohibit failing students from gradu-
ating or from being promoted to the next grade. Thus, this movement represents
the first formal use of tests as tools to hold students and teachers accountable for
performance in recent decades (Hamilton & Koretz, 2002). In addition, minimum-
competency tests were intended to serve as signals to students and teachers of what
should be taught and learned, and this movement marked a shift toward what Popham
and others (Popham, 1987; Popham, Cruse, Rankin, Sandifer, & Williams, 1985)
called measurement-driven instruction, reflecting a belief that instruction should be
shaped by tests.
This emphasis on using tests to change instruction continued into the 1980s, a
decade characterized by deep concern over what was perceived to be poor performance
on the part of American students. These concerns are expressed vividly in A Nation at
Risk (National Commission on Excellence in Education, 1983), which led to a nation-
wide reform effort that included an increased reliance on testing (Pipho, 1985) as well
as an expansion of the kinds of stakes attached to scores, including school-level incen-
tives such as financial rewards or interventions (Koretz, 1992). The presumed positive
effects of testing on both student and teacher motivation have represented a primary
rationale for expanding the stakes attached to scores (National Council on Education
Standards and Testing, 1992). In addition, the focus on minimum competency shifted
to a call for high, rigorous standards for all students, and for tests that would be aligned
with those standards and would encourage teachers to teach to them (National Coun-
cil on Education Standards and Testing, 1992; Resnick & Resnick, 1992; Smith &
O’Day, 1990).2 Clear links among testing, standards, and curriculum, in addition to
formal stakes, were believed to enhance motivation (Smith, O’Day, & Cohen, 1990).
At the same time, testing has continued to be used as a mechanism to document the
performance of American students and thereby justify the need for reforms (Linn, 1993;
U.S. Congress, Office of Technology Assessment, 1992). All of these trends reflect
a gradual but steady shift from the use of tests as measurement instruments designed
to produce information to a reliance on tests to influence policy and instruction. This
dual use of tests has continued to the present day.

CURRENT POLICY CONTEXT


Much of the public’s and the media’s attention with respect to testing is focused
on the recently enacted federal education legislation, the No Child Left Behind Act
of 2001 (Public Law 107-110), or NCLB. The law continues a trend set by earlier
federal and state policies, many of which emphasized standards, assessments, and con-
sequences tied to performance on the assessments. In the 2001–2002 school year,
49 states and the District of Columbia were implementing statewide testing programs,
and 17 of these states were using results as a basis for school closure or reconstitution
(Meyer, Orlofsky, Skinner, & Spicer, 2002). Most of these programs were in place
before the enactment of NCLB, but NCLB ensures that these state programs will
continue and, in some cases, expand. In addition to the NCLB requirements, which
impose stakes primarily on schools and districts, a number of states’ testing programs
impose stakes on individual students through high school exit exams or promotional-
gates policies.
The current policy context is characterized by the use of tests in what may be called
a test-based accountability system. These systems involve four major elements: goals,
expressed in the form of standards; measures of performance (i.e., tests); targets for per-
formance (measures of desired status or amount of change; the adequate yearly progress
provision in NCLB is an example); and consequences attached to schools’ success or fail-
ure at meeting the targets (Hamilton & Koretz, 2002; see also Stecher, Hamilton, &
Gonzalez, 2003).3 The National Research Council (1999b) distinguishes between low-
stakes and high-stakes tests; the former have no “significant, tangible, or direct conse-
quences attached to the results, with information alone assumed to be a significant
incentive for people to act” (p. 35). High-stakes tests, in contrast, reflect the belief that
“the promise of rewards or the threat of sanctions is needed to ensure change” (p. 35).
However, some studies suggest that educators often behave as if tests have high stakes
associated with them, even when the results are used only for information purposes
(Corbett & Wilson, 1991; Madaus, 1988; McDonnell & Choisser, 1997), and that
teachers often work to avoid the stigma associated with a low rating even when they
are not directly threatened by sanctions (Goldhaber & Hannaway, 2001), thereby blur-
ring the distinction between high- and low-stakes testing programs. In any case, the
frequency and severity of stakes are growing for both students and educators (National
Research Council, 1999b), and the use of tests in accountability systems creates a con-
text in which educators and others are likely to pay attention to scores and to act in
ways intended to maximize test performance. Indeed, most advocates of test-based
accountability hope that tests will influence the behaviors of teachers, principals, and
students in positive ways (Achieve, Inc., 2000).
A common feature among accountability tests is the set of purposes they are designed
to serve—tests used in accountability systems are intended to influence instruction, to
focus public attention on student achievement, and to motivate both educators and
students to work harder (Haertel, 1999; Linn, 2000). Advocates of test-based account-
ability also frequently assert that the tests used in accountability systems can serve
a formative purpose, providing feedback that helps teachers identify students who are
having trouble (or, in the language that is currently popular among policymakers,
students who are being “left behind”). In addition, large-scale tests may be useful to
teachers for providing diagnostic information about students’ relative strengths and
weaknesses, for example, whether a student’s math skills are weak or strong relative to
his or her reading skills. This information can help teachers adjust their instructional
focus to meet individual students’ needs. If large-scale testing programs are to serve
these formative purposes well, they must meet a number of conditions, including clear
and timely reporting of results (Stecher et al., 2003). Some of the research presented
later examines whether these conditions are met in practice.
Large-scale tests may also serve the purpose of surveying the curricula being imple-
mented in schools and classrooms. At best, any externally mandated test will be only
partially aligned with the specific curriculum and instructional practices of a school or
classroom, but tests can provide information on the extent to which schools’ curric-
ula cover important topics or skills and, as discussed later in this chapter, can actually
shape curricula through the signals and incentives built into the testing and account-
ability system.
An important feature that characterizes current test-based accountability programs
is the way in which scores are reported. The specific form of reporting has implica-
tions for the utility of tests, particularly for instructional purposes, and tests should be
designed from the outset with a reporting strategy in mind (National Research Coun-
cil, 2001). A key distinction is that between norm-referenced reporting, which involves
describing a student’s performance relative to that of his or her peers, and criterion-
referenced reporting, in which a student’s performance is described according to some
fixed level of performance. Although a single test may support both types of reporting,
most tests are designed in ways that make one or the other approach preferable.
Criterion-referenced reporting gained widespread acceptance during the minimum
competency testing movement, but until recently many large-scale testing programs
continued to rely primarily on norm-referenced scores such as percentile ranks or
grade-equivalent scores. The increased use of tests as accountability tools, along with
the desires of some educators and policymakers to avoid relying on ranking students
against one another, has led to growth in the use of criterion-referenced reporting
(Hamilton & Koretz, 2002). The utility of criterion-referenced reporting was described
by Glaser (1963), who noted the value of this form of reporting for providing infor-
mation about what students have and have not accomplished. Since then, although
many assessment systems still rely heavily on norm-referenced reporting, criterion-
referenced reporting, particularly the use of performance levels or cut scores, has
become increasingly common and has been mandated in several rounds of Title I leg-
islation, including NCLB.4 Reporting in terms of specific knowledge or skills mastered
has intuitive appeal, but it is not clear that users of test results always know how to
interpret information about performance levels (Koretz & Deibert, 1996), a topic to
which I return subsequently.
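
To make the distinction concrete, the sketch below is purely illustrative: the function names, norming sample, cut scores, and performance-level labels are hypothetical and are not drawn from any state's reporting system. A norm-referenced report locates a raw score within a reference distribution, whereas a criterion-referenced report compares the same score with fixed cut points.

```python
from bisect import bisect_right

def percentile_rank(score, norm_sample):
    """Norm-referenced report: percentage of the norming sample scoring at or below this score."""
    at_or_below = sum(1 for s in norm_sample if s <= score)
    return 100.0 * at_or_below / len(norm_sample)

def performance_level(score, cut_scores, labels):
    """Criterion-referenced report: performance level implied by fixed cut scores,
    independent of how other examinees performed."""
    return labels[bisect_right(cut_scores, score)]

# Hypothetical norming sample and cut scores for a short test.
norm_sample = [12, 18, 22, 25, 27, 30, 31, 33, 36, 40]
cut_scores = [20, 28, 35]
labels = ["Below Basic", "Basic", "Proficient", "Advanced"]

raw_score = 29
print(percentile_rank(raw_score, norm_sample))           # 50.0: relative standing
print(performance_level(raw_score, cut_scores, labels))  # "Proficient": fixed standard
```

The point of the contrast is that the first report changes whenever the norming sample changes, whereas the second changes only if the cut scores themselves are reset.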
Once the decision is made regarding how to report an individual’s performance,
accountability systems require some form of aggregation of results across individuals.
A hallmark of NCLB and many state accountability systems is a requirement that scores
be reported not only at the level of schools, but for members of subgroups within those
schools. NCLB, for example, requires score reports disaggregated by race/ethnicity, gen-
der, disability status, English proficiency, and status as economically disadvantaged.
Although it is difficult to predict whether subgroup reporting will lead to the desired
outcome of ensuring that adequate attention is paid to all groups, there is evidence that
when separate targets are established for subgroups, diverse schools will be penalized
unfairly owing to the increased likelihood that they will fail to meet targets for one or
more groups as a result of volatility in scores (Kane & Staiger, 2002).
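
The mechanism can be sketched with a stylized calculation; this is an illustration of the general point rather than Kane and Staiger's own model, and it assumes a common score standard deviation \(\sigma\) and statistically independent subgroups. The mean score of a subgroup with \(n_g\) students fluctuates from year to year roughly in proportion to \(1/\sqrt{n_g}\), and a school must clear every one of its \(k\) subgroup targets:

\[
SE(\bar{x}_g) \approx \frac{\sigma}{\sqrt{n_g}},
\qquad
\Pr(\text{miss at least one target}) = 1 - \prod_{g=1}^{k}\left(1 - p_g\right) \ge \max_{g} p_g,
\]

where \(p_g\) is the chance that subgroup g falls short of its target through score volatility alone. Small subgroups make each \(p_g\) larger, and reporting many subgroups compounds the risk, so a diverse school can be flagged even when no group's true performance has declined.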

WHAT DO LARGE-SCALE TESTS MEASURE?


The characteristics of today’s test-based accountability systems (e.g., high stakes,
criterion-referenced reporting, subgroup performance reporting requirements) affect
the validity of interpretations from those tests and the kinds of consequences that will
ultimately be associated with their use. But perhaps the most important factor influ-
encing these outcomes is the test itself. As discussed earlier, users of large-scale tests
make inferences about what the scores from those tests mean, and their inferences may
not always be consistent with the skills and knowledge that the test actually measures.
Despite calls for improving students’ capacity to reason at high levels and solve complex
problems (see, e.g., National Commission on Excellence in Education, 1983), many
assessment experts have expressed concerns that the emphasis of most commonly used
large-scale tests is on basic facts and procedures (National Research Council, 2001).
The tests used in formal test-based accountability systems vary in scope and format,
but the multiple-choice format is by far the most common one used (“Quality Counts,”
2002). Although standardized multiple-choice tests certainly can be designed to tap
some kinds of problem-solving and reasoning skills, and many of them do (Hambleton
& Murphy, 1992; Hamilton, Nussbaum, & Snow, 1997), most such tests are inade-
quate for measuring certain important educational outcomes such as organization of
knowledge, strategies for problem solving, and metacognitive skills (Glaser, Linn, &
Bohrnstedt, 1997; National Research Council, 1999a, 2001).
Critics of multiple-choice tests have advocated for the use of other forms of assess-
ment, often called performance assessments. This type of test “requires examinees to
construct/supply answers, perform, or produce something for evaluation”—in other
words, any test that does not use a selected-response (multiple-choice, true–false) format
(Madaus & O’Dwyer, 1999, p. 689). Performance assessments require the subjective
judgment of a rater, unlike selected-response items that are typically machine scored.
Advocates of performance assessments have claimed that this form of assessment will
improve instruction and facilitate more valid inferences about examinees’ capabilities,
particularly with respect to problem solving and higher order thinking skills (Madaus,
1993). However, validity studies have demonstrated that performance tests frequently
fail to tap the processes and skills their developers intended (Baxter & Glaser, 1998;
Hamilton et al., 1997; Linn, Baker, & Dunbar, 1991), and concerns about technical
quality and costs are likely to prohibit states from using performance assessments in
their accountability systems (Linn, 1993; Mehrens, 1998; National Center for Educa-
tion Statistics [NCES], 1996). To illustrate the latter point, the U.S. General Account-
ing Office (2003) recently estimated the cost to states of implementing NCLB using
only multiple-choice tests as approximately $1.9 billion, whereas the cost if states also
include a small number of hand-scored open-response items such as essays would be
about $5.3 billion. The magnitude of this difference suggests that many states will be
reluctant to adopt non-multiple-choice formats, at least until the testing technology
becomes less expensive.
As a result of the technical and cost constraints inherent in testing large num-
bers of students across multiple classrooms, schools, and districts, to date most
large-scale testing programs have been shown to assess a limited number of desired
outcomes. Tests’ failure to measure the full range of achievement outcomes that
schools are believed to influence limits their utility for instructional purposes and
threatens the validity of inferences users make from test results. These threats to
validity are of particular concern when results have stakes attached for students or
educators. The fact that tests typically assess only a subset of important skills and
knowledge has implications for the effects of large-scale testing programs, as I dis-
cuss subsequently.

THE ROLE OF CLASSROOM-BASED ASSESSMENT


Until now, the discussion has focused on large-scale tests, particularly those used in
test-based accountability systems. Despite their prominence in state and local educa-
tion policies, these tests represent a small fraction of the tests that students take during
the school year. Classroom-based assessments5—those used by teachers to improve
instruction and student learning (National Research Council, 2001)—are often omit-
ted from policy discussions but can exert a powerful influence on student learning and
other educational outcomes, as I discuss later in the chapter. There is an increasing
recognition in the education practitioner and research communities that classroom-
based assessments can usefully complement large-scale assessments as instruments to
promote educational change and can contribute to enhanced validity of inferences
about student knowledge and skills, particularly when there is a need to track student
growth over time (National Research Council, 2001).
In the next section, I return to the topic of large-scale testing and examine what is
known about the effects of large-scale testing on student achievement and other out-
comes. At the end of that section, I discuss the evidence on effects of classroom-based
assessment. The chapter concludes with a set of suggestions for the design of testing
programs that will promote beneficial outcomes.

WHAT DO WE KNOW ABOUT THE EFFECTS OF LARGE-SCALE ASSESSMENT?

The focus of this section is on large-scale testing with formal stakes attached (here-
after called high-stakes testing), because high-stakes tests are so central to current policy
discussions. Although numerous studies have examined the effects of high-stakes test-
ing, the majority of these investigations have failed to reach the standards of quality that
would be required to make strong inferences based upon them (Mehrens, 1998). In
part, this is a result of the difficulty inherent in studying this topic: Such studies require
access to schools and educators, and cooperation for such research is often difficult to
obtain. In addition, it is nearly impossible for researchers to set up the kind of experi-
mental design that is most appropriate for examining cause-and-effect relationships
(Reckase, 1997). Another difficulty arises from the diversity of assessment programs and
accountability policies across states and districts; the effects are likely to depend in large
part on the specific features of those programs and policies (Swanson & Stevenson,
2002), making it difficult to identify universal effects. Finally, even studies that are
intended to be merely descriptive often suffer from poor measurement of the construct
of interest, as well as biased samples that may result from nonrepresentative sampling
or nonrandom refusal to participate in the research. Nonetheless, the existing body of
research does provide some evidence of effects on certain outcomes and is useful for
thinking of ways to improve assessment and accountability. I focus on five broad out-
comes: instructional practice (broadly defined to include actions taken by teachers and
administrators to influence the instruction provided by the school), school and class-
room climate, student achievement, equity, and the validity of information from tests.
I also present some results from studies of the effects of classroom-based assessment.

Effects on Instructional Practice


An understanding of how testing affects instructional practice is critical not only for
evaluating the extent to which testing policies have promoted better instruction, but
also for interpreting the scores from tests. Most testing and accountability policies are
intended to promote a number of changes in practice, including improving the qual-
ity of teaching, encouraging the adoption of high-quality curricula, and enhancing the
effectiveness of school staff as a whole. To the extent that these goals are attained, stu-
dent achievement should be expected to increase. Some kinds of changes in practice,
however, may lead to outcomes that are unanticipated or are contrary to what the poli-
cies were intended to accomplish. These changes may adversely affect student out-
comes as well as the quality of information produced by tests.
Of particular concern is the way in which educators may reallocate their efforts away
from content that is not tested and toward content that is tested. The extent to which
the validity of inferences may be compromised by such reallocation depends on the
nature of the reallocation and on the specific inferences that users make. If teachers
focus more effort on their states’ content standards, and if users interpret scores as indi-
cators of proficiency on those standards, it is likely that users will make reasonably valid
inferences from any changes in test scores that result. However, if teachers emphasize
aspects of the test that are incidental to the construct the test is designed to measure
(e.g., certain item styles or formats), or if they emphasize the specific skills included in
the test without addressing skills or knowledge that are not tested but that are part of
users’ targets of inference, validity may be compromised. Koretz et al. (2001) identi-
fied seven types of teacher responses to high-stakes testing, some of which they called
positive (providing more instructional time, working harder to cover more material,
working more effectively), one of which they labeled negative (cheating), and three of
which they said were ambiguous—that is, their impact may be positive or negative
depending on the nature of the response and the context in which it takes place. These
three ambiguous responses—reallocating instructional time, aligning instruction with
standards,6 and coaching by focusing on incidental aspects of the test7—are addressed
by some of the studies reviewed subsequently.
Before discussing the evidence on the effects of testing on practice, it is worth point-
ing out that testing is likely to serve as one influence among many, and the effects of
testing will in most cases interact with other factors such as teachers’ own beliefs and
knowledge about pedagogy and subject matter, their professional development experi-
ences, their access to necessary resources including appropriate curriculum and instruc-
tional materials, and the responses of their colleagues and supervisors (Cimbricz, 2003;
Cohen, 1995; Haertel, 1999; O’Day, Goertz, & Floden, 1995). Therefore, the effects
of testing may be limited to surface-level responses rather than deeper changes in the
nature of instruction (Firestone, Mayrowetz, & Fairman, 1998). Also, the effects of test-
ing on instruction are likely to vary by stakes. If tests do not have clear consequences
attached to them, teachers may pay little attention. However, when test scores are asso-
ciated with consequences that are important or meaningful to teachers, it is likely that
instruction will be affected. The empirical evidence, though not extensive, supports
this distinction (Mehrens, 1998), particularly a recent national survey of teachers con-
ducted by Pedulla et al. (2003). Most of the studies of effects on practice, particularly
those involving surveys, report average responses that mask some of these important
interactions and influences.

Classroom-Level Effects
There is some evidence that large-scale testing has led to some of the outcomes that
Koretz et al. (2001) classified as beneficial responses. Research suggests that teachers
may respond by focusing their efforts more strongly on achievement than they had in
the past (Wolf, Borko, McIver, & Elliott, 1999) and by working harder (Bishop &
Mane, 1999). Additional evidence of positive effects was reported by Shepard and
Dougherty (1991), whose research indicated that many teachers in two districts with
high-stakes tests believed that the test results were helpful for identifying student
strengths and weaknesses and for attracting resources for students who needed them.
Other studies indicate neutral or negative effects. Firestone et al. (1998) found little
evidence of significant changes among teachers facing a high-stakes testing program in
Maryland, and Jones et al. (1999) reported roughly equal percentages of teachers who
increased and decreased their use of various practices. Using classroom artifacts and
interviews with teachers, McDonnell and Choisser (1997) found that although assess-
ment programs in North Carolina and Kentucky led to changes in instructional
approaches, the depth and complexity of instruction were not affected.
Much of the literature that examines reallocation suggests that tests may exert a neg-
ative effect on curriculum and instructional practice by narrowing teachers’ focus exces-
sively. Several studies demonstrate that teachers tend to deemphasize subjects such as
science, social studies, art, and writing that are not part of many testing programs
(Jones et al., 1999; Koretz, Barron, Mitchell, & Stecher, 1996; Shepard & Dougherty,
1991; Smith, Edelsky, Draper, Rottenberg, & Cherland, 1991; Stecher, Barron,
Chun, & Ross, 2000). In Kentucky in the late 1990s, for example, where some sub-
jects were tested in fourth grade and others were tested in fifth grade, teachers in the
grades at which a subject was tested reported spending more time teaching that subject
than teachers in the grades at which the subject was not tested (Stecher & Barron,
1999). Teachers in Colorado reported that their practices were positively influenced by
the state standards but that the state’s high-stakes testing program tended to have a
negative influence, particularly by causing teachers to reduce time spent on social stud-
ies and science and to eliminate projects, lab work, and other activities not represented
in the test. Moreover, these changes were more frequent at low-performing than at
high-performing schools (Taylor, Shepard, Kinner, & Rosenthal, 2003). These Col-
orado teachers also believed that the scores were not very useful for instructional feed-
back and expressed concern that the state’s accountability system would result in funds
being taken away from needy schools (Taylor et al., 2003).
Reallocation also occurs within subjects, across tested and untested topics and skills.
Some teachers report altering the sequence in which they present topics to accommo-
date the testing schedule, making sure that tested topics are presented before the test-
ing date and saving other topics for the end of the school year (Corbett & Wilson, 1988;
Darling-Hammond & Wise, 1985). Perhaps more important, several studies indi-
cate that teachers have increased coverage of tested topics and skills while decreasing
emphasis on topics and skills not included in the test. In a study of two districts con-
ducted by Shepard and Dougherty (1991), substantial majorities of teachers reported
increasing the amount of time allocated to basic skills, vocabulary, and computation in
mathematics. Romberg, Zarinia, and Williams (1989) found that among a national
sample of eighth-grade mathematics teachers, a majority had increased coverage of basic
skills and computation while decreasing emphasis on extended projects and other activ-
ities not emphasized by most tests. In language arts, teachers in Arizona reported
neglecting nontested parts of the curriculum, including certain types of writing (Smith
et al., 1991).
Some responses focus heavily on the specific format in which test questions are pre-
sented: Teachers in two Arizona schools studied by Smith and Rottenberg (1991)
reported having students solve only the types of mathematics word problems that
appeared on the state test, and in a study conducted by Darling-Hammond and Wise
(1985), teachers reported reducing their reliance on essay tests and administering more
quizzes designed to mirror the format of items on the standardized tests given in their
schools. In the Shepard and Dougherty (1991) study of two districts, teachers of writ-
ing said they had begun to emphasize having students look for mistakes in written
work rather than produce their own writing, as a result of the format of the writing test
used in those districts. The practice of adopting instructional materials (including class-
room assessments) that are designed to mirror the format of the state test is more com-
monly reported among teachers in states with high-stakes testing programs than in
other states (Pedulla et al., 2003).
Studies of test preparation, which generally address activities such as practicing on
released forms of the test, indicate that in at least some schools, substantial amounts of
instructional time are consumed by these activities. Tepper (2002) found that teachers
in Chicago increased time spent on test preparation substantially when a high-stakes
testing program was introduced. Smith (1994) reported test preparation consuming up to 100 hours per course
among teachers in Arizona, and Jones et al. (1999) found that a majority of teachers
in North Carolina said they devoted more than 20% of instructional time to test prac-
tice. In Colorado, teachers reported spending large amounts of time on practice tests
and teaching of test-taking strategies, particularly at lower performing schools (Taylor
et al., 2003). The Pedulla et al. (2003) comparison of high- and low-stakes states
indicates that teachers in high-stakes states spend more time on test preparation,
begin it earlier in the year, and are more likely to use specific types of materials—
those that resemble the state test, are prepared by the state or commercial publishers,
or include released items—than are teachers in low-stakes states (see also Corbett &
Wilson, 1991).
The sole unambiguously negative response identified by Koretz et al. (2001) was
cheating, which can take many forms including failing to follow test-administration
instructions, inappropriately exposing students to copies of the test, and changing stu-
dent responses before submitting answer sheets for scoring. Although there are no
comprehensive studies of the frequency of cheating, sizable minorities of teachers in
two states reported that inappropriate test-administration practices such as rephrasing
questions during testing occurred in their schools (Koretz, Barron, et al., 1996; Koretz,
Mitchell, Barron, & Keith, 1996). Jacob and Levitt (2002) found that instances of
cheating increased when high-stakes testing was introduced in Chicago, but they also
noted that this increase was responsible for only a small portion of the test-score gains
observed there.
Together, these findings suggest that high-stakes testing does influence instruction,
at times in significant ways. Indeed, tests seem to exert a more powerful influence than
standards, even though the latter are intended to be the primary vehicle through which
information about instructional goals is conveyed (see, e.g., Clarke et al., 2003). Some
of these responses may be a result of teachers’ inability to distinguish between ethical
and unethical test-preparation practices; the boundaries are not always clear, particu-
larly with regard to practices such as focusing on the specific objectives measured by a
test (Mehrens & Kaminski, 1989). Because of the weaknesses inherent in most efforts
to measure instructional practice, it is not always possible to determine whether these
effects are beneficial or harmful or to determine with certainty which features of the
testing programs contributed to the effects. However, the frequency and extent of real-
location, both within and across subjects, should certainly be kept in mind by users
who are interpreting results from high-stakes tests and by policymakers who wish to
use testing as a means to shape curriculum and instruction.

Does Performance Assessment Reduce Narrowing?


Performance assessment has often been viewed as a solution to the problem of nar-
rowing of curriculum and instruction. Advocates often discuss the notion of “tests
worth teaching to,” and they have touted the power of performance assessments to
promote instruction that emphasizes active learning and problem solving while also
serving as models of good instruction (Resnick & Resnick, 1992; Wiggins, 1992).
Some evidence suggests that large-scale performance assessment has, in at least some
cases, promoted an increased use of practices consistent with the goals of the tests’
developers and has led teachers to adopt more innovative instructional methods (Borko
& Elliott, 1999; Lane, Parke, & Stone, in press; Wolf & McIver, 1999). Koretz,
Stecher, Klein, and McCaffrey (1994) studied the implementation of Vermont’s
statewide portfolio assessment program and found that teachers reported increasing
their emphasis on mathematical problem solving and representations. Mathematics tests
that require students to explain their answers have been found to lead to an increased
emphasis on explanation in math classes (Taylor et al., 2003). When the assessment
includes a writing component (either a separate essay-based test or a requirement to
write responses on a test of another subject, such as mathematics), teachers are likely
to respond by increasing the time students spend writing (Koretz, Barron, et al., 1996;
Koretz & Hamilton, 2003; Stecher, Barron, Kaganoff, & Goodwin, 1998). Finally,
research by Stone and Lane (2003) suggests that some aspects of instructional change,
such as increased use of reform-oriented problems, are associated with improved per-
formance on Maryland’s performance-based state test; these results point to the impor-
tance of examining instructional practices when attempting to discern the effects of
testing on student achievement.
At the same time, there is evidence that performance assessment can in some
instances lead to narrowing of instruction and fail to induce teachers to focus on desired
skills and processes (Koretz & Barron, 1998; Miller & Seraphine, 1993). Even though
the previously cited literature suggests that performance assessment has led to an in-
creased focus on problem solving, teachers often emphasize the specific types of prob-
lem solving tapped by the test and ignore broader conceptions of problem solving
(Stecher & Mitchell, 1995). Not only do teachers pay attention to the test items, but
they also shift their instruction and evaluation strategies to match the rubrics that are
used to score the assessments (Mabry, 1999). As Firestone et al. (1998) point out, teach-
ers’ beliefs and knowledge of pedagogy and subject matter are likely to exert a greater
influence on practice than can be achieved with any testing program, and because most
teachers are accustomed to focusing on small, discrete problems and covering many
topics in a somewhat shallow way, the implementation of performance assessment alone
is unlikely to affect practice beyond fairly small, surface-level changes.
The limited generalizability of scores on performance assessments is another prob-
lem that is partly related to the possible effects of performance assessment on curricu-
lum narrowing. Several studies have documented that scores from various types of
performance assessments tend to have low levels of reliability and do not generalize well
to other kinds of assessments measuring similar constructs (Baker, O’Neil, & Linn,
1993; Dunbar, Koretz, & Hoover, 1991; Miller & Seraphine, 1993; Shavelson,
Baxter, & Pine, 1992). Although raters are a commonly cited source of variability, for
most performance assessment applications, task sampling is a larger problem than rater
sampling (Dunbar et al., 1991). The specific format that is used has a large effect on
performance and can reduce generalizability. Baxter, Shavelson, Goldman, and Pine
(1992), for example, used several methods to measure students’ science achievement
and found that changes in scores on a notebook-based task did not correlate well with
changes in scores on tasks that involved direct observation of students’ performance.
This failure to generalize not only reduces the utility of performance assessments as a
method for gathering information about student skills and knowledge, but may exac-
erbate the problem of curriculum narrowing by encouraging educators to focus on a
specific task type as the most promising means of raising scores. Some of the problems
that have been identified are attributable in part to the specific features of performance
assessments that are used for external accountability purposes (see Haertel, 1999) but
also to the limitations of tests as policy instruments, a topic to which I return in the
final section of the chapter.

School-Level Effects
The instructional effects of testing can extend beyond the classroom. Research has
addressed school-level responses that are likely to lead to changes in the nature and
quality of curriculum and instruction provided to students. Some studies that included
principal surveys found that many principals reported increasing teacher profes-
sional development opportunities, adding summer or after-school sessions to provide
additional instruction to low-performing students, and revising curriculum programs
(Stecher et al., 2000; Stecher & Chun, 2001). On the other hand, some kinds of
responses reported by principals may be problematic in that they appear to be designed
to improve test scores without necessarily improving the overall quality of education
offered to students in the school. For example, approximately one third of principals
in Maryland reported reassigning teachers between tested and untested grades to im-
prove the quality of instruction in the grades for which the test was administered
(Koretz, Mitchell, et al., 1996). In general, there is evidence that, when faced with
strong incentive systems, school personnel tend to focus more on short-term test-score
gains than on long-term instructional improvement (O’Day, in press). Other studies
have reported frequent use of student-level incentives such as field trips or parties as
rewards for good test performance (Stecher et al., 2000); whether this represents a
desirable or harmful practice obviously depends on how it is implemented and how
students respond.

Effects on School and Classroom Climate


School climate is a function of many factors, only a few of which are discussed here.
I focus primarily on the ways in which testing and accountability may affect factors
that relate to teachers’ satisfaction with their working conditions. Perhaps most impor-
tant, teacher stress and morale have been reported to be negatively affected by high-
stakes testing. Koretz and colleagues (Koretz, Barron, et al., 1996; Koretz, Mitchell,
et al., 1996) found that a majority of teachers in each of two states attributed a decline
in teacher morale to the state’s high-stakes test. Reduced morale was also reported by
Taylor et al. (2003) among teachers in Colorado. Pedulla et al. (2003) found that,
among a national sample of teachers, those teaching in states with high-stakes testing
programs reported feeling more pressure than those in states without such programs,
and more teachers in high-stakes states said that teachers at their school wanted to
transfer out of tested grades. This study also revealed that a majority of teachers, par-
ticularly at the elementary level, believed that the teaching practices they adopted as a
result of the state test contradicted their views of what good teaching should look like.
On the other hand, it is possible for high-stakes testing to improve teachers’ moti-
vation. Incentives may be effective at motivating teachers to work harder or more
effectively, but to do so they must be personally significant enough to the teachers
to compensate for the extra work and stress, and they must be based on clearly spec-
ified, noncompeting goals as well as an effective system for providing feedback to
teachers (Kelley, Odden, Milanowski, & Heneman, 2000). Testing may also improve
teachers’ motivation and morale if it is accompanied by efforts on the part of the
school administration to provide appropriate learning opportunities (Borko, Elliott,
& Uchiyama, 1999).
Some studies suggest that teachers find standards and assessments helpful for focus-
ing their instruction and obtaining feedback about the effects of their instruction
(Mabry, Poole, Redmond, & Schultz, 2003), but other studies indicate that at least
some teachers do not perceive tests as instructionally useful (Clarke et al., 2003; Taylor
et al., 2003). It is not entirely surprising that teachers do not always find tests useful;
as Linn (2001) points out, “such tests are more suitable for providing global infor-
mation about achievement than they are the kind of detailed information that is
required for diagnostic purposes” (p. 3). In addition, many teachers believe large-scale
tests are poor measures of students’ skills and knowledge, particularly in the case of
special education students, minority students, and English language learners (Pedulla
et al., 2003).
Students’ emotional responses to testing represent another critical component of
school climate. Majorities of teachers believe that testing increases students’ stress lev-
els and negatively affects student morale and self-confidence, according to several
studies that have surveyed teachers (Koretz, Barron, et al., 1996; Mabry et al., 2003;
Miller, 1998). Comparing states with high-stakes tests and those without such tests,
Pedulla et al. (2003) found that teachers in high-stakes states were more likely to
report that students felt intense pressure to do well on tests, and although a majority
in both types of states believed student morale was generally high, teachers in low-
stakes states were more likely to rate student morale as high than were teachers in
high-stakes states. Although these studies are suggestive, they rely on teacher percep-
tions, and there is little direct evidence of how testing actually affects student morale
(Mehrens, 1998). A few studies suggest that high-stakes testing may improve stu-
dents’ motivation to learn (Betts & Costrell, 2001; Roderick & Engel, 2001), but the
motivational effects are likely to vary according to the age of the student (Phillips &
Chin, 2001) and the specific features of the incentive system. In particular, to the
extent that students view the tests as too difficult or the goals as unrealistic, the moti-
vational effects may be negligible or even negative (Linn, 1993). The overall lack of
evidence regarding student morale, stress, and motivation is due in part to the diffi-
culty that researchers have in gaining access to students and measuring their levels of
these constructs (Stecher, 2002).
Parent involvement and support for testing may influence the climate of the school
as well. Nationwide, parental support for some aspects of today’s testing policies has
been fairly high in recent years (Public Agenda, 2000), but in certain settings, particu-
larly among parents of children in high-scoring suburban schools, testing has been met
with intense parental resistance (see, e.g., Zernike, 2001) and has led to the formation
of organized parent groups such as the Parents’ Coalition to Stop High-Stakes Testing
in New York (McDonnell, 2002). It is not clear how these reactions affect school cli-
mate, but as the provisions of NCLB begin to affect families it will be important to
monitor parents’ responses.

Effects on Student Achievement


Arguably the primary rationale for test-based accountability systems is to improve
student achievement. Advocates of such systems have pointed to increases in test scores
that occurred after accountability policies were enacted. Two of the most prominent
examples of statewide testing programs lauded for their positive effects on student
achievement are Texas and North Carolina, where test scores rose dramatically after
the introduction of test-based accountability systems (Grissmer & Flanagan, 1998;
Texas Education Agency, 2000). It is difficult to attribute changes in test scores directly
to accountability systems in part because such systems are enacted in the context of
many other education reform efforts, and achievement trends may be confounded with
characteristics of states and districts that influence both achievement and the likelihood
that accountability policies will be adopted (Carnoy & Loeb, 2002). However, as with
the research on effects on practice, there is a growing body of evidence that may pro-
vide some indication of what we can expect as high-stakes testing and accountability
become more widespread. When reviewing this research, it is important to distinguish
between studies that examine gains on the high-stakes test itself and those that incor-
porate information from another test of the same subject. Gains that are observed only
on the test used in the accountability system may not provide sufficient evidence of
true gains in achievement, because scores on accountability tests may become inflated,
a problem I discuss subsequently.
Some research suggests a link between accountability and student achievement on
the high-stakes test, particularly when accountability systems include high school exit
exams (Fredericksen, 1994; Winfield, 1990), though one study that examined high
school exit exams and that controlled for individual student characteristics (unlike
most of the research on this topic) found no such relationship (Jacob, 2001). At the
lower grades, Roderick, Jacob, and Bryk (2002) examined achievement trends among
students in Chicago who were subjected to test-based grade promotion policies.
Increases in gains in the grades at which students were held accountable were substan-
tial in both reading and mathematics and tended to be larger for students attending
low-performing schools than for students of similar ability attending higher perform-
ing schools. A study of the Dallas school accountability program by Ladd (1999) indi-
cated that the performance of seventh-grade students in Dallas improved more rapidly
than the performance of seventh graders in surrounding districts after the accountabil-
ity system was introduced. She found no such effect for third graders but attributed
this in part to problems with the testing system and exclusion rates at that grade.
Studies examining changes in test scores using measures other than the primary
accountability test have produced mixed results. In a study of states’ NAEP trends,
Grissmer, Flanagan, Kawata, and Williamson (2000) attributed improvements on
NAEP to the creation of test-based accountability systems in certain states, though
this study did not empirically test the link between accountability systems and per-
formance on NAEP. A comparison of countries and Canadian provinces showed
that those with exit exams had higher average achievement among middle-school-
aged students on external tests (including the Third International Mathematics and
Science Study [TIMSS]) than those without exit exams (Bishop, 1998); however, it
was not possible to control for all of the factors that may have contributed to those
differences. More recently, Carnoy and Loeb (2002) conducted what is to date the
most comprehensive analysis of the relationship between state-level accountability
policies and NAEP gains. These researchers controlled for contextual factors at the
state level, including demographics and funding, as well as for prior student achieve-
ment. Their results suggest a link between state accountability policies and gains on
NAEP mathematics scores. Analyses conducted by Amrein-Beardsley and Berliner
(2003), however, attributed much of the apparent gain in NAEP scores among states
with high-stakes tests to increased rates of exclusion in those states; in other words,
the increases in the high-stakes states appear to be due in large part to changes in the
tested population rather than real effects of the testing policies. Overall, then, the
evidence, though not extensive, points to some positive relationships between high-
stakes testing and student achievement but also raises questions about the sources of
those relationships, and the evidence is not sufficient to infer a causal relationship
between stakes and NAEP scores.
Despite the evidence of improved achievement on non-accountability tests, the
increases on NAEP have been several times smaller than the increases on state
tests (Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz & Barron, 1998). Most
of these studies examined aggregate achievement trends. Jacob (2002), in contrast, was
able to track individual students over time using data from the Chicago Public Schools.
In this study, scores on the Iowa Test of Basic Skills, which had been improving over
time, increased more rapidly after the district introduced an accountability system that
imposed stakes on schools (probation, reassignment of staff) and on students (reten-
tion in grade). Scores on a low-stakes state test also improved during this time, but the
rate of improvement on the low-stakes test did not change when the accountability sys-
tem was implemented. Together, these results indicate that gains on accountability
tests tend not to generalize to other measures of the same constructs, although small
gains are sometimes observed on those other measures, as discussed in the previous
paragraph. Jacob also found that scores on tests of science and social studies, which
were not part of the accountability system, increased much less than scores on reading
and mathematics. A study by Deere and Strayer (2003) revealed a similar pattern in
Texas. The failure of gains to generalize to tests other than the one used for account-
ability purposes raises concerns about the validity of interpretations of scores on state
tests, discussed subsequently.

Effects on Equity
One of the rationales for many test-based accountability policies, and one that is
inherent in the title of the new federal legislation, is that such policies will increase
equity by ensuring that all students achieve at some predesignated level of perfor-
mance. At the same time, critics of these policies worry that certain groups of students
will suffer adverse consequences. Several aspects of equity have been addressed by
research, though in most cases the evidence base is limited.
State accountability systems such as the one implemented in Texas have achieved
recognition not only for raising average student achievement but for reducing differ-
ences among racial/ethnic and socioeconomic groups. Few studies have empirically
examined effects of high-stakes testing on achievement among different groups of
students. The Chicago study by Roderick et al. (2002) revealed some differences
in achievement gains across student ability groups, with lower achieving students
appearing to gain more than higher achieving students in reading but less in math.
Furthermore, this study revealed that reading scores among the highest-performing
third-grade students actually declined after the accountability system was introduced
(relative to expected trends), especially in the low-performing schools that were most
strongly affected by the accountability policies. In the Dallas study, Ladd (1999)
reported positive effects of accountability for Hispanic and White students but no
effects for Black students; there was no clear explanation for this difference. Examin-
ing differences across racial/ethnic groups on NAEP, Carnoy and Loeb (2002) found
that gains attributable to accountability were greater for minority students than for
White students, a result that may be attributable in part to the lower average starting
scores for minority students. However, the reduction in gaps in performance between
minority and White students on NAEP is typically much smaller than the reduction
in gaps on the state tests (see Haney, 2000; Klein et al., 2000).
The source of these differences is not clear, but some of them may be attributable to
differences in how teachers target instruction across student groups. Students at the
cusp of passing state tests may receive targeted instruction to improve their perfor-
mance; many educators refer to these students as “bubble kids” (Taylor et al., 2003).
This practice is more common in states with high-stakes tests than in those without
(Pedulla et al., 2003). In addition, teachers and administrators in schools serving poor
and minority students may be especially likely to engage in practices designed to raise
test scores, including providing extensive test preparation and narrowing the curricu-
lum to focus on tested topics (McNeil, 2000; Shepard, 1991; Smith et al., 1997). With-
out better evidence on the source of score gains among different groups of students, it
is impossible to determine whether they represent improved quality of instruction;
however, the conflicting NAEP results suggest that at least some of the effects are due
to targeted test-preparation efforts.
Many testing critics claim that high-stakes testing has led to disproportionate num-
bers of minority, low-income, and special needs students being retained in grade, drop-
ping out of school, or being prevented from graduating, as well as to a loss of resources
at schools serving these students (Darling-Hammond, 2003; McNeil, 2000). Teachers
in the Pedulla et al. (2003) study reported increased retention and dropout rates as a
result of testing requirements, particularly those teaching in states with high-stakes
programs. Of course, it is impossible to determine whether these teachers’ impressions
are consistent with what is actually happening. Empirical evidence of the effects of
accountability on retention and dropout rates is, as with most of the topics discussed
so far, mixed. Carnoy and Loeb’s (2002) study of state accountability systems found
no relationship with retention or dropout rates except for Hispanic students, for whom
the results were inconclusive. Haney’s (2000) analysis of Texas data suggested an
increase in retention after Texas implemented its high-stakes testing program. Carnoy,
Loeb, and Smith (2001) reanalyzed Haney’s data and found that the timing of the
increase was not consistent with the implementation of the most recent high-stakes
testing program, but could be attributable to earlier waves of testing and accountabil-
ity. Studies conducted by Jacob (2001) and Lillard and DeCicca (2001) provide evi-
dence that increased high school dropout rates may be associated with high school exit
exam policies. In contrast to these findings, Ladd (1999) reported that high school
dropout rates in Dallas fell relative to those in other districts after Dallas introduced
its test-based accountability system, though it is difficult to determine with certainty
whether the relationship was causal.
Some students may be subjected to inappropriate placements or exclusion from test-
ing as schools strategize to maximize performance. Figlio and Getzler (2002) examined
data from six counties in Florida and found that administrators frequently reclassified
students as disabled to exempt them from testing. This practice was especially common
among low-performing schools (see also Darling-Hammond, 1991; Haney, 2000).
Jacob’s (2002) study of Chicago’s accountability system revealed a similar phenome-
non—the data showed a large increase in the proportions of students placed in special
education classes or excluded from testing after the district’s accountability system was
introduced, and retention in grade also increased, even in grades that were not explic-
itly subject to the accountability system’s test-based promotion and retention policies.
Similarly, Deere and Strayer’s (2003) study suggests that administrators in Texas en-
gaged in efforts to exempt certain students from high-stakes testing in a strategic man-
ner. These actions threaten the validity of inferences about score changes and have
increasingly been the subject of media attention (see, e.g., Schemo, 2003).
School-level labeling, rewards, and sanctions may affect some types of schools more
than others, which in turn may disproportionately affect students from certain lan-
guage, socioeconomic, or racial/ethnic groups. Parkes and Stevens (2003) illustrate the
problem in the context of Florida’s accountability system. In one recent year, approx-
imately 17% of schools with high percentages of English language learners (defined as
greater than 30% of the school’s student population) received an “A” rating in the state’s
accountability system, whereas 33% of schools with lower percentages of English
language learners (less than 30% of the school’s population) received this rating. As a
result, students in schools with high percentages of English language learners were
denied reward money that was distributed to other schools. The effects of such dis-
crepancies on the educational experiences of individual students are unknown, but the
rewards and penalties, as well as other consequences of being labeled as a success or fail-
ure, could in fact exert a significant influence on students. As illustrated by Kane and
Staiger (2002), owing to provisions that require each subgroup to meet a specific tar-
get, schools serving diverse populations of students are more likely than homogeneous
schools to be labeled as making insufficient progress purely as a result of statistical
error. Of course, the subgroup provisions may ultimately lead to positive effects as well,
and it is too early to say whether the effects on balance will be beneficial or harmful.
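Kane and Staiger's point can be made concrete with a small simulation. The sketch below (in Python, with assumed subgroup sizes, proficiency rates, and targets rather than values drawn from their study) shows that when every reported subgroup must clear a fixed target, the probability that a school misses at least one target through sampling error alone rises quickly with the number of subgroups reported.

```python
"""Minimal sketch, with assumed parameters: how subgroup targets penalize
diverse schools purely through sampling error."""
import numpy as np

rng = np.random.default_rng(1)
true_rate = 0.55      # every subgroup's true proficiency rate (above the target)
target = 0.50         # AYP-style target that each reported subgroup must meet
subgroup_n = 40       # students per reported subgroup
n_schools = 100_000   # simulated schools per condition

def share_missing_a_target(n_subgroups):
    """Fraction of schools in which at least one subgroup's observed percent
    proficient falls below the target, despite identical true rates."""
    observed = rng.binomial(subgroup_n, true_rate,
                            size=(n_schools, n_subgroups)) / subgroup_n
    return (observed < target).any(axis=1).mean()

for m in (1, 2, 4, 6):
    print(f"{m} reported subgroup(s): "
          f"~{share_missing_a_target(m):.0%} of schools miss a target")
```

This simplified setup treats subgroups as independent and non-overlapping, which real student populations are not, but the qualitative point survives: the more targets a school must meet, the more likely it is to be flagged by chance alone.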
Aside from the effects of specific accountability policies such as exit exams and
school reconstitution, it is important to examine whether the tests themselves are
characterized by bias against one group or another, which could negatively affect
equity. While a thorough exploration of fairness in testing is beyond the scope of this
chapter and has been addressed elsewhere (see, e.g., Cole & Moss, 1989), it is worth
stating that users of tests must be vigilant about possible threats to fairness in the tests
as well as in the broader accountability systems, and validation efforts must seek to
ensure that inferences made from test scores have comparable degrees of validity
across student groups, including groups defined by racial/ethnic, gender, socioeco-
nomic, language, and disability status (National Research Council, 2001). Fairness of
test items is typically investigated through a combination of expert panels and statis-
tical (i.e., differential item functioning) methods (Camilli & Shepard, 1994), though
there is an increasing emphasis on the need to examine cognitive processes directly to
explore the sources of any differences detected through statistical methods (Hamilton,
1999; Lane, Wang, & Magone, 1996; Zwick & Ercikan, 1989).
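To illustrate the statistical side of such investigations, the sketch below applies one widely used DIF screen, the Mantel-Haenszel procedure, to simulated item responses; the data, group labels, and parameter values are hypothetical and are included only to show the mechanics of the method, not to reproduce any of the analyses cited above.

```python
"""Minimal sketch of a Mantel-Haenszel DIF screen on simulated 0/1 item data."""
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """item: 0/1 responses to one item; total: total scores used to form
    matching strata; group: 'ref' or 'focal'. Returns the MH common odds
    ratio and the ETS delta (values near 0 indicate negligible DIF)."""
    item, total, group = map(np.asarray, (item, total, group))
    num = den = 0.0
    for k in np.unique(total):                  # one 2x2 table per score stratum
        s = total == k
        A = np.sum(s & (group == "ref") & (item == 1))    # reference, correct
        B = np.sum(s & (group == "ref") & (item == 0))    # reference, incorrect
        C = np.sum(s & (group == "focal") & (item == 1))  # focal, correct
        D = np.sum(s & (group == "focal") & (item == 0))  # focal, incorrect
        N = A + B + C + D
        if N > 0:
            num += A * D / N
            den += B * C / N
    alpha = num / den                           # common odds ratio across strata
    return alpha, -2.35 * np.log(alpha)         # ETS delta scale

# Hypothetical data: the item depends only on ability, so DIF should be near 0.
rng = np.random.default_rng(0)
n = 5000
group = np.where(rng.random(n) < 0.5, "ref", "focal")
ability = rng.normal(0, 1, n)
total = np.clip(np.round(25 + 5 * ability), 0, 50)
item = (rng.random(n) < 1 / (1 + np.exp(-(ability - 0.2)))).astype(int)
print(mantel_haenszel_dif(item, total, group))
```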
Most currently used large-scale tests have been shown to support reasonably valid
inferences for most subgroups, though there is currently a paucity of research on fair-
ness for students with disabilities and English language learners. Both of these groups
are being included in large-scale test administrations at much higher rates than in the
past, in large part as a result of district, state, and federal mandates for inclusion, and
both groups may be offered accommodations or modifications to the test or adminis-
tration conditions. Under some conditions accommodations may enhance the validity
of large-scale assessments for students with disabilities (Tindal, Heath, Hollenbeck,
Almond, & Harniss, 1998), whereas in other contexts there is evidence that accom-
modations may inflate scores (Koretz & Hamilton, 2000). The latter finding may be
partially attributed to the lack of standardization across schools and districts in who
receives which accommodations and how those accommodations are given; even in
state testing programs that publish reasonably clear guidelines for use of accommoda-
tions, there appears to be extensive variation in how these guidelines are implemented
at the local level (Koretz, 1997). Clearly, more research on the validity of large-scale
tests for students with disabilities is needed. In the meantime, it is reasonable to expect
that the ambitious targets for these students imposed by NCLB will be difficult to meet
(Linn, 2000).
Some promoters of performance assessment hoped that such assessments would nar-
row racial/ethnic and socioeconomic gaps in performance as well as other group dif-
ferences (Haertel, 1999); in general, however, performance assessment has not been
shown to reduce group differences. Linn et al. (1991) reported that racial group dif-
ferences on NAEP essay questions were of approximately the same magnitude as those
for multiple-choice items, and Klein et al. (1997) found that the use of hands-on sci-
ence tasks did not reduce gender or racial/ethnic group differences as compared with
multiple-choice science items. The specific type of reasoning and sources of knowledge
required by a task appear to be more important determinants of group differences than
format (Hamilton, 1998). Attention must also be paid to the language and cultural
demands of the tasks, which are sometimes substantial for performance assessments
(Baker & O’Neil, 1995; Solano-Flores & Nelson-Barber, 2001).

Effects of Test-Based Accountability on the Validity of Information From Tests
Several features of current test-based accountability systems are likely to affect the
validity of information provided by the tests used in those systems. In this section, I
focus on two—the effects of the high stakes that are attached to performance and the
effects of the specific kinds of reporting strategies used in most systems.

Effects of High Stakes


If tests are intended to support decisions about changes in instruction or about
student or school rewards and sanctions, it is necessary to examine the validity of
inferences made from scores on those tests. One of the largest threats to validity in
high-stakes testing contexts is test score inflation, which “refers to increases in scores
that are not accompanied by commensurate increases in the proficiency scores are
intended to represent” (Koretz, 2003a, p. 9). This phenomenon is illustrated by the
earlier-discussed findings indicating that score gains on high-stakes tests do not always
generalize to other tests measuring the same or similar constructs.
The first widely reported evidence of score inflation was gathered by a physician
named John Cannell (1988), who discovered that most districts and states had
obtained average test scores that were above the national average as defined by test pub-
lisher norms. The phenomenon was labeled the “Lake Wobegon effect” (Koretz, 1988)
and has been documented by several studies since Cannell’s. Linn, Graue, and Sanders
(1990) showed that performance as measured against test publisher norms increased
throughout the 1980s for most published tests, but these gains were not replicated on
NAEP. Other studies that compared state test performance against NAEP produced
similar findings: large gains in state scores but much smaller or negligible gains in
NAEP scores for the same time period (Klein et al., 2000; Koretz & Barron, 1998).
Koretz, Linn, Dunbar, and Shepard (1991) provided clear evidence of inflation for a
district that switched from one published test to another; scores dropped dramatically
when the new test was introduced, but rapidly increased to approximately the level of
scores on the first test. When the first test was then readministered, scores were as low
as they were on the second test during its first year of administration. This pattern
suggests that the district adapted to the specific test being administered and that its
efforts to improve performance on that test did not generalize to another measure of
the same construct.
Another approach to comparing scores on high-stakes and audit tests is to examine
correlations between scores for students or schools. Greene, Winters, and Forster (2003)
conducted a study that presented correlations between scores on high- and low-stakes
tests for several states and districts. The study was intended to demonstrate that scores
on high-stakes tests generalize to lower stakes tests. Similarly, Hannaway and McKay
(2001) reported strong correlations between high- and low-stakes test scores in Texas,
noting that schools that performed well on one measure also did well on the other mea-
sure. This type of evidence is informative but limited; high correlations do not neces-
sarily indicate a lack of score inflation, and examination of mean trends in addition to
correlations is necessary to determine whether inflation may be occurring (Koretz et al.,
2001). Overall, then, while some of the evidence on the generalizability of gains is
inconclusive, there are several studies that provide clear indications of score inflation.
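The limitation of correlational evidence can be illustrated with a small simulation: if coaching or other test-specific preparation adds a roughly uniform boost to every school's high-stakes scores, the rank ordering of schools (and hence the correlation with an audit test) is largely preserved even though the high-stakes mean trend is inflated. The sketch below uses simulated data and assumed effect sizes; it is illustrative only.

```python
"""Minimal sketch, with assumed effect sizes: a high school-level correlation
between a high-stakes test and an audit test can coexist with score inflation."""
import numpy as np

rng = np.random.default_rng(2)
n_schools = 500
true_achievement = rng.normal(0, 1, n_schools)     # underlying proficiency, in SD units

# Both tests reflect true achievement plus school-level measurement noise;
# the high-stakes test also carries a test-specific preparation boost of 0.5 SD.
audit = true_achievement + rng.normal(0, 0.3, n_schools)
high_stakes = true_achievement + 0.5 + rng.normal(0, 0.3, n_schools)

print(f"school-level correlation:             {np.corrcoef(audit, high_stakes)[0, 1]:.2f}")
print(f"mean inflation (high-stakes - audit): {np.mean(high_stakes - audit):.2f} SD")
```

High correlations of this kind speak to the stability of school rankings, not to whether mean gains on the high-stakes test generalize; examining mean trends on both tests remains essential.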
Non-multiple-choice tests have not been shown to solve the problem of score infla-
tion. Koretz and Barron (1998) found extensive score inflation on Kentucky’s open-
response test items, with gains on the state test approximately four times the magnitude
of the state’s gains on NAEP. As discussed earlier in the section on curriculum nar-
rowing, it is likely that educators adapt their instruction to specific features of perfor-
mance tasks, which leads to improvements in performance on that type of task but not
on the broader construct the task was designed to measure.
There is much we do not know about score inflation, including what kinds of tests
are most susceptible and what kinds of students or schools are most likely to exhibit
inflated gains. We do know that high stakes and unrealistic goals are likely to exacer-
bate score inflation, as teachers take shortcuts when they believe goals are unattain-
able (Koretz, 2003a). There is insufficient information to determine the extent to
which the discrepancies in trends that have been reported threaten the validity of
inferences that users make. It is likely that most parents, policymakers, and other users
assume that performance on a given test generalizes beyond that specific measure, but
we do not know how broadly most users wish to generalize or how they would react
to test scores if they understood the problem of inflation. Research is needed to deter-
mine what kinds of inferences are being made on the basis of today’s high-stakes tests;
this is a critical part of any validation effort and will contribute to the development of
approaches for identifying score inflation (Koretz et al., 2001).

Effects of Reporting Strategies on Validity


NCLB and most state systems rely heavily on criterion-referenced or standards-
based reporting strategies that involve setting a small number of cut scores and report-
ing the percentages of students whose scores exceeded those cut points. Although
advocates of this type of reporting typically assert that it provides better information
for users than is available through norm-referenced reporting, there are some draw-
backs associated with criterion-referenced reporting. Converting a continuous distribu-
tion into categories results in a loss of information and masks some score changes
while exaggerating others (Koretz, 2003a). In addition, despite well-developed psycho-
metric technology, the process of setting performance standards or levels is inherently
judgmental and may result in different standards if different methods or judges are
involved (Linn, Koretz, Baker, & Burstein, 1992; Stufflebeam, Jaeger, & Scriven,
1991), but the judgmental nature of the task is rarely communicated to the public in
media reports (Koretz & Deibert, 1996).
Of critical importance given the current emphasis on subgroup differences is the
fact that the use of standards-based reporting that groups students into categories may
distort measures of gain or of differences between groups. A simulation conducted by
Koretz (2003a) shows that even when members of two groups (in this case, Black and
White students) achieve gains of the same magnitude, changes in the percentage of
students above a cut score suggest very different rates of progress for the two groups.
Similarly, if two schools achieve similar average gains but one school starts with stu-
dents just below the cut score and moves them above it, that school will register a
much larger gain on the percentage-proficient metric than a school that moves stu-
dents within the not-proficient or proficient categories rather than across the bound-
ary. This type of reporting therefore creates incentives for teachers to focus on the
“bubble kids” described earlier and perhaps to shortchange students who are less likely
to move from one performance level to the next (Hamilton & Koretz, 2002).
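A small simulation in the spirit of the Koretz analysis described above (using assumed, normally distributed scores rather than his actual figures) illustrates how identical true gains can register very differently on a percentage-proficient metric, depending on where each group's score distribution sits relative to the cut.

```python
"""Minimal sketch, with assumed distributions: equal mean gains, unequal
changes in percent proficient."""
import numpy as np

rng = np.random.default_rng(3)
cut = 0.0          # "proficient" cut score, in SD units
gain = 0.2         # identical true gain for both groups
n = 200_000        # simulated students per group

def pct_above(mean):
    return (rng.normal(mean, 1.0, n) > cut).mean()

for name, start in [("Group starting near the cut", -0.1),
                    ("Group starting far below the cut", -2.0)]:
    before, after = pct_above(start), pct_above(start + gain)
    print(f"{name}: {before:.0%} -> {after:.0%} proficient "
          f"(+{100 * (after - before):.0f} points)")
```

In this particular parameterization, the same 0.2 SD gain appears as roughly an eight-point increase in percent proficient for the group near the cut but only about a one-point increase for the group far below it.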
Another potential drawback of criterion-referenced reporting is that users of test
results do not always know how to interpret performance-level information. Koretz
and Deibert (1993, 1996) found widespread misinterpretation of NAEP achievement
levels in print media articles. This included oversimplification of the meanings of the
levels and a failure to recognize that the underlying performance was continuous
rather than discrete. Hambleton and Slater (1995) documented additional problems
with the interpretation of the NAEP achievement levels among policymakers and
educators. A combination of norm-referenced and criterion-referenced reporting may
be most informative for many users, including the media, policymakers, and parents.
The popularity of state and country rankings on national and international assess-
ments such as NAEP and TIMSS, even when the individual-level scores on those tests
are reported according to performance levels, illustrates the desire users have for some
sort of normative information (Hamilton & Koretz, 2002). Newspaper reports of
school rankings within states provide another example and suggest that there is a
desire for normative information among consumers of test scores.
A related reporting issue is the method used for aggregating scores across students
to obtain a measure of performance (sometimes called an accountability index) for a
group (e.g., a school). The way that an accountability index is constructed—for exam-
ple, whether it involves a single-year average, a gain score from one year to the next,
or a score adjusted for student background characteristics—also affects the kinds of
inferences supported and the validity of decisions made based on the index. NCLB
relies on a single, statewide target for all schools to meet, rather than rewarding or
penalizing schools based on the magnitude of gains or losses. However, many states
and districts continue to operate dual systems that involve the latter approach, and
some are experimenting with more sophisticated methods of measuring school perfor-
mance, such as value-added modeling (Sanders & Horn, 1998; Webster & Mendro,
1997). Inherent in each of these different approaches is some notion of what schools
should be held accountable for—whether, for example, a school serving low-income
or low-performing students should be expected to achieve the same level of performance
as one serving higher income or higher performing students, or whether the former
school should only be held accountable for promoting a similar amount of progress.
The choice of approach can dramatically affect rankings of schools (Clotfelter &
Ladd, 1996), and therefore the decisions made on the basis of the accountability
index, so it is a matter that must be considered seriously by those who are responsible
for designing accountability systems. It is also important to communicate to users that
even the most statistically sophisticated approaches can lead to inappropriate infer-
ences. McCaffrey, Lockwood, Koretz, Hamilton, and Barney (in press), for example,
show that the value-added models that are currently used may confound teacher
effects with the effects of student background characteristics under certain condi-
tions, though users of results from applications of value-added modeling may be led
to believe that such models are able to isolate teacher effects from other influences
(see also Kupermintz, 2002).
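The confounding concern can be illustrated with a stylized example. The sketch below fits a simple covariate-adjustment model (current scores regressed on prior scores plus teacher indicators) to simulated data in which an unmeasured background factor influences growth and is correlated with teacher assignment. This is not the model used in the studies cited above; all variable names and parameter values are assumed for illustration.

```python
"""Minimal sketch, with simulated data: a simple 'value-added' regression can
attribute background-driven growth differences to teachers."""
import numpy as np

rng = np.random.default_rng(4)
n_students, n_teachers = 3000, 30
teacher = rng.integers(0, n_teachers, n_students)

# An unmeasured background factor is correlated with teacher assignment and
# affects both prior status and growth; true teacher effects are independent of it.
background = rng.normal(2.0 * teacher / n_teachers, 1.0)
prior = background + rng.normal(0, 1, n_students)
true_effect = rng.normal(0, 0.2, n_teachers)
current = (0.8 * prior + 0.5 * background
           + true_effect[teacher] + rng.normal(0, 0.5, n_students))

# "Value-added" estimates: coefficients on teacher dummies, controlling for prior score.
X = np.column_stack([prior, (teacher[:, None] == np.arange(n_teachers)).astype(float)])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)
estimated = coef[1:]

mean_bg = np.array([background[teacher == t].mean() for t in range(n_teachers)])
print(f"corr(estimated, true teacher effects):      {np.corrcoef(estimated, true_effect)[0, 1]:.2f}")
print(f"corr(estimated, teachers' mean background): {np.corrcoef(estimated, mean_bg)[0, 1]:.2f}")
```

In this setup the estimated "teacher effects" track the teachers' average student background as well as their true effects, which is the kind of confounding the analysis cited above warns about.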
Similarly, decisions about whether to base judgments on a single-year, spring-to-
spring change; on a multiyear change; or on a fall-to-spring gain will affect inferences
about schools’ achievement growth. Fall-to-spring changes may be corrupted by prob-
lems related to scale conversion, practice effects, and administration dates (Linn,
Dunbar, Harnisch, & Hastings, 1982), whereas spring-to-spring changes will con-
found academic year learning with summer trends and may create a bias as a result of
socioeconomic differences in summer learning growth (Alexander, Entwisle, & Olson,
2001). Some researchers have suggested that any single-year measure of change is inher-
ently unstable owing to changes in the composition of student cohorts and other
factors contributing to error (Kane & Staiger, 2002; Linn & Haug, 2002) and have
recommended the use of multiyear changes to reduce this instability (but see Rogosa,
2003, for an alternative perspective on the problem of instability of change). There is
no clear consensus on which approach is most desirable; again, the benefits and limi-
tations of each should be carefully weighed in light of the goals of the specific system
being implemented.
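The instability of single-year change measures can also be illustrated with simulated data: even when no school's true performance changes at all, cohort-to-cohort sampling variability produces sizable apparent gains and losses, especially for small schools, and averaging over multiple cohorts damps the noise. The sketch below uses assumed school sizes and within-school score variation; it is a simplification (for example, it treats adjacent cohorts as independent) rather than a model of any particular system.

```python
"""Minimal sketch, with assumed parameters: apparent year-to-year changes in
school means arise from cohort sampling variability even with no true change."""
import numpy as np

rng = np.random.default_rng(6)
n_schools, within_sd = 2000, 1.0      # student scores in SD units; no true change

def sd_of_apparent_change(n_students, n_cohorts=1):
    """SD across schools of the change in mean score between two adjacent
    reporting periods, each averaging n_cohorts independent cohorts."""
    period = lambda: rng.normal(0, within_sd,
                                (n_schools, n_cohorts, n_students)).mean(axis=(1, 2))
    return np.std(period() - period())

for n_students in (25, 100, 400):
    print(f"{n_students:>3} students per grade: single-year change SD = "
          f"{sd_of_apparent_change(n_students):.2f} SD; "
          f"three-cohort average = {sd_of_apparent_change(n_students, 3):.2f} SD")
```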
Other features of test-based accountability systems, in addition to the two discussed
here, are likely to affect the validity and utility of information from those tests. For
example, NCLB requires states to use a single test form for students at different levels
of proficiency, a requirement that is likely to compromise the utility of information for
some students: Tests designed to discriminate well among average students will pro-
vide imprecise information at the extremes of the distribution. In particular, requiring
students with moderate or severe cognitive disabilities to take the same test that other
students take will result in a test that fails to measure the performance of these students
accurately (Koretz, 2003a). Although many of the provisions may be intended to pro-
mote effective instruction and more equitable results, the ways in which these provi-
sions affect scores must be taken into consideration when interpreting performance
results.

Effects of Classroom-Based Assessment


The previous discussion revealed the diverse and sometimes significant effects that
large-scale, high-stakes testing can have on schools and students. Because of the rela-
tive prominence of classroom-based assessment in students’ lives, it is worth consid-
ering what is known about the effects of that type of assessment. Many classroom
assessments have features that large-scale assessments lack and that are potentially use-
ful for instruction. In particular, they often provide results in a timely manner and in
a form that allows teachers to diagnose individual strengths and weaknesses on par-
ticular skills or topics (National Research Council, 2001). Black and Wiliam (1998)
published a comprehensive review of research on this topic. Their synthesis of quan-
titative studies found that improving teachers’ classroom assessment capabilities could
lead to student achievement gains of more than half a standard deviation, which is
large relative to the effects of most educational interventions. Positive effects were
most likely to be observed when students received feedback on their performance and
when they were given advice on how to improve (see also Kluger & DeNisi, 1996,
and Sadler, 1989, for evidence on the effectiveness of instructional feedback and guid-
ance). A review by Bangert-Drowns, Kulik, and Kulik (1991) summarized evidence
on how the effects of classroom-based assessment varied by the frequency of such
assessment; they concluded that testing was most likely to be effective if administered
once or twice per week. Together, these studies provide some guidance for develop-
ing an effective classroom-based assessment system.
At the same time, research suggests that a substantial amount of effort and addi-
tional training will be necessary to ensure that teachers are equipped to implement
high-quality classroom-based assessments. Reviews of teachers’ classroom assessment
practices have identified several weaknesses, including a focus on superficial kinds of
learning, a lack of cross-classroom collaboration on assessment, and an emphasis on
grading and normative information rather than on providing feedback to promote
learning (Black, 1993; Crooks, 1988). The increasing prominence and importance of
high-stakes testing in teachers’ lives, moreover, means that any attempt to promote bet-
ter use of classroom-based assessment must be informed by an understanding of how
teachers’ actions are shaped by this external form of testing. Teachers often model their
own classroom tests after the large-scale, standardized tests with which they are famil-
iar, thereby producing classroom assessments that are not as useful or beneficial as they
could be (Bennett, Wragg, Carre, & Carter, 1992; Gifford & O’Connor, 1992; Linn,
2000; Mabry et al., 2003). Some organizations have begun to try to bridge the two
forms of assessment by providing “interim” or “benchmark” tests that mirror the state
test but provide ongoing, formative feedback. The effects of this type of bridging are
not yet known, but it is one promising approach to helping teachers make better use
of assessment information and, possibly, reducing their reliance on a single test form
as an indicator of what they should be teaching.

IMPROVING THE UTILITY OF LARGE-SCALE ASSESSMENT


The previous sections revealed that both large-scale and classroom-based assess-
ment can exert significant effects on students, and that additional research is needed
to understand the contextual factors that influence these effects. Testing policies take
many forms, and the specific features of these policies will influence their outcomes.
This section is intended to provide general guidelines for promoting effective testing
policies, given what we currently know about testing’s effects. It emphasizes three
categories of actions: better test design, better accountability system design, and the
incorporation of classroom-based assessment into large-scale testing and account-
ability systems.
There are several documents that are widely accessible and that provide guidelines
for appropriate test use. Most important, perhaps, are the Standards for Educational
and Psychological Testing (AERA, APA, & NCME, 1999), which delineate specific
guidelines for ensuring test validity, reliability, and fairness. In addition, CRESST
has published standards for accountability systems (Baker, Linn, Herman, & Koretz,
2002) that provide suggestions for designing test-based accountability systems that
promote desired outcomes. The American Educational Research Association’s posi-
tion statement on high-stakes testing (AERA, 2000) also offers guidelines for appro-
priate test use. This section draws from these and other sources to suggest ways to
improve the use of large-scale assessments, with an emphasis on policy-related uses
of such tests. The discussion focuses on practices that have been demonstrated empir-
ically to improve the interpretation or use of test results or that have been repeatedly
advocated by testing and accountability experts. It is not a complete list, but provides
a broad set of suggestions that can serve as a starting point in the design of an effec-
tive test-based accountability system.

Improving Test Design and Validation


Begin with a clear definition of the construct to be measured. The design of a test must
be informed by an understanding of what the test is intended to measure, and efforts
must be made to communicate this definition to all users of the test results (Millman
& Greene, 1989). If a test is to serve its intended purposes well, its content must be
consistent with the visions of its developers but also with the inferences users make.
Users’ inferences are more likely to be consistent with those of the test developers if the
constructs the test is intended to measure are communicated clearly to all users than if
users are left to form their own impressions. Information must be gathered to under-
stand what users think test results mean and to evaluate the degree to which these infer-
ences are supported by the test (Koretz et al., 2001; Stake, 1999). The gathering of this
information should be considered an integral part of the validation process.
Conduct validity investigations that illuminate the knowledge and cognitive processes
tapped by the test items. Understanding the cognitive processes tapped by test items is
critical for evaluating alignment among standards, tests, and curriculum, as well as for
helping teachers focus on these processes rather than only on the content that is evident
through superficial inspection of items. Analysis of cognitive processes may include
analysis of think-aloud protocols, in which students are asked to work through a task
or item while verbalizing their solution strategies (Ericsson & Simon, 1984). It also
may include asking students to discuss their reasons for selecting or producing a par-
ticular solution and analyzing the solution path or errors that students make as they
work through the problem (Messick, 1989; National Research Council, 1991).
Include a thorough investigation of fairness as part of the validation effort. Traditional
psychometric methods for detecting differential item functioning should be accom-
panied by cognitive process investigations to identify possible reasons for such differ-
ences (see, e.g., Lane et al., 1996). Investigations of fairness must pay special attention
to students with disabilities and English language learners, particularly with respect to
the validity of scores that result from accommodated administrations (Abedi, 2003;
Koretz & Hamilton, 2000; Tindal et al., 1998).
Design tests to reduce score inflation. Test scores that are influenced by score infla-
tion are of limited utility for helping parents understand their children’s performance
and for contributing to effective decision making on the part of educators and poli-
cymakers. Those responsible for selecting or developing large-scale tests for account-
ability systems should design these systems in ways that minimize the likelihood of
score inflation. Approaches to reducing score inflation include (but are not limited to)
changing the test items each year and varying the formats of the items rather than
relying on a single format such as multiple choice. Teacher and principal professional
development, as discussed subsequently, may also help avoid the kind of inappropri-
ate narrowing that leads to score inflation.
Utilize technology to improve testing. Information technology is increasingly being
used to enhance testing programs at all levels of the education system. Although there
may be significant logistical barriers to overcome for states or districts to adopt large-
scale, computer-based testing programs, technology has the potential to improve test-
ing systems in a number of ways. Perhaps most important, computers can provide a
cost-effective way to assess skills and knowledge that are difficult or even impossible
to measure with paper and pencil; the architecture test described by Katz, Martinez,
Sheehan, and Tatsuoka (1993) and the examples provided by Bennett (1998) illustrate
this potential. To the extent that instruction involves technology, the use of technol-
ogy for testing may improve alignment between instruction and assessment, and there-
fore produce more valid information about student proficiency. For example, Russell
and Haney (1997) examined the writing performance of students who were accustomed
to composing essays on computers in their writing classes. These students received
higher scores on a writing test that used computers than on a paper-and-pencil writing
test; they wrote more and displayed better organizational skills. This study suggests that
the validity of the test for evaluating instructional effects was enhanced when com-
puters were used. Finally, computers may facilitate a shift away from the single-form
standardized test, which may be especially subject to score inflation—the use of com-
puterized adaptive testing (CAT) has particular appeal in this regard. Although certain
forms of CAT appear to be at odds with NCLB’s prohibition against out-of-level test-
ing (Olson, 2003), it is likely that a solution will be worked out to allow some form of
such testing in state assessment programs.

Improving Accountability System Design


Include an audit testing mechanism. The validity of inferences users (including poli-
cymakers, educators, parents, students, and other members of the public) make from
scores on high-stakes tests may be enhanced if users are given good information about
how performance on the high-stakes test compares with that on other tests of the same
content. Although it is not necessarily a straightforward task to determine how much
score inflation has occurred based on discrepancies in trends, a necessary first step is
to collect the information necessary to investigate the phenomenon. An audit test such
as NAEP can provide such information and can reveal score inflation in its early stages
so that additional steps can be taken to reduce it. NCLB requires samples of students
in each state to participate in NAEP “in order to help the U.S. Department of Educa-
tion verify the results of statewide assessments” (U.S. Department of Education, 2002).
Interpretation of discrepancies between NAEP and state test score trends is not
straightforward. A number of factors other than inflation can contribute to differences
in how the two tests function; these include differences in student motivation, defini-
tions of performance levels, content sampled by the tests, and test administration
rules (National Assessment Governing Board, 2002). In addition, there are a number
of ways to display the data from the two tests, including reporting scale score aver-
ages, tallying the percentages of students meeting each performance level, examining
the magnitudes of racial/ethnic gaps in performance, and displaying the entire distri-
bution of scores on each test, and these have implications for the interpretation of dis-
crepancies (National Assessment Governing Board, 2002). If an audit test such as
NAEP is to be useful, educators and policymakers will need clear guidance on how to
examine and interpret the results.
Present test results in a format useful for teachers. Interviews with teachers suggest that
clearer and more timely reporting of results would improve the utility of scores for
instructional purposes (see, e.g., Clarke et al., 2003). Reports should include informa-
tion about individual students’ performance as well as information about the accuracy
of scores. In addition, parents and educators need to be given clear, accessible guidance
to help them interpret and use information from large-scale tests, and this information
must be relevant to the ways in which test scores are used. A standard internal consis-
tency reliability coefficient, for example, may be interpreted by users as an indicator of
the trustworthiness and stability of scores, but it is not useful for understanding rates
of error in a testing program that uses performance levels or cut scores; instead, mis-
classification rates would be more informative. The utility of score reports would also
be enhanced by a reporting strategy that breaks down student performance according
to specific standards or topic areas, particularly those that are deemed to be most impor-
tant, rather than reporting only a global score (Commission on Instructionally Sup-
portive Assessment, 2001). Of course, because subscores are likely to have significant
measurement error associated with them, reporting information about reliability is
important at the subscore level as well as at the total score level.
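To illustrate why misclassification rates are the more relevant quantity for a cut-score-based system, the sketch below simulates observed scores from true scores at an assumed reliability of .90 and counts how often the proficient/not-proficient classification based on the observed score differs from the one implied by the true score; all values are hypothetical.

```python
"""Minimal sketch, with assumed reliability and cut score: a high reliability
coefficient still implies non-trivial misclassification near the cut."""
import numpy as np

rng = np.random.default_rng(5)
reliability = 0.90                # assumed internal consistency
cut = 0.5                         # assumed "proficient" cut, in true-score SD units
n = 200_000

true = rng.normal(0, 1, n)
error_sd = np.sqrt((1 - reliability) / reliability)   # SEM when the true-score SD is 1
observed = true + rng.normal(0, error_sd, n)

flipped = (true >= cut) != (observed >= cut)
near_cut = np.abs(true - cut) < 0.25
print(f"overall misclassification rate:       {flipped.mean():.1%}")
print(f"misclassification rate near the cut:  {flipped[near_cut].mean():.1%}")
```

Reported alongside the reliability coefficient, figures of this kind tell parents and educators something the coefficient alone does not: how trustworthy the proficiency label is for students whose scores fall near the cut.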
Measure instructional practices and other responses of educators in addition to student
achievement. Clearly, the reactions of educators to the accountability system are criti-
cal determinants of whether the system raises student achievement and of the validity
of information the system produces. Systems should monitor practices, ideally through
a variety of formal and informal methods such as interviews or surveys administered to
samples of teachers across schools (Hamilton & Stecher, 2002). Efforts should be made
to triangulate data from multiple sources, such as by surveying or interviewing students
in addition to teachers (Lane, Parke, & Stone, 1998). Information on how teachers
change their practices, what topics do and do not get taught, and whether the incen-
tive system has affected morale would help administrators and policymakers gauge the
effectiveness of the system and modify it if it is not working as intended. Efforts to dis-
tinguish between the practices used by successful and unsuccessful schools, or to link
specific practices with score gains (see, e.g., Lane et al., in press; Stecher et al., 2000;
Stone & Lane, 2003), could be particularly valuable. Measurement of practice is not
straightforward; addressing the diversity of practice across classrooms, the possibility of
response bias, the likely political backlash, and the high costs involved are all impedi-
ments to implementing a large-scale system that measures practice (Koretz, 2003b).
Despite these problems, the feasibility of monitoring classroom practices and other
responses, even among a sample of teachers and schools, should be explored.
Conduct ongoing studies of alignment among standards, assessment, curriculum, and
instruction. Although most states claim to have tests that are aligned with standards,
a 2001 review conducted by Achieve, Inc., of 10 states indicated that only one,
Massachusetts, had attained an acceptable degree of alignment (Achieve, Inc., 2002).
In particular, Achieve’s studies of alignment have shown that while individual items
are well aligned to the specific standard to which they are mapped, many standards are
omitted from the test, especially the more challenging standards that focus on higher
level cognitive processes (Rothman, Slattery, Vranek, & Resnick, 2002). As the Achieve
authors point out, this pattern is likely to lead to instruction focused on lower level
skills as teachers become more familiar with the content of the tests. Efforts to ensure
alignment need to take into consideration the cognitive complexity and level of chal-
lenge associated with test items and standards, in addition to examining content match.
The rubric used by Achieve in the study just mentioned provides a good example of a
methodology for doing so. Examinations of alignment with curriculum and instruc-
tion must also address the cognitive complexity of tasks in addition to their content.
In addition to Achieve’s approach, methods for evaluating multiple dimensions of
alignment among tests, standards, curriculum, and instruction have been developed
by Porter (2002) and Webb (1997). Some of the difficulties inherent in alignment
studies are described by Bhola, Impara, and Buckendahl (2003) and by Koretz and
Hamilton (in press); these include differences in the specificity of standards across
states and the difficulty of determining what cognitive processes an item requires, par-
ticularly in light of the fact that the cognitive demands of an item may vary as a func-
tion of examinee proficiency. Despite these difficulties, the importance of alignment
to the proper functioning of accountability systems requires efforts to evaluate its
multiple dimensions.
Provide professional development to help teachers implement the standards and engage
in appropriate test preparation. A study by Smith and colleagues (1997) revealed that
fewer than one fifth of responding teachers in Arizona believed that they had received
adequate professional development to respond to the state’s testing program. Teach-
ers typically receive little exposure to techniques of assessment development and inter-
pretation during their preservice training (Cizek, 2000; Ward, 1980), and until this
changes, in-service professional development will be vital. Participants in an interview
study conducted by Clarke et al. (2003) mentioned the availability of professional
development as a critical factor determining how standards-based reform is likely to
affect classroom practices. Districts and schools both play important roles in providing
teachers with the necessary support and resources to implement change in the class-
room (Massell, Kirst, & Hoppe, 1997) and should both be attending to the professional
development needs of teachers. Schools and districts also need to work to promote
effective networks and communication processes so that teachers and higher level
administrators share a common vision about how testing and accountability policies
should be implemented (Spillane & Thompson, 1997). To the extent possible, pro-
fessional development should help teachers identify and learn to use instructional
strategies that are known to be effective; without a repertoire of such strategies, teach-
ers working in high-stakes testing contexts may resort to rote forms of test prepara-
tion (Airasian, 1988). Validity investigations should obtain input from teachers and,
when feasible, students on the usefulness of the test for their purposes, particularly
when tests are intended to support instructional change (Tittle, 1989).
Utilize technology to improve accountability system design. Information technology not
only has great potential to improve test design but may also facilitate the adoption of
effective test-based accountability systems. The timeliness of feedback that computer-
ized test administration provides can help make assessment results more useful to edu-
cators, students, parents, and policymakers, and computer technology, including the
Internet, may provide ways to improve the cost-effectiveness of testing, enhance test
security, and reduce score inflation (Hamilton, Klein, & Lorie, 2000). States and dis-
tricts may currently lack the infrastructure necessary to take advantage of the benefits
that computers and the Internet have to offer, but it is likely that future versions of dis-
trict and state accountability systems will rely heavily on technology. Technology also
plays a key role in improving state and district data systems. Many of these entities cur-
rently lack a system that can track individual growth over time or produce information
in a useful format; improving these systems is critical to the development of testing
policies that promote good decision making (Hamilton & Stecher, 2002).

Incorporating Classroom-Based Assessment Into Test-Based Accountability Systems
Supplement large-scale tests with some form of classroom-based assessment. The brief
review of research on classroom assessment provided earlier suggests some clear ben-
efits that are derived from a good classroom-based assessment system. Providing such
a system as an integral part of a large-scale testing and accountability system, rather
than relying on teachers or local administrators to find or develop their own systems,
can enhance the degree of alignment among classroom activities, state standards, and
state tests, thereby producing a system that reflects a coherent vision of learning across
levels and components of the system (National Research Council, 2001); in addition,
it may reduce the tendency of teachers to develop instruments that look just like the
external test. Also, external providers are more likely than classroom teachers to have
the resources and skills necessary to produce classroom assessment systems that have
scoring and analysis tools designed to provide useful information; teachers often lack
the knowledge needed to develop effective assessments, and even if they possess this
knowledge, time constraints are likely to prevent them from producing systems that
provide the kind of summary information that is likely to be useful (Cizek, 2000;
Dwyer, 1998).
Utilize technology to improve the administration, scoring, and reporting of results from
classroom assessments. Information technology has been used in a wide range of class-
room assessment applications. Some of these are “benchmark” or “interim” assessment
systems designed to resemble the large-scale test and to provide ongoing feedback to
teachers on students’ success at meeting state standards. Other applications of tech-
nology provide rich, diagnostic information that is unavailable through traditional,
paper-and-pencil forms of testing (Koedinger, Anderson, Hadley, & Mark, 1997;
National Research Council, 2001; Wenger, 1987). Technology can provide oppor-
tunities for individualizing assessment and feedback and can maintain records of
students’ strategies as they progress through a problem or task, providing important
information that is unavailable when only final responses are observed. This type of
application serves a somewhat different role from the interim assessments that are
designed to look like the state test, and can facilitate the integration of instruction and
assessment (Bennett, 1998).
Provide professional development to help educators make better use of classroom-based
assessment. As with large-scale assessment, teachers and principals need to be offered
professional development designed to help them use and interpret results from assess-
ments. In addition, educators should be given training to help them improve their own
assessment-development skills (see, e.g., Commission on Instructionally Supportive
Assessment, 2001).
These guidelines are not intended to be exhaustive, but they do provide a summary
of advice that has been given by a wide variety of researchers and educators and can be
used as input to the development and evaluation of accountability systems that rely on
tests to shape policy and practice. It is important to recognize, however, that any orga-
nization or institution responsible for implementing a large-scale testing program faces
inevitable constraints in regard to time, personnel, and financial resources. These con-
straints require a prioritization of actions to promote high-quality testing in the con-
text of limited resources. Some of the suggestions just presented, such as incorporation
of information technology into testing programs, represent potentially promising
approaches that may improve test quality in the long run but may not be immediately
necessary. Others, such as the need for better validity investigations, deserve higher pri-
ority. Policymakers, administrators, and others who are charged with the task of devel-
oping or modifying a large-scale testing program need to weigh the options and their
likely effects on the quality of the program.
This list of guidelines illustrates the degree to which the effects of any large-scale
testing system will depend to a great extent on the details of implementation—what is
measured, how the measures are validated, what support is given to educators, and so
on. At the same time, testing policies should not be viewed as panaceas for education
reform. There are limitations inherent in this approach to improving education policy
and practice. I turn to some of these limitations in the final section of this chapter.

LIMITATIONS OF LARGE-SCALE TESTS AS POLICY TOOLS


Policymakers and others who wish to reform the education system have put forth
an almost unlimited number of recommendations, but few proposed approaches have
gained the wide, bipartisan support of test-based accountability. Even the most care-
fully constructed accountability system, however, is unlikely to meet the high ambi-
tions of accountability proponents and is bound to produce some unintended and
perhaps undesirable consequences. There are a number of reasons why this is the case:
(a) Accountability systems typically pin too many hopes on a single battery of tests, and
we know that overemphasis on a single set of measures can produce negative conse-
quences; (b) other factors in addition to the testing system will influence what students,
parents, educators, and policymakers do, and it is impossible to design a system that
controls the responses of all of these actors; and (c) despite the large body of research
and the growing consensus among professionals regarding the negative effects of test-
ing and how to prevent them, there is little evidence that points conclusively to one
particular approach or another, and we simply do not know enough to design a bul-
letproof system. I discuss each of these reasons briefly.
As mentioned earlier in this chapter and in almost every set of guidelines for testing
that has been published, using a single test (or set of tests) for multiple purposes can
diminish the value of that test for any single purpose. Most of today’s test-based
accountability systems rely on tests to produce information for external monitoring
purposes, to help teachers shape their instruction, to motivate teachers and students to
improve their performance, to inform parental choice, and to facilitate decisions about
which schools to reward and which to close. These purposes are frequently in conflict
with one another. For example, the characteristics of a test that make it suitable for pro-
viding instructional feedback (e.g., short testing time, individualized administra-
tion, heavy reliance on specific prior instruction) will tend to make it inappropriate for
accountability purposes, and vice versa (Mehrens, 1998). The systems of large-scale and
classroom-based assessments described earlier address this to some degree, but these sys-
tems inherently acknowledge that more than one set of instruments will be necessary
to serve both sets of purposes. Similarly, the value of a testing system for external mon-
itoring purposes may be diminished by reallocation if high stakes are attached to scores.
In short, pinning too many hopes to a single testing program is unlikely to produce the
desired results and may lead to adverse consequences.
Similarly, it is widely recognized in the education and measurement communities
that a single test score should not be the sole basis for important decisions about indi-
viduals or groups. According to the National Research Council (1999b), “scores from
large-scale assessments should never be the only sources of information used to make
a promotion or retention decision” (p. 286), and most professional organizations make
similar assertions about the need to use multiple measures and avoid reliance on a sin-
gle test score for any high-stakes decision (AERA, APA, & NCME, 1999; Baker et al.,
2002; National Association for the Education of Young Children, 1988). Despite this
consensus regarding the need for multiple measures, many state and district testing sys-
tems violate this admonition by using tests to deny diplomas, retain students in grade,
or reward or penalize educators. One of the problems is a lack of agreement on the
meaning of “multiple measures”—that is, if a math score and a reading score are com-
bined, or if students get five chances to pass an exit exam, does this constitute relying
on a single test score? Even within the professional education and measurement com-
munities, there is disagreement about how to define “multiple measure” or “single test
score” (Mehrens, 2000; Phillips, 2000).
A second reason for the limited utility of testing as a policy tool is the wide range of
other factors influencing the actions of participants in the system. The success of any
test-based accountability system will depend to a large degree on the capacity (includ-
ing human, financial, and material resources) of teachers, administrators, parents, stu-
dents, and policymakers to respond to the system in effective ways. Thus, factors such
as amount and quality of curriculum materials, availability and appropriateness of
training, and quality of collegial relationships within a school will affect what teachers
do, but teachers’ actions will also be influenced by their own prior knowledge and
belief systems about how students learn and what they should learn. Parents’ capacity
to use test-score data to make good decisions about homework help, tutoring, or school
choice will depend in part on how the system makes data accessible to them, but also
on parents’ own skill levels at interpreting data, their prior experiences with testing, and
their beliefs about what test scores mean. Accountability systems can be designed to
maximize the likelihood of desirable responses, but ultimately they cannot guarantee
any specific outcome.
Finally, there is simply too much that we currently do not know about how to
design testing policies that promote desirable outcomes and prevent undesirable ones.
The knowledge base on this topic grows every year and will surely increase as states,
districts, and schools learn from their attempts to meet the requirements of NCLB,
but the search for answers to questions about how to minimize score inflation and
promote effective instruction is likely to continue for many years. In the meantime,
it will be important to continue to gather evidence, both from large-scale studies and
from the individual experiences of teachers, administrators, and others affected by
test-based accountability, and to make that evidence available so that it can inform
efforts to improve accountability systems.

NOTES
1. Although some writers distinguish between the terms test and assessment, subsuming “test”
under the broader category of “assessment” (see, e.g., Cizek, 2000), these terms are used
interchangeably throughout this chapter.
2. Although the topic of standards is closely intertwined with that of large-scale testing,
the former topic is not addressed here because it is the subject of another chapter in this
volume.

3. Although this chapter focuses on testing in the United States, it is worth noting that
high-stakes testing, particularly in the form of exit exams, is prevalent in other nations
(Phelps, 2000). Some of the research discussed here was conducted in other countries, and
most of the issues presented in this chapter are relevant beyond the borders of the United
States.
4. As several measurement experts have pointed out, criterion-referenced reporting does not
necessarily require the use of cut scores (Glaser, 1963; Glass, 1976; Linn, 1994), though
the need to identify students achieving minimum competency and, more recently, the emphasis
on standards-based reporting have led to a heavy reliance on cut scores in large-scale
testing programs.
5. Classroom-based assessments are sometimes referred to as formative assessments, but both
classroom-based and large-scale assessments may be used for both formative and summative
purposes, so I rely on the term classroom-based assessments throughout the chapter.
6. Although aligning instruction with standards is generally believed to be a desirable
response, the quality and breadth of the standards, the nature of the alignment (e.g.,
whether instruction focuses only on a subset of standards), and the correspondence between
the standards and users’ inferences influence whether aligning has positive or negative
effects on validity.
7. Excessive coaching of this sort would generally be considered a negative response, but to
the extent that a moderate amount of such coaching is necessary to ensure the validity of
the test—for example, in situations in which some students may have had practice with a
particular format and others have not—it may be considered a desirable activity if it does
not unduly detract from the overall quality and content of instruction.

REFERENCES
Abedi, J. (2003). Impact of student language background on content-based performance: Analyses
of extant data (CSE Tech. Rep. 603). Los Angeles: Center for Research on Evaluation, Stan-
dards, and Student Testing.
Achieve, Inc. (2000). Setting the record straight (Achieve Policy Brief No. 1). Washington, DC:
Author.
Achieve, Inc. (2002). Aiming higher: The next decade of education reform in Maryland. Wash-
ington, DC: Author.
Airasian, P. W. (1988). Measurement-driven instruction: A closer look. Educational Measure-
ment: Issues and Practice, 7(4), 6–11.
Alexander, K. L., Entwisle, D. R., & Olson, L. S. (2001). Schools, achievement, and inequal-
ity: A seasonal perspective. Educational Evaluation and Policy Analysis, 23, 171–191.
American Educational Research Association. (2000). Position statement of the American
Educational Research Association concerning high-stakes testing in pre-K–12 education.
Educational Researcher, 29(8), 24–25.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing. Washington, DC: Author.
Amrein-Beardsley, A. A., & Berliner, D. C. (2003). Re-analysis of NAEP math and reading
scores in states with and without high-stakes tests: Response to Rosenshine. Education Policy
Analysis Archives, 11(25). Retrieved from http://epaa.asu.edu/epaa/v11n25/
Baker, E. L., Linn, R. L., Herman, J. L., & Koretz, D. M. (2002, Winter). Standards for edu-
cational accountability systems. CRESST Line, pp. 1–4.
Baker, E. L., & O’Neil, H. (1995). Diversity, assessment, and equity in educational reform.
In M. Nettles & A. Nettles (Eds.), Equity and excellence in educational testing and assessment
(pp. 69–87). Boston: Kluwer.
Baker, E. L., O’Neil, H. F., & Linn, R. L. (1993). Policy and validity prospects for performance-
based assessment. American Psychologist, 48, 1210–1218.

Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C. C. (1991). Effects of frequent classroom
testing. Journal of Educational Research, 85, 89–99.
Baxter, G. P., & Glaser, R. (1998). Investigating the cognitive complexity of science assess-
ments. Educational Measurement: Issues and Practice, 17(3), 37–45.
Baxter, G. P., Shavelson, R. J., Goldman, S. R., & Pine, J. (1992). Evaluation of procedure-based
scoring for hands-on science achievement. Journal of Educational Measurement, 29, 1–17.
Bennett, R. E. (1998). Reinventing assessment: Speculations on the future of large-scale educa-
tional testing. Princeton, NJ: Educational Testing Service.
Bennett, S. N., Wragg, E. C., Carre, C. G., & Carter, D. G. S. (1992). A longitudinal study
of primary teachers’ perceived competence in, and concerns about, national curriculum
implementation. Research Papers in Education, 7, 53–78.
Betts, J. R., & Costrell, R. M. (2001). Incentives and equity under standards-based reform.
In D. Ravitch (Ed.), Brookings papers on education policy: 2001 (pp. 9–74). Washington,
DC: Brookings Institution Press.
Bhola, D. S., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with states’ content
standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29.
Bishop, J. (1998). Do curriculum-based external exit exam systems enhance student achievement?
(CPRE Research Rep. RR-40). Philadelphia: Consortium for Policy Research in Education.
Bishop, J. H., & Mane, F. (1999). The New York state reform strategy: The incentive effects of
minimum competency exams. Philadelphia: National Center on Education in Inner Cities.
Black, P. J. (1993). Formative and summative assessment by teachers. Studies in Science Edu-
cation, 21, 49–97.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education,
5, 7–73.
Borko, H., & Elliott, R. (1999). Hands-on pedagogy versus hands-off accountability. Phi
Delta Kappan, 80, 394–400.
Borko, H., Elliott, R., & Uchiyama, K. (1999, April). Professional development: A key to Ken-
tucky’s educational reform effort. Paper presented at the annual meeting of the American
Educational Research Association, Montreal.
Camilli, G., & Shepard, L. A. (1994). Methods for detecting biased test items. Thousand Oaks,
CA: Sage.
Cannell, J. J. (1988). Nationally normed elementary achievement testing in America’s public
schools: How all 50 states are above the national average. Educational Measurement: Issues
and Practice, 7(2), 5–9.
Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A
cross-state analysis. Educational Evaluation and Policy Analysis, 24, 305–331.
Carnoy, M., Loeb, S., & Smith, T. (2001). Do higher state test scores in Texas make for better
high school outcomes? (CPRE Research Rep. RR-047). Philadelphia: Consortium for Policy
Research in Education.
Cimbricz, S. (2003). State-mandated testing and teachers’ beliefs and practice. Education Pol-
icy Analysis Archives, 10(2). Retrieved from http://epaa.asu.edu/epaa/v10n2.html
Cizek, G. J. (2000). Pockets of resistance in the education revolution. Educational Measure-
ment: Issues and Practice, 19(1), 16–23, 33.
Clarke, M., Shore, A., Rhoades, K., Abrams, L., Miao, J., & Li, J. (2003). Perceived effects of
state-mandated testing programs on teaching and learning: Findings from interviews with edu-
cators in low-, medium-, and high-stakes states. Boston: National Board on Educational Test-
ing and Public Policy.
Clotfelter, C. T., & Ladd, H. F. (1996). Recognizing and rewarding success in public schools.
In H. F. Ladd (Ed.), Holding schools accountable: Performance-based reform in education
(pp. 23–63). Washington, DC: Brookings Institution Press.
Cohen, D. K. (1995). What is the system in systemic reform? Educational Researcher, 24(9),
11–17.

Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measure-
ment (3rd ed., pp. 201–219). New York: Macmillan.
Commission on Instructionally Supportive Assessment. (2001). Building tests to support
instruction and accountability (report prepared for American Association of School Admin-
istrators, National Association of Elementary School Principals, National Association of
Secondary School Principals, National Education Association, and National Middle School
Association). Washington, DC: Author.
Corbett, H. D., & Wilson, B. L. (1988). Raising the stakes in statewide mandatory minimum
competency testing. In W. L. Boyd & C. T. Kerchner (Eds.), The politics of excellence and
choice in education: The 1987 Politics of Education Association yearbook (pp. 27–39). New
York: Falmer Press.
Corbett, H. D., & Wilson, B. L. (1991). Two state minimum competency testing programs and
their effects on curriculum and instruction. In R. E. Stake (Ed.), Advances in program evalu-
ation: Vol. I. Effects of mandated assessment on teaching (pp. 7–40). Greenwich, CT: JAI Press.
Cronbach, L. J. (1988). Five perspectives on the validity argument. In H. Wainer & H. I.
Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Erlbaum.
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of
Educational Research, 58, 438–481.
Darling-Hammond, L. (1991). The implications of testing policy for quality and equality.
Phi Delta Kappan, 73, 220–225.
Darling-Hammond, L. (2003). Standards and assessments: Where we are and what we need.
Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentID=
11109
Darling-Hammond, L., & Wise, A. E. (1985). Beyond standardization: State standards and
school improvement. Elementary School Journal, 85, 315–336.
Deere, D., & Strayer, W. (2003). Competitive incentives: School accountability and student out-
comes in Texas. Unpublished working paper.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development
and use of performance assessments. Applied Measurement in Education, 4, 289–303.
Dwyer, C. A. (1998). Assessment and classroom learning: Theory and practice. Assessment in
Education, 5, 131–137.
Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data. Cambridge,
MA: MIT Press.
Figlio, D. N., & Getzler, L. S. (2002). Accountability, ability and disability: Gaming the sys-
tem (NBER Working Paper W9307). Cambridge, MA: National Bureau of Economic
Research.
Firestone, W., Mayrowetz, D., & Fairman, J. (1998). Performance-based assessment and
instructional change: The effects of testing in Maine and Maryland. Educational Evaluation
and Policy Analysis, 20, 95–113.
Frederiksen, N. (1994). The influence of minimum competency tests on teaching and learning.
Princeton, NJ: Educational Testing Service, Policy Information Center.
Gifford, B. R., & O’Connor, M. C. (1992). Changing assessments: Alternative views of apti-
tude, achievement, and instruction. Boston: Kluwer.
Gipps, C. (1999). Socio-cultural aspects of assessment. In A. Iran-Nejad & P. D. Pearson
(Eds.), Review of research in education (Vol. 24, pp. 355–392). Washington, DC: American
Educational Research Association.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some
questions. American Psychologist, 18, 519–521.
Glaser, R., Linn, R., & Bohrnstedt, G. (1997). Assessment in transition: Monitoring the nation’s
educational progress. New York: National Academy of Education.
Glass, G. V. (1976). Standards and criteria. Journal of Educational Measurement, 15, 237–261.

Goldhaber, D., & Hannaway, J. (2001, November). Accountability with a kicker: Observations
on the Florida A+ Accountability Plan. Paper presented at the annual meeting of the Associ-
ation of Public Policy and Management, Washington, DC.
Goslin, D. A. (1963). Teachers and testing. New York: Russell Sage Foundation.
Greene, J. P., Winters, M. A., & Forster, G. (2003). Testing high-stakes tests: Can we believe
the results of accountability tests? (Civic Rep. 33). New York: Manhattan Institute for Policy
Research.
Grissmer, D. W., & Flanagan, A. (1998). Exploring rapid score gains in Texas and North
Carolina. Washington, DC: National Education Goals Panel.
Grissmer, D. W., Flanagan, A., Kawata, J., & Williamson, S. (2000). Improving student
achievement: What state NAEP scores tell us (Publication MR-924-EDU). Santa Monica,
CA: RAND.
Haertel, E. H. (1999). Performance assessment and education reform. Phi Delta Kappan, 80,
662–666.
Hambleton, R. K., & Murphy, E. (1992). A psychometric perspective on authentic measure-
ment. Applied Measurement in Education, 5, 1–16.
Hambleton, R. K., & Slater, S. C. (1995). Do policymakers and educators understand NAEP
reports? Washington, DC: National Center for Education Statistics.
Hamilton, L. S. (1998). Gender differences on high school science achievement tests: Do for-
mat and content matter? Educational Evaluation and Policy Analysis, 20, 179–195.
Hamilton, L. S. (1999). Detecting gender-based differential item functioning on a constructed-
response science test. Applied Measurement in Education, 12, 211–235.
Hamilton, L. S., Klein, S. P., & Lorie, W. (2000). Using Web-based testing for large-scale assess-
ments. Santa Monica, CA: RAND.
Hamilton, L. S., & Koretz, D. M. (2002). Tests and their use in test-based accountability sys-
tems. In L. S. Hamilton, B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-based
accountability in education (pp. 13–49). Santa Monica, CA: RAND.
Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). Interview procedures for validat-
ing science assessments. Applied Measurement in Education, 10, 181–200.
Hamilton, L. S., & Stecher, B. M. (2002). Improving test-based accountability. In L. S.
Hamilton, B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-based accountability in
education (pp. 121–144). Santa Monica, CA: RAND.
Haney, W. (1981). Validity, Vaudeville, and values: A short history of social concerns over
standardized testing. American Psychologist, 36, 1021–1034.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis
Archives, 8(41). Retrieved from http://epaa.asu.edu/epaa/v8n41/
Hannaway, J., & McKay, S. (2001). Taking measure. Education Next, 1(3). Retrieved from
http://www.educationnext.org/20013/6hannaway.html
Jacob, B. A. (2001). Getting tough? The impact of high school graduation exams. Educational
Evaluation and Policy Analysis, 23, 99–122.
Jacob, B. A. (2002). Accountability, incentives, and behavior: The impact of high-stakes testing
in the Chicago Public Schools (NBER Working Paper 8968). Cambridge, MA: National
Bureau of Economic Research.
Jacob, B. A., & Levitt, S. D. (2002). Rotten apples: An investigation of the prevalence and pre-
dictors of teacher cheating (NBER Working Paper 9413). Cambridge, MA: National Bureau
of Economic Research.
Jaeger, R. M. (1982). The final hurdle: Minimum competency achievement testing. In G. R.
Austin & H. Garber (Eds.), The rise and fall of national test scores. New York: Academic Press.
Jones, G., Jones, B. D., Hardin, B., Chapman, L., Yarbrough, T., & Davis, M. (1999). The
impact of high-stakes testing on teachers and students in North Carolina. Phi Delta Kap-
pan, 81, 199–203.

Kane, T. J., & Staiger, D. O. (2002). Volatility in school test scores: Implications for test-based
accountability systems. In D. Ravitch (Ed.), Brookings papers on education policy (pp. 235–269).
Washington, DC: Brookings Institution Press.
Katz, I. R., Martinez, M. E., Sheehan, K. M., & Tatsuoka, K. K. (1993). Extending the rule
space model to a semantically-rich domain: Diagnostic assessment in architecture. Princeton,
NJ: Educational Testing Service.
Kelley, C., Odden, A., Milanowski, A., & Heneman, H. (2000). The motivational effects of
school-based performance awards (CPRE Policy Brief RB-29). Philadelphia: Consortium for
Policy Research in Education.
Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test scores
in Texas tell us? Santa Monica, CA: RAND.
Klein, S. P., Jovanovic, J., Stecher, B. M., McCaffrey, D., Shavelson, R. J., Haertel, E., Solano-
Flores, G., & Comfort, K. (1997). Gender and racial/ethnic differences on performance
assessments in science. Educational Evaluation and Policy Analysis, 19, 83–97.
Kluger, A. N., & DeNisi, A. (1996). The effect of feedback interventions on performance: A
historical review, a meta-analysis, and a preliminary feedback intervention theory. Psycho-
logical Bulletin, 119, 254–284.
Koedinger, K. R., Anderson, J. R., Hadley, W. H., & Mark, M. A. (1997). Intelligent tutor-
ing goes to school in the big city. International Journal of Artificial Intelligence in Education,
8, 30–43.
Koretz, D. (1988). Arriving at Lake Wobegon: Are standardized tests exaggerating achieve-
ment and distorting instruction? American Educator, 12(2), 8–15, 46–52.
Koretz, D. (1992). State and national assessment. In M. C. Alkin (Ed.), Encyclopedia of edu-
cational research (6th ed., pp. 1262–1267). Washington, DC: American Educational
Research Association.
Koretz, D. (1997). The assessment of students with disabilities in Kentucky (CSE Tech. Rep.
431). Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Koretz, D. (2003a, April). Attempting to discern the effects of the NCLB accountability provisions
on learning. Paper presented at the annual meeting of the American Educational Research
Association, Chicago.
Koretz, D. (2003b). Using multiple measures to address perverse incentives and score infla-
tion. Educational Measurement: Issues and Practice, 22(2), 18–26.
Koretz, D. M., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional
Results Information System (KIRIS). Santa Monica, CA: RAND.
Koretz, D., Barron, S., Mitchell, K., & Stecher, B. (1996). The perceived effects of the Kentucky
Instructional Results Information System (KIRIS) (Publication MR-792-PCT/FF). Santa
Monica, CA: RAND.
Koretz, D. M., & Deibert, E. (1993). Interpretations of National Assessment of Educational
Progress (NAEP) anchor points and achievement levels by the print media in 1991. Santa Mon-
ica, CA: RAND.
Koretz, D., & Deibert, E. (1996). Setting standards and interpreting achievement: A cau-
tionary tale from the National Assessment of Educational Progress. Educational Assessment,
3, 53–81.
Koretz, D. M., & Hamilton, L. S. (2000). Assessment of students with disabilities in Ken-
tucky: Inclusion, student performance, and validity. Educational Evaluation and Policy
Analysis, 22, 255–272.
Koretz, D. M., & Hamilton, L. S. (2003). Teachers’ responses to high-stakes testing and the
validity of gains: A pilot study (CSE Tech. Rep. 610). Los Angeles: Center for Research on
Evaluation, Standards, and Student Testing.
Koretz, D. M., & Hamilton, L. S. (in press). K–12 group testing. In R. Brennan (Ed.), Edu-
cational measurement (4th ed.). Westport, CT: American Council on Education/Praeger.

Koretz, D. M., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991, April). The effects of high-
stakes testing on achievement: Preliminary findings about generalization across tests. Paper pre-
sented at the annual meeting of the American Educational Research Association, Chicago.
Koretz, D. M., McCaffrey, D. F., & Hamilton, L. S. (2001). Toward a framework for vali-
dating gains under high-stakes conditions (CSE Tech. Rep. 551). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Koretz, D., Mitchell, K., Barron, S., & Keith, S. (1996). The perceived effects of the Maryland
School Performance Assessment Program (CSE Tech. Rep. 409). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). The Vermont portfolio assessment
program: Findings and implications. Educational Measurement: Issues and Practice, 13(3), 5–16.
Kupermintz, H. (2002). Teacher effects as a measure of teacher effectiveness: Construct validity
considerations in TVAAS (Tennessee Value-Added Assessment System) (CSE Tech. Rep. 563).
Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Ladd, H. F. (1999). The Dallas school accountability and incentive program: An evaluation
of its impacts on student outcomes. Economics of Education Review, 18, 1–16.
Lane, S., Parke, C. S., & Stone, C. A. (1998). A framework for evaluating the consequences
of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24–28.
Lane, S., Parke, C. S., & Stone, C. A. (in press). The impact of a state performance-based assess-
ment and accountability program on mathematics instruction and student learning: Evidence
from survey data and school performance. Educational Assessment.
Lane, S., Wang, N., & Magone, M. (1996). Gender related differential item functioning on
a middle-school mathematics performance assessment. Educational Measurement: Issues and
Practice, 15(4), 21–27, 31.
Lillard, D. R., & DeCicca, P. P. (2001). Higher standards, more dropouts? Evidence within
and across time. Economics of Education Review, 20, 459–473.
Linn, R. L. (1993). Educational assessment: Expanded expectations and challenges. Educa-
tional Evaluation and Policy Analysis, 15, 1–16.
Linn, R. L. (1994). Criterion-referenced measurement: A valuable perspective clouded by sur-
plus meaning. Educational Measurement: Issues and Practice, 13(4), 12–14.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4–16.
Linn, R. L. (2001). The design and evaluation of educational assessment and accountability systems
(CSE Tech. Rep. 539). Los Angeles: Center for Research on Evaluation, Standards, and Stu-
dent Testing.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance-based assessment:
Expectations and validation criteria. Educational Researcher, 20(8), 15–21.
Linn, R. L., Dunbar, S. B., Harnisch, D. L., & Hastings, C. N. (1982). The validity of the Title
I Evaluation and Reporting System. In E. R. House, S. Mathison, J. Pearsol, & H. Preskill
(Eds.), Evaluation studies review annual (Vol. 7, pp. 427–442). Beverly Hills, CA: Sage.
Linn, R. L., Graue, M. E., & Sanders, N. M. (1990). Comparing state and district test results
to national norms: The validity of claims that “everyone is above average.” Educational
Measurement: Issues and Practice, 9, 5–14.
Linn, R. L., & Haug, C. (2002). Stability of school-building accountability scores and gains.
Educational Evaluation and Policy Analysis, 24, 29–36.
Linn, R. L., Koretz, D. M., Baker, E. L., & Burstein, L. (1992). The validity and credibility of
the achievement levels for the 1990 National Assessment of Educational Progress in mathemat-
ics (CSE Tech. Rep. 330). Los Angeles: Center for the Study of Evaluation.
Mabry, L. (1999). Writing to the rubric: Lingering effects of traditional standardized testing
on direct writing assessment. Phi Delta Kappan, 80, 673–679.
Mabry, L., Poole, J., Redmond, L., & Schultz, A. (2003). Local impact of state testing in south-
west Washington. Education Policy Analysis Archives, 11(22). Retrieved from http://epaa.
asu.edu/epaa/v11n22/

Madaus, G. (1988). The influence of testing on the curriculum. In L. Tanner (Ed.), Critical
issues in curriculum: 87th yearbook of the NSSE, Part 1 (pp. 83–121). Chicago: University
of Chicago Press.
Madaus, G. (1993). A national testing system: Manna from above? A historical/technological
perspective. Educational Assessment, 1, 9–26.
Madaus, G. F., & O’Dwyer, L. M. (1999). A short history of performance assessment:
Lessons learned. Phi Delta Kappan, 80, 688–695.
Massell, D., Kirst, M., & Hoppe, M. (1997). Persistence and change: Standards-based systemic
reform in nine states (CPRE Policy Brief RB-21). Philadelphia: Consortium for Policy Research
in Education.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating
value-added models for teacher accountability. Santa Monica, CA: RAND.
McDonnell, L. M. (2002). Accountability as seen through a political lens. In L. S. Hamilton,
B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-based accountability in education
(pp. 101–120). Santa Monica, CA: RAND.
McDonnell, L. M., & Choisser, C. (1997). Testing and teaching: Local implementation of new
state assessments (CSE Tech. Rep. 442). Los Angeles: Center for Research on Evaluation,
Standards, and Student Testing.
McNeil, L. M. (2000). Creating new inequalities: Contradictions of reform. Phi Delta Kap-
pan, 81, 729–734.
Mehrens, W. A. (1998). Consequences of assessment: What is the evidence? Education Policy
Analysis Archives, 6(13). Retrieved from http://epaa.asu.edu/epaa/v6n13.html
Mehrens, W. A. (2000). Defending a state graduation test: GI Forum v. Texas Education
Agency. Measurement perspectives from an external evaluator. Applied Measurement in Edu-
cation, 13, 387–401.
Mehrens, W. A., & Kaminski, J. (1989). Methods for improving standardized test scores:
Fruitful, fruitless or fraudulent? Educational Measurement: Issues and Practice, 8(1), 14–22.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–104).
New York: Macmillan.
Meyer, L., Orlofsky, G. F., Skinner, R. A., & Spicer, S. (2002). The state of the states. Edu-
cation Week, 21(17), 68–92.
Miller, M. D. (1998). Teacher uses and perceptions of the impact of statewide performance-based
assessments. Washington, DC: Council of Chief State School Officers, State Education
Assessment Center.
Miller, M. D., & Seraphine, A. E. (1993). Can test scores remain authentic when teaching to
the test? Educational Assessment, 1, 119–129.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement
and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335–366). New
York: Macmillan.
National Assessment Governing Board. (2002). Using the National Assessment of Educational
Progress to confirm state test results. Washington, DC: Author.
National Association for the Education of Young Children. (1988). NAEYC position state-
ment on standardized testing of young children 3 through 8 years of age, adopted Novem-
ber 1987. Young Children, 43(3), 42–47.
National Center for Education Statistics. (1996). Technical issues in large-scale performance
assessment. Washington, DC: U.S. Department of Education.
National Commission on Excellence in Education. (1983). A nation at risk. Washington, DC:
U.S. Department of Education.
National Council on Education Standards and Testing. (1992). Raising standards for Ameri-
can education. Washington, DC: Author.
National Research Council. (1999a). Grading the nation’s report card: Evaluating NAEP and
transforming the assessment of educational progress. Washington, DC: National Academy Press.

National Research Council. (1999b). High stakes: Testing for tracking, promotion, and gradua-
tion. Washington, DC: National Academy Press.
National Research Council. (2001). Knowing what students know: The science and design of
educational assessment. Washington, DC: National Academy Press.
O’Day, J. (2002). Complexity, accountability, and school improvement. Harvard Educational
Review, 72(3), 293–329.
O’Day, J., Goertz, M. E., & Floden, R. E. (1995). Building capacity for education reform
(CPRE Policy Brief RB-18). Philadelphia: Consortium for Policy Research in Education.
Olson, L. (2003). Legal twists, digital turns. Education Week, 22(35), 11–14, 16.
Parkes, J., & Stevens, J. J. (2003). Legal issues in school accountability systems. Applied Mea-
surement in Education, 16, 141–158.
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J. (2003).
Perceived effects of state-mandated testing programs on teaching and learning: Findings from a
national survey of teachers. Boston: National Board on Educational Testing and Public Policy.
Phelps, R. P. (2000). Trends in large-scale testing outside the United States. Educational Mea-
surement: Issues and Practice, 19(1), 11–21.
Phillips, M., & Chin, T. (2001). Comment on Betts & Costrell. In D. Ravitch (Ed.), Brookings
papers on education policy: 2001 (pp. 61–66). Washington, DC: Brookings Institution Press.
Phillips, S. E. (2000). GI Forum v. Texas Education Agency: Psychometric evidence. Applied
Measurement in Education, 13, 343–385.
Pipho, C. (1985). Tracking the reforms, Part 5: Testing—Can it measure the success of the
reform movement? Education Week, 4(35), 19.
Popham, W. J. (1987). The merits of measurement-driven instruction. Phi Delta Kappan, 68,
679–682.
Popham, W. J., Cruse, K. L., Rankin, S. C., Sandifer, P. D., & Williams, P. L. (1985). Mea-
surement-driven instruction: It’s on the road. Phi Delta Kappan, 66, 628–634.
Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice.
Educational Researcher, 31(7), 3–14.
Public Agenda. (2000). Survey finds little sign of backlash against academic standards or stan-
dardized tests. Retrieved from http://www.publicagenda.org/issues/pcc_detail
Quality counts. (2002). Education Week, 21(17). Retrieved from http://www.edweek.com/
sreports/qc02/
Reckase, M. D. (1997, March). Consequential validity from the test developers’ perspective. Paper
presented at the annual meeting of the National Council on Measurement in Education,
Chicago.
Resnick, D. P. (1982). History of educational testing. In A. K. Wigdor & W. R. Garner
(Eds.), Ability testing: Uses, consequences, and controversies, Part II (pp. 173–194). Washing-
ton, DC: National Academy Press.
Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New tools for
educational reform. In B. R. Gifford & M. C. O’Connor (Eds.), Changing assessment:
Alternative views of aptitude, achievement, and instruction (pp. 37–75). Boston: Kluwer.
Roderick, M., & Engel, M. (2001). The grasshopper and the ant: Motivational responses of
low-achieving students to high-stakes testing. Educational Evaluation and Policy Analysis,
23, 197–227.
Roderick, M., Jacob, B. A., & Bryk, A. S. (2002). The impact of high-stakes testing in Chicago
on student achievement in promotional gate grades. Educational Evaluation and Policy
Analysis, 24, 333–357.
Roeber, E. (1988, February). A history of large-scale testing activities at the state level. Paper pre-
sented at the Indiana Governor’s Symposium on ISTEP, Madison, IN.
Rogosa, D. (2003). Confusions about consistency in improvement. Retrieved from http://www-
stat.stanford.edu/~rag/api/consist.pdf

Romberg, T. A., Zarinia, E. A., & Williams, S. R. (1989). The influence of mandated testing
on mathematics instruction: Grade 8 teachers’ perceptions. Madison: National Center for
Research in Mathematical Science Education, University of Wisconsin–Madison.
Rothman, R., Slattery, J. B., Vranek, J. L., & Resnick, L. B. (2002). Benchmarking and align-
ment of standards and testing (CSE Tech. Rep. 566). Los Angeles: Center for Research on
Evaluation, Standards, and Student Testing.
Russell, M., & Haney, W. (1997). Testing writing on computers: An experiment comparing
student performance on tests conducted via computer and via paper-and-pencil. Education
Policy Analysis Archives, 5(3). Retrieved from http://epaa.asu.edu/epaa/v5n3.html
Sadler, R. (1989). Formative assessment and the design of instructional systems. Instruc-
tional Science, 18, 119–144.
Sanders, W., & Horn, S. (1998). Research findings from the Tennessee Value-Added Assess-
ment System (TVAAS) database: Implications for educational evaluation and research.
Journal of Personnel Evaluation in Education, 12, 247–256.
Schemo, D. J. (2003, July 11). Questions on data cloud luster of Houston schools. New York
Times, p. A1.
Shavelson, R. J., Baxter, G. P., & Pine, J. (1992). Performance assessments: Political rhetoric
and measurement reality. Educational Researcher, 21(4), 22–27.
Shepard, L. (1991). Will national tests improve student learning? (CSE Tech. Rep. 342). Los
Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Shepard, L. A., & Dougherty, K. C. (1991, April). Effects of high-stakes testing on instruction.
Paper presented at the annual meeting of the American Educational Research Association
and National Council on Measurement in Education, Chicago.
Smith, M. L. (1994). Old and new beliefs about measurement-driven instruction: “The more
things change, the more they stay the same” (CSE Tech. Rep. 373). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Smith, M. L., Edelsky, C., Draper, K., Rottenberg, C., & Cherland, M. (1991). The role of test-
ing in elementary schools (CSE Tech. Rep. 321). Los Angeles: Center for Research on Evalua-
tion, Standards, and Student Testing.
Smith, M. L., Noble, A., Heinecke, W., Seck, M., Parish, C., Cabay, M., Junker, S., Haag, S.,
Tayler, K., Safran, Y., Penley, Y., & Bradshaw, A. (1997). Reforming schools by reforming
assessment: Consequences of the Arizona Student Assessment Program (ASAP): Equity and
teacher capacity building (CSE Tech. Rep. 425). Los Angeles: Center for Research on Eval-
uation, Standards, and Student Testing.
Smith, M. L., & Rottenberg, C. (1991). Unintended consequences of external testing in ele-
mentary schools. Educational Measurement: Issues and Practice, 10(4), 7–11.
Smith, M. S., & O’Day, J. (1990). Systemic school reform. In Politics of Education Associa-
tion yearbook 1990 (pp. 233–267). London: Taylor & Francis.
Smith, M. S., O’Day, J., & Cohen, D. K. (1990). National curriculum American style: Can
it be done? What might it look like? American Educator, 14(4), 10–17, 40–47.
Solano-Flores, G., & Nelson-Barber, S. (2001). On the cultural validity of science assess-
ments. Journal of Research in Science Teaching, 38, 553–573.
Spillane, J. P., & Thompson, C. L. (1997). Reconstructing conceptions of local capacity: The
local education agency’s capacity for ambitious instructional reform. Educational Evalua-
tion and Policy Analysis, 19, 185–203.
Stake, R. (1999). The goods on American education. Phi Delta Kappan, 80, 668–670.
Stecher, B. M. (2002). Consequences of large-scale, high-stakes testing on school and class-
room practices. In L. S. Hamilton, B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-
based accountability in education (pp. 79–100). Santa Monica, CA: RAND.
Stecher, B. M., & Barron, S. I. (1999). Quadrennial milepost accountability testing in Kentucky
(CSE Tech. Rep. 505). Los Angeles: Center for Research on Evaluation, Standards, and
Student Testing.

Stecher, B. M., Barron, S. I., Chun, T., & Ross, K. (2000). The effects of the Washington state
education reform on schools and classrooms (CSE Tech. Rep. 525). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Stecher, B. M., Barron, S. I., Kaganoff, T., & Goodwin, J. (1998). The effects of standards-
based assessment on classroom practices: Results of the 1996–97 RAND survey of Kentucky
teachers of mathematics and writing (CSE Tech. Rep. 482). Los Angeles: Center for Research
on Evaluation, Standards, and Student Testing.
Stecher, B. M., & Chun, T. (2001). School and classroom practices during two years of educa-
tion reform in Washington state (CSE Tech. Rep. 550). Los Angeles: Center for Research on
Evaluation, Standards, and Student Testing.
Stecher, B. M., Hamilton, L. S., & Gonzalez, G. (2003). Working smarter to leave no child
behind. Santa Monica, CA: RAND.
Stecher, B. M., & Mitchell, K. J. (1995). Portfolio driven reform: Vermont teachers’ under-
standing of mathematical problem solving (CSE Tech. Rep. 400). Los Angeles: Center for
Research on Evaluation, Standards, and Student Testing.
Stone, C. A., & Lane, S. (2003). Consequences of a state accountability program: Examining
relationships between school performance gains and teacher, student, and school variables.
Applied Measurement in Education, 16, 1–26.
Stufflebeam, D. L., Jaeger, R. M., & Scriven, M. (1991). Summative evaluation of the
National Assessment Governing Board’s inaugural 1990–91 effort to set achievement levels on
the National Assessment of Educational Progress. Kalamazoo: Western Michigan University
Evaluation Center.
Swanson, C., & Stevenson, D. L. (2002). Standards-based reform in practice: Evidence on
state policy and classroom instruction from the NAEP state assessments. Educational Eval-
uation and Policy Analysis, 24, 1–27.
Taylor, G., Shepard, L., Kinner, F., & Rosenthal, J. (2003). A survey of teachers’ perspectives
on high-stakes testing in Colorado: What gets taught, what gets lost (CSE Tech. Rep. 588). Los
Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Tepper, R. L. (2002). The influence of high-stakes testing on instructional practice in Chicago.
Unpublished doctoral dissertation, Harris Graduate School of Public Policy, University of
Chicago.
Texas Education Agency. (2000, May 17). Texas TAAS passing rates hit seven-year high; four
out of every five students pass exam (press release). Austin, TX: Author.
Tindal, G., Heath, B., Hollenbeck, K., Almond, P., & Harniss, M. (1998). Accommodating
students with disabilities on large-scale tests: An experimental study. Exceptional Children,
64, 439–450.
Tittle, C. (1989). Validity: Whose construction is it in the teaching and learning context?
Educational Measurement: Issues and Practice, 8, 5–13.
U.S. Congress, Office of Technology Assessment. (1992). Testing in America’s schools: Asking
the right questions (Publication OTA-SET-519). Washington, DC: U.S. Government Print-
ing Office.
U.S. Department of Education. (2002). No Child Left Behind fact sheet. Washington, DC:
U.S. Government Printing Office.
U.S. General Accounting Office. (2003). Title I: Characteristics of tests will influence expenses;
information sharing may help states realize efficiencies (Publication GAO-03-389). Washing-
ton, DC: Author.
Ward, J. (1980). Teachers and testing: A survey of knowledge and attitudes. In L. M. Rudner
(Ed.), Testing in our schools (pp. 48–72). Washington, DC: National Institute of Education.
Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and
science education (Research Monograph 8). Washington, DC: Council of Chief State
School Officers.

Webster, W., & Mendro, R. (1997). The Dallas Value-Added Accountability System. In
J. Millman (Ed.), Grading teachers, grading schools: Is student achievement a valid evaluation
measure? (pp. 81–99). Thousand Oaks, CA: Corwin Press.
Wenger, E. (1987). Artificial intelligence and tutoring systems. Los Altos, CA: Morgan Kaufmann.
Wiggins, G. (1992). Creating tests worth taking. Educational Leadership, 49, 26–33.
Winfield, L. F. (1990). School competency testing reforms and student achievement: Explor-
ing a national perspective. Educational Evaluation and Policy Analysis, 12, 157–173.
Wolf, S. A., Borko, H., McIver, M. C., & Elliott, R. (1999). “No excuses”: School reform efforts
in exemplary schools of Kentucky (CSE Tech. Rep. 514). Los Angeles: Center for Research
on Evaluation, Standards, and Student Testing.
Wolf, S. A., & McIver, M. C. (1999). When process becomes policy. Phi Delta Kappan, 80,
401–406.
Zernike, K. (2001, May 4). Suburban mothers succeed in their boycott of an 8th-grade test.
New York Times, p. A19.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP his-
tory assessment. Journal of Educational Measurement, 26, 55–66.
